StringMatch Pro - Extract terminology and repetitive strings with ease



Deep analysis in Repetitions mode

StringMatch Pro analyzes the file using a two-pass process.
Once Pass 1 has found all repetitive strings, Pass 2 performs deep analysis on the Pass 1 results to identify their internal structure, looking for repetitions within them and overlaps between them. StringMatch Pro uses this fine-grained information to present the final results in a more meaningful form.

As a simple example of deep analysis, consider this 16-word source string:

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

StringMatch Pro will first find that the string '1 2 3 4 1 2 3 4 1 2 3 4' occurs twice: once starting at the beginning, and once starting at the second '1 2 3 4'.
This is correct, but not very useful. Pass 2 analysis will find the '1 2 3 4' cycle, so the program will report that '1 2 3 4' occurs 4 times, a much more meaningful result.
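
The cycle detection described above can be illustrated with a minimal Python sketch. It is not StringMatch Pro's actual algorithm, which is not documented here; the function names shortest_period and count_occurrences are hypothetical. The sketch takes the 12-word result from the first pass, reduces it to its shortest repeating unit, and counts that unit in the source string.

```python
def shortest_period(words):
    """Return the length of the shortest unit whose repetition yields `words`."""
    n = len(words)
    for p in range(1, n + 1):
        # p is a period if it divides n and every word equals the word
        # at the same position within the first unit.
        if n % p == 0 and all(words[i] == words[i % p] for i in range(n)):
            return p
    return n


def count_occurrences(source, unit):
    """Count non-overlapping occurrences of `unit` in `source` (lists of words)."""
    count, i, k = 0, 0, len(unit)
    while i + k <= len(source):
        if source[i:i + k] == unit:
            count += 1
            i += k  # skip past the match so occurrences do not overlap
        else:
            i += 1
    return count


source = "1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4".split()
pass1_result = source[:12]  # the 12-word string found twice (overlapping)

period = shortest_period(pass1_result)
unit = pass1_result[:period]
print(" ".join(unit), "occurs", count_occurrences(source, unit), "times")
# prints: 1 2 3 4 occurs 4 times
```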

Fuzzy matching

StringMatch Pro provides fuzzy matching with two different mechanisms: the alphabet file and the notion of 'synonyms'.

Alphabet file
The alphabet file is a user-specified text file listing all characters you want StringMatch Pro to accept as parts of words. Any character in the text file that is not present in the alphabet file is discarded, so strings that differ only in such characters are reported as identical.
If, for example, the alphabet file contains no digits, then the strings '6 men', '200 men', 'men 23' and 'men' will be reported as identical.
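
As a rough illustration of this filtering, the Python sketch below drops every character that is not in the alphabet and then compares what remains. The normalize function is hypothetical, and the sketch assumes that whitespace separates words and that matching is case-sensitive; the real program may behave differently on both points.

```python
def normalize(text, alphabet):
    """Discard characters not in `alphabet`; drop words that become empty."""
    words = []
    for raw in text.split():
        kept = "".join(ch for ch in raw if ch in alphabet)
        if kept:
            words.append(kept)
    return " ".join(words)


# Assumed alphabet: lowercase letters only, so all digits are discarded.
alphabet = set("abcdefghijklmnopqrstuvwxyz")

samples = ["6 men", "200 men", "men 23", "men"]
normalized = {normalize(s, alphabet) for s in samples}
print(normalized)  # prints: {'men'}
```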

Synonyms file
The synonyms file is a user-specified text file listing strings of synonyms.
The way the program uses synonyms depends on the mode it is in.

In Repetitions mode, strings differing only in synonyms will be reported as identical.
For instance, if 'phone' and 'fax' are defined as synonyms of 'device', then the strings 'This is a phone' and 'This is a fax' will be reported as 'This is a device' occurring twice.
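
The effect can be sketched in Python as follows. The dictionary below is only a stand-in for the synonyms file, whose actual format is not described here, and canonicalize is a hypothetical helper: each word is replaced by the word it is a synonym of before the strings are counted.

```python
from collections import Counter

# Assumed in-memory form of the synonyms file: synonym -> replacement word.
synonyms = {"phone": "device", "fax": "device"}


def canonicalize(text, synonyms):
    """Replace every word that has a synonym entry with its replacement."""
    return " ".join(synonyms.get(word, word) for word in text.split())


strings = ["This is a phone", "This is a fax"]
counts = Counter(canonicalize(s, synonyms) for s in strings)
print(counts["This is a device"])  # prints: 2
```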

In Terms mode, synonyms are discarded.
This mode enables you to extract the terminology of your text file. Simply specify unneeded words as synonyms, and the output will contain only strings of words you judged to be important. This elimination process is made painless by sorting options that let you lift whole blocks of unneeded words from the output into the synonyms file; see the example.
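
The following Python sketch shows one way such elimination could work. It assumes that each discarded word ends the current candidate string, so only runs of remaining words are reported; this is an assumption for illustration, not a description of the program's internals, and candidate_terms and the unneeded set are hypothetical.

```python
def candidate_terms(text, unneeded):
    """Split `text` into runs of words, breaking at every unneeded word."""
    terms, run = [], []
    for word in text.split():
        if word.lower() in unneeded:
            if run:  # an unneeded word closes the current run
                terms.append(" ".join(run))
                run = []
        else:
            run.append(word)
    if run:
        terms.append(" ".join(run))
    return terms


# Assumed content of the synonyms file in Terms mode: words to discard.
unneeded = {"the", "is", "a", "of", "to", "and"}

text = "the alphabet file is a list of characters"
terms = candidate_terms(text, unneeded)
print(terms)  # prints: ['alphabet file', 'list', 'characters']
```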