StringMatch Pro - Extract terminology and repetitive strings with ease

   

Terminology extraction -- A hands-on example

Let's assume we want to extract the terminology of this file:

Surgeon Console
Using the da Vinci Surgical System, the surgeon operates seated comfortably at a console while viewing a high definition, 3D image inside the patient's body.
The surgeon's fingers grasp the master controls below the display with hands and wrists naturally positioned relative to his or her eyes.
The system seamlessly translates the surgeon's hand, wrist and finger movements into precise, real-time movements of surgical instruments.

Patient-side Cart
The patient-side cart is where the patient is positioned during surgery. It includes either three or four robotic arms that carry out the surgeon's commands.
The robotic arms move around fixed pivot points which reduces trauma to the patient, improves the cosmetic outcome, and increases overall precision.
The system requires that every surgical maneuver be under the direct control of the surgeon. Repeated safety checks prevent any independent movement of the instruments or robotic arms.

We need to specify an alphabet file for StringMatch Pro, which could look like this:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
'

For the sake of this example, I assume that we don't care about numbers, so the alphabet file doesn't contain any.
Also, the alphabet file doesn't need to look this orderly (although it doesn't hurt...).
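To make the role of the alphabet file more tangible, here is a minimal Python sketch. It is not StringMatch Pro's actual code; it only assumes that every character missing from the alphabet ends the current word. The function name tokenize and the ALPHABET constant are made up for this illustration.

```python
def tokenize(text: str, alphabet: str) -> list[str]:
    """Split text into words; any character outside the alphabet ends a word."""
    allowed = set(alphabet)
    words, current = [], []
    for ch in text:
        if ch in allowed:
            current.append(ch)
        elif current:
            # A character that is not in the alphabet closes the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Same characters as in the alphabet file above: letters plus the apostrophe.
ALPHABET = "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "'"

sample = "a high definition, 3D image inside the patient's body."
print(tokenize(sample, ALPHABET))
```

Because the digit 3 is not in the alphabet, '3D' contributes only the single letter 'D', while the apostrophe keeps "patient's" in one piece.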

We can now start the program, set Mode to Terms and Len to 1. I also unchecked Case, checked Bare, and selected Length for output sorting.
Results of the first run (no words specified to be discarded yet):

instruments independent comfortably positioned translates seamlessly increases movements naturally patient's precision surgeon's operates maneuver surgical relative cosmetic improves movement repeated commands requires controls includes robotic fingers surgery display patient outcome surgeon precise viewing overall control prevent console reduces system either master seated safety during finger direct checks wrists points inside trauma around image which vinci while where wrist below under using every three pivot grasp fixed hands carry real body move that arms four cart side with eyes time hand high into and his the any out her at or to da it is of be a d

Results are sorted descending by length, so all short words are at the end.
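A rough Python equivalent of this first run (single words, case ignored, sorted by length) could look like the sketch below. The function name is invented, the regular expression stands in for the alphabet file, and the alphabetical order among words of equal length is my own choice; StringMatch Pro may order ties differently.

```python
import re


def unique_words_by_length(text: str) -> list[str]:
    """Return each distinct word once, longest first."""
    # Letters and the apostrophe form words, as in the alphabet file.
    words = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
    # Sort by descending length; ties are broken alphabetically (an assumption).
    return sorted(words, key=lambda w: (-len(w), w))


sample = ("The system requires that every surgical maneuver "
          "be under the direct control of the surgeon.")
result = unique_words_by_length(sample)
print(" ".join(result))
```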
Now let's create a synonyms file listing the words to discard. As a first step, we discard the short words we don't need. Most of the trailing part of the output, starting with 'into', qualifies, but we should not discard 'da' (part of the product name 'da Vinci'). (If you wonder what the last word in the output is: it comes from the source string '3D', but remember, we didn't enter numbers in the alphabet file. '3D' may be a keeper in other circumstances, but it is not specific to our text, so I'll discard it for this analysis.)
So our first incarnation of the synonyms file could look like this:

*trash1 into and his the any out her at or to it is of be a d

Running the program with this synonyms file specified will yield:

independent comfortably instruments translates seamlessly positioned definition increases surgeon's patient's precision movements naturally improves includes movement cosmetic operates surgical requires maneuver relative controls repeated commands surgeon control console display overall outcome viewing reduces fingers prevent robotic surgery patient precise checks safety inside seated trauma direct points during finger around either system wrists master under every carry vinci three image fixed below where which grasp hands using while wrist pivot real move high that hand four cart side body time eyes arms with da
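The effect of the discard list can be sketched in a few lines of Python. This assumes that the first field of each line (such as *trash1) is just a label and that all remaining words on the line are dropped from the output; the function name load_discard_words is hypothetical, and the real file format may offer more than this.

```python
def load_discard_words(synonyms_text: str) -> set[str]:
    """Collect the words listed after a '*label' at the start of each line."""
    discard = set()
    for line in synonyms_text.splitlines():
        fields = line.split()
        # Assumption: a leading field like "*trash1" only names the set.
        if fields and fields[0].startswith("*"):
            discard.update(word.lower() for word in fields[1:])
    return discard


synonyms = "*trash1 into and his the any out her at or to it is of be a d\n"
discard = load_discard_words(synonyms)

candidates = ["with", "da", "d", "into", "eyes"]
kept = [word for word in candidates if word not in discard]
print(kept)
```

Note that 'da' survives because it is not in the list, while the stray 'd' from '3D' is removed.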

We could pick further trash words from the results one by one, but there is a better way: let's re-run the program, this time selecting AbcRev for sorting. This is what we'll get:

da trauma cosmetic robotic positioned repeated seated fixed hand around side inside three image while console time outcome where precise relative move during using viewing which high with vinci surgical real overall control system surgeon precision definition grasp under finger either master maneuver four surgeon's patient's hands commands reduces includes requires increases translates operates improves eyes checks controls arms fingers movements instruments points wrists that direct independent patient movement prevent pivot cart wrist below display body comfortably naturally seamlessly surgery every carry safety
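The order shown here is consistent with sorting each word by its spelling read from right to left. A minimal Python sketch of that idea follows (the function name sort_by_ending is invented, and this is an illustration of the principle, not the program's own code):

```python
def sort_by_ending(words: list[str]) -> list[str]:
    """Sort distinct words alphabetically by their reversed spelling."""
    return sorted(set(words), key=lambda word: word[::-1])


words = ["reduces", "comfortably", "hands", "surgeon's",
         "naturally", "includes", "patient's", "seamlessly"]
ordered = sort_by_ending(words)
print(" ".join(ordered))
```

Words with the same ending, such as the possessives in 's or the adverbs in -ly, end up next to each other.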

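The output is backward-sorted: the words are ordered alphabetically by their endings, reading each word from right to left, so words with similar endings appear next to each other. This lets us copy whole groups at once: surgeon's patient's / hands commands reduces includes requires increases translates operates improves / comfortably naturally seamlessly (the slashes separate the groups).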
Our new synonyms file is:

*trash1 into and his the any out her at or to it is of be a d
*trash2 surgeon's patient's hands commands reduces includes requires increases translates operates improves comfortably naturally seamlessly

Note that the second set of synonyms above may appear to span several lines, but that is only how it is displayed here. In the actual synonyms file, each set of synonyms must be on a single line of its own.
Re-running the program with the extended synonyms file will yield:

da trauma cosmetic robotic positioned repeated seated fixed hand around side inside three image while console time outcome where precise relative move during using viewing which high with vinci surgical real overall control system surgeon precision definition grasp under finger either master maneuver four eyes checks controls arms fingers movements instruments points wrists that direct independent patient movement prevent pivot cart wrist below display body surgery every carry safety

At this point, we already have a higher concentration of important terms in the output, but there is still work to be done.

With a larger text, it is advisable to continue by first selecting a fixed Length of at least 3 to see the remaining text in context, and only select Auto after some more elimination. However, with a small sample like this, let us jump straight to selecting Auto length, without changing any other parameters.
This is what we get:

trauma
seated
hand
image inside
either three
surgeon console
cosmetic outcome
where
precise
positioned relative
using
console while viewing
robotic arms move around fixed pivot points which
display with
direct control
da vinci surgical system
overall precision
high definition
fingers grasp
under
that every surgical maneuver
eyes
real-time movements
finger movements
surgical instruments
wrists
independent movement
repeated safety checks prevent
patient-side cart
wrist
master controls below
body
positioned during surgery
four robotic arms that carry
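How StringMatch Pro builds these phrases internally is not described here. The output above is at least consistent with collecting runs of neighbouring words that are interrupted by discarded words and by punctuation. The Python sketch below illustrates that assumption only; the function name phrases is invented, hyphens are simply treated like spaces, and no attempt is made to reproduce the program's output exactly.

```python
import re


def phrases(text: str, discard: set[str]) -> list[str]:
    """Collect runs of consecutive kept words, in order of first appearance."""
    runs, current = [], []
    # A token is either a word (letters, apostrophe) or a single breaking
    # character such as a comma, a period or a digit. Spaces and hyphens
    # are skipped, so they do not break a run.
    for token in re.findall(r"[A-Za-z']+|[^A-Za-z'\s-]", text):
        word = token.lower()
        is_word = re.fullmatch(r"[a-z']+", word) is not None
        if is_word and word not in discard:
            current.append(word)
        elif current:
            # A discarded word or a breaking character closes the run.
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    # Drop repeated phrases while keeping the original order.
    return list(dict.fromkeys(runs))


discard = set("the at a d operates comfortably patient's".split())
sample = ("Using the da Vinci Surgical System, the surgeon operates seated "
          "comfortably at a console while viewing a high definition, "
          "3D image inside the patient's body.")
result = phrases(sample, discard)
for phrase in result:
    print(phrase)
```

For this one sentence the sketch also reports 'surgeon' on its own, which the program's list above does not contain, so the real selection rules evidently go further than this.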

This run was only meant to show the words in context, so that we don't accidentally discard important ones. For instance, we recognize that 'patient-side cart' and 'da vinci surgical system' are candidates for the glossary, so we must not discard any words that appear in these terms. But there is still trash to weed out, so let's re-run the program with the same synonyms file, Len set to 1, Bare checked, and sorting set to AbcRev. We get:

da trauma cosmetic robotic positioned repeated seated fixed hand around side inside three image while console time outcome where precise relative move during using viewing which high with vinci surgical real overall control system surgeon precision definition grasp under finger either master maneuver four eyes checks controls arms fingers movements instruments points wrists that direct independent patient movement prevent pivot cart wrist below display body surgery every carry safety

Having seen the context in the previous step, we can now safely select positioned repeated seated fixed / during using viewing / precision / eyes checks arms fingers movements instruments points wrists that direct independent / every carry safety in blocks as indicated by the slashes. So let's add these further words to the synonyms file:

*trash1 into and his the any out her at or to it is of be a d
*trash2 surgeon's patient's hands commands reduces includes requires increases translates operates improves comfortably naturally seamlessly
*trash3 positioned repeated seated fixed during using viewing precision eyes checks arms fingers movements instruments points wrists that direct independent every carry safety

(We didn't select 'patient', since the Auto run showed that 'patient-side cart' is a needed term. We also recognized further important terms such as 'surgeon console' and 'master controls'.)

Re-run the program with the extended synonyms file, selecting Auto length again. The output will be this:

trauma
four robotic
hand
move around
image inside
either three
console while
surgeon console
real-time
cosmetic outcome
where
precise
relative
which
display with
overall
control
da vinci surgical system
high definition
grasp
under
finger
surgical maneuver
movement
prevent
pivot
patient-side cart
wrist
master controls below
body
surgery

We can now pick off most of the remaining trash words by extending the synonyms file with these words: four move around either three inside while where which with overall under movement prevent below relative precise grasp wrists.
So our (almost) final synonyms file will look like this:

*trash1 into and his the any out her at or to it is of be a d
*trash2 surgeon's patient's hands commands reduces includes requires increases translates operates improves comfortably naturally seamlessly
*trash3 positioned repeated seated fixed during using viewing precision eyes checks arms fingers movements instruments points wrists that direct independent every carry safety
*trash4 four move around either three inside while where which with overall under movement prevent below relative precise grasp wrists

Running the program with this synonyms file will yield:

trauma
robotic
hand
image
surgeon console
real-time
cosmetic outcome
control
da vinci surgical system
high definition
finger
surgical maneuver
master controls
pivot
patient-side cart
wrist
display
body
surgery

We are almost done, and I'll leave finishing the task to you.

Should you find that you mistakenly discarded an important word, simply delete it from the synonyms file and run StringMatch Pro again.

This lengthy description may make the process look complicated, but if you do the above example or a similar one yourself, you will find that the process is straightforward, fast, easy, and leaves you in complete control throughout.

Also, the process and the ease of use are the same even if your file contains millions of words.