Suggestions, please, for software that can take a document file and extract its unique words and phrases, ignoring common words and expressions (how / when / where / why, etc.).
I'm preparing a master index that will contain links to the documents themselves but, for space and (in some cases) copyright reasons, should not duplicate the content of those files.
The master index will have built-in search capability, but it can't 'see' the content of the referenced files. Our workaround so far is to batch-index the files and include a (much shorter) list of indexed words with each link.
As an example, searching for 'fly + fishing' should then turn up a respectably short but relevant list of possible links, even if one of the documents is named compleatangler.pdf.
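
To make the idea concrete, here is a rough Python sketch of the kind of thing I mean. It is only an illustration, not a recommendation of any particular tool. It assumes the text has already been pulled out of each file (PDF-to-text conversion would be a separate step), it handles single words only rather than phrases, and the stop-word list, the function names `extract_keywords` and `search`, and the sample texts are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {
    "a", "an", "and", "as", "for", "how", "in", "is", "it", "of", "on",
    "or", "that", "the", "to", "what", "when", "where", "why", "with",
}

def extract_keywords(text, max_words=50):
    """Return up to max_words distinct non-stop words, most frequent first."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Drop stop words and very short tokens before counting.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_words)]

def search(index, *terms):
    """Return the links whose keyword list contains every search term."""
    wanted = {t.lower() for t in terms}
    return sorted(link for link, kws in index.items() if wanted <= set(kws))

# Invented sample texts standing in for the extracted content of two files.
sample_texts = {
    "compleatangler.pdf": "How to fly fish: fishing with the fly, and where the angler finds trout.",
    "gardening.pdf": "When to plant roses and why the fly is a garden pest.",
}

# The master index stores only the short keyword list next to each link.
index = {link: extract_keywords(text) for link, text in sample_texts.items()}

print(search(index, "fly", "fishing"))
```

With these invented samples, only compleatangler.pdf has both 'fly' and 'fishing' in its keyword list, so it is the only link returned, even though neither word appears in the filename.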