DjVuSearch Ships with Non-English-Language Packs
by James Rile, PlanetDjVu, August, 2001
DjVuSearch is now supplied with stemming rules and noise-word files for languages other than English (US).
Stemming rule files and noise word files are now supplied for Danish, Dutch, Finnish, German, Italian, Norwegian, Polish, Portuguese, Spanish, and Swedish. The stemming rules also include a unique bilingual (French/English) file, which enables search expansion on indexes and documents containing a mix of French and English text.
Test files are included to check the operation of stemming in all the supplied languages.
Stemming is a search expansion option which is 'on' by default in DjVuSearch, because stemming is almost always useful when making a search and adds little to the time the search takes.
Without stemming, DjVuSearch does not automatically find plurals of words entered in a search request. For example, if you search for printer you will not find documents containing the word printers. With the stemming option selected, it will find plurals and many other variations (e.g. printing, printed) automatically.
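To make the idea concrete, here is a minimal sketch of stemming as search expansion, written in Python. It is not DjVuSearch's implementation or its rule file format: the suffix list and the names `SUFFIXES`, `stem` and `matches` are hypothetical, chosen only to reproduce the printer example above.

```python
# Toy English suffix rules, longest first. Hypothetical: these are NOT the
# rules shipped with DjVuSearch, only an illustration of the principle.
SUFFIXES = ("ers", "ing", "er", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def matches(query, document_words, stemming=True):
    """Return the document words that match a single query term."""
    if not stemming:
        # Exact (case-insensitive) matching only.
        return [w for w in document_words if w.lower() == query.lower()]
    # With stemming, two words match when they reduce to the same stem.
    target = stem(query)
    return [w for w in document_words if stem(w) == target]


words = ["printer", "printers", "printing", "printed", "paper"]
exact_hits = matches("printer", words, stemming=False)
stemmed_hits = matches("printer", words)
```

With stemming off, only the exact word printer is found; with it on, printers, printing and printed are found as well, because all four reduce to the same stem. Rules this crude also produce odd stems for some words, which is one reason real rule files are written per language.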
However, if you are searching documents written in other languages, the English stemming rules will cause you to miss many word variations (e.g. verb and noun changes with gender) which do not occur in English, and you may find that unrelated words are matched in error.
Furthermore, the English noise word list, which is designed to remove unwanted English words from your index to keep the index size small, is not suitable for other languages; your indexes will contain many words which will not be useful in searches and which will add to the size of your indexes.
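The effect of a mismatched noise word list can be shown with a short Python sketch. The word lists and the names `NOISE_WORDS` and `index_terms` are hypothetical and far smaller than any real list; they are not taken from the files supplied with DjVuSearch.

```python
# Tiny illustrative noise word lists. Hypothetical: not the contents of
# the files supplied with DjVuSearch.
NOISE_WORDS = {
    "english": {"the", "and", "of", "a", "is"},
    "german": {"der", "die", "das", "und", "ist"},
}


def index_terms(text, language):
    """Return the words that would be indexed after noise word removal."""
    noise = NOISE_WORDS[language]
    return [w for w in text.lower().split() if w not in noise]


german_text = "Der Drucker und das Papier"
with_english_list = index_terms(german_text, "english")
with_german_list = index_terms(german_text, "german")
```

Using the English list on the German sentence leaves all five words in the index, while the German list keeps only drucker and papier, the two words a user is likely to search for.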
The solution is to use the supplied language-specific files in place of the default US English files.
Stemming rule files and noise word files for languages not already supported will be developed on request. Please enquire with PlanetDjVu if the language you are indexing is not already in the included set of supported languages.