Zoom Search Engine - International language support

Zoom has improved support for international languages, with the following features:

  • Unicode enabled indexer - This means you can now specify skip words and category names (and descriptions) in any language.
  • Support for UTF-8 searching and indexing - for websites using the UTF-8 encoding.
  • Support for international charset encodings such as windows-1251.
  • Accent insensitivity option (for Latin-based languages only): when enabled, accented characters are treated as equivalent to their non-accented counterparts (e.g. a search for "cliché" will find "cliche" and vice versa).
  • Support for search page translations. You can translate the search messages, such as "No results found" or "Search results for...", or use one of the pre-defined translations.
  • The installed package comes with translated search pages for French, German, Italian, Danish, Croatian, Swedish, Norwegian, Spanish, Dutch, Polish, Japanese, Russian, Arabic, Hebrew, Hungarian, Indonesian, Korean, Chinese, Romanian, Portuguese and Ukrainian. You can also create your own translations.
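
To illustrate what the accent insensitivity option means in practice, here is a minimal JavaScript sketch. It is not Zoom's implementation; the function names foldAccents and accentInsensitiveEquals are hypothetical, and the approach (Unicode NFD normalization followed by removal of combining marks) is only one common way to get this behaviour for Latin-based text.

```javascript
// Hypothetical helper, not part of Zoom: fold accented Latin letters to
// their base letters so that "cliché" and "cliche" compare as equal.
function foldAccents(text) {
  // NFD splits "é" into "e" + U+0301 (combining acute accent);
  // the regex then drops every combining mark in U+0300..U+036F.
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Compare two strings while ignoring accents and letter case.
function accentInsensitiveEquals(a, b) {
  return foldAccents(a).toLowerCase() === foldAccents(b).toLowerCase();
}

console.log(accentInsensitiveEquals("clich\u00e9", "cliche")); // true
```

Letters that have no decomposed form (for example "ø") are left unchanged by this sketch, which is one reason such folding suits only some languages.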

Unicode support

The indexer is now fully Unicode enabled and will support indexing and string matching on any Unicode text.

However, Windows 98 and earlier versions have only partial Unicode support, so you may encounter problems if you attempt to index Unicode text on these platforms.

Note: The JavaScript version may require IE 5.5 or above for some languages.

Tips for some common languages

European languages (French, German, Danish, Swedish, etc.)

European websites commonly use the standard English character encoding, CP1252 (also known as "windows-1252"). A few websites use "iso-8859-1" instead, which is almost the same as "windows-1252". If your website uses either of these encodings, ensure that:

  • You have specified "windows-1252" or "iso-8859-1" in the Languages tab of the Indexer configuration window.

If your website uses UTF-8, ensure that:

  • You have checked "Use Unicode (UTF-8 encoding)" in the Languages tab of the Indexer configuration window.

Russian (Cyrillic)

Russian websites commonly use one of three encodings: KOI8-R, CP1251 (also known as "windows-1251"), and UTF-8.

If your website uses KOI8-R or windows-1251, ensure that:

  • You have specified "windows-1251" or "koi-8r" in the Languages tab of the Indexer configuration window.

If your website uses UTF-8, ensure that:

  • You have checked "Use Unicode (UTF-8 encoding)" in the Languages tab of the Indexer configuration window.

Asian languages (Japanese, Chinese, etc.)

Since many East Asian languages do not have a word-delimiting character (a character that marks the end of one word and the start of the next, such as a space in most Latin-based languages), Zoom can only provide limited support for these languages.

Japanese

If your website is encoded in UTF-8, Zoom will successfully index your site and will be capable of performing searches. However, search performance and accuracy are limited, as Zoom will only split words by:

  • Formatting (spaces between words, or paragraphs, etc.)
  • Change of character type (from hiragana to katakana, etc.)

This means that an entire sentence may be indexed as a single "word". However, if you enable "Substring match for all searches" on the "Languages" tab of the Configuration window, then search terms that appear within a sentence will match correctly.
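
The two splitting rules above can be sketched in JavaScript as follows. This is an illustration only, not Zoom's actual indexer code: the function names scriptOf and splitByScript are hypothetical, and the character classes (hiragana, katakana, Han, other) are an assumption about what "character type" covers.

```javascript
// Hypothetical classifier: decide which character type a character belongs to.
function scriptOf(ch) {
  if (/\s/u.test(ch)) return "space";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  if (/\p{Script=Han}/u.test(ch)) return "han";
  return "other";
}

// Split text into "words" at whitespace and wherever the character type changes.
function splitByScript(text) {
  const words = [];
  let current = "";
  let currentType = null;
  for (const ch of text) {
    const type = scriptOf(ch);
    if (type === "space") {
      // Formatting boundary: close the current word, if any.
      if (current) words.push(current);
      current = "";
      currentType = null;
      continue;
    }
    if (currentType !== null && type !== currentType) {
      // Character type changed (e.g. hiragana to katakana): start a new word.
      words.push(current);
      current = "";
    }
    current += ch;
    currentType = type;
  }
  if (current) words.push(current);
  return words;
}

console.log(splitByScript("さくらカメラ")); // two words: hiragana run, katakana run
console.log(splitByScript("これはさくら")); // one word: no boundary was found
```

The second call shows why a whole phrase can end up as one indexed "word": it contains neither a space nor a change of character type.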

Zoom does not currently support indexing Shift-JIS pages. You will have to convert your website to UTF-8 if you wish to use it with Zoom.

Chinese

Zoom now supports indexing Chinese pages in Big5, GB2312 or UTF-8 encoding. However, search performance and accuracy are limited, as Zoom will only be able to split "words" based on formatting.

If you are indexing a Chinese website, you should enable the "Substring match for all searches" option on the "Languages" tab of the Configuration window. This will then allow for matching of words which have not been split correctly.

You should also enable the "Support single-case languages" option, also on the "Languages" tab of the Configuration window, since Chinese has no upper or lower case.
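
The difference between whole-word matching and substring matching can be shown with a short JavaScript sketch. It does not reflect how Zoom stores or searches its index; the names indexedWords, exactMatch and substringMatch are hypothetical, and the sample entry simply stands in for a phrase that was indexed as one unsplit "word".

```javascript
// Hypothetical index: one entry per "word" as the indexer split it.
// Here a whole phrase ("we provide a search engine") was indexed as one word.
const indexedWords = ["我们提供搜索引擎"];

// Whole-word matching: the query must equal an indexed word exactly.
function exactMatch(words, query) {
  return words.includes(query);
}

// Substring matching: the query may appear anywhere inside an indexed word.
function substringMatch(words, query) {
  return words.some((word) => word.includes(query));
}

const query = "搜索"; // "search"
console.log(exactMatch(indexedWords, query));     // false
console.log(substringMatch(indexedWords, query)); // true
```

Substring matching finds the query inside the unsplit phrase, at the cost of also matching fragments that are not real words.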

Notes for using GB2312 with the JavaScript option: Due to issues with browser support for GB2312 in JavaScript (JS), it may be necessary to create a custom search form that encodes the search query so that it can be correctly decoded in JS. If you are using GB2312 and the JS search option, you may need to use the following search form HTML to submit your query correctly:

<form method="get" action="search.html" onsubmit="window.location='search.html?zoom_query='+escape(this.zoom_query.value); return false;">
<input type="text" name="zoom_query" size="20"><br><br>
<input type="submit" value="Search">
</form>
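
To see what the escape() call in the form above does to a Chinese query, consider the following JavaScript sketch. It only demonstrates the standard built-in functions escape(), unescape() and encodeURIComponent(); how the Zoom search script decodes the query is not shown here and is not assumed.

```javascript
// A sample query: the two characters U+4E2D U+6587 ("Chinese").
const query = "\u4e2d\u6587";

// escape() writes each non-Latin-1 character as %uXXXX, using its Unicode
// code unit, so the result does not depend on the page's charset.
const escaped = escape(query);
console.log(escaped); // %u4E2D%u6587

// unescape() is the matching built-in that reverses the encoding.
console.log(unescape(escaped) === query); // true

// For comparison, encodeURIComponent() always produces UTF-8 percent-encoding.
console.log(encodeURIComponent(query)); // %E4%B8%AD%E6%96%87
```

Note that escape() and unescape() are legacy functions; they are used here because the form above relies on them.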

Croatian

Croatian websites use a number of different encodings, including windows-1250 and UTF-8. Zoom will successfully index and search Croatian sites; however, there are a few known issues with some Croatian diacritic characters.

First of all, if you are using an encoding/charset other than UTF-8 (such as windows-1250) AND you are using the JavaScript search option, then some searches may fail if they contain certain diacritic characters (e.g. those written as HTML entities such as "&scaron;" and "&ccaron;"). Note that this is not an issue if you are using UTF-8 encoding on your website (and Zoom is configured accordingly). Note also that none of this applies if you are using one of the other search platforms available, namely PHP, ASP or CGI.

Another known issue is that the "Jump to highlighting" feature may also fail to work for words containing the aforementioned diacritic characters.
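
For readers unfamiliar with the HTML entities mentioned above, the following JavaScript sketch shows how an entity such as "&scaron;" relates to the character it represents. It is purely illustrative and is not a fix for the issues described; the names CROATIAN_ENTITIES and decodeCroatianEntities are hypothetical, and the table covers only a few entities.

```javascript
// Hypothetical lookup table: a few HTML entities and the characters they stand for.
const CROATIAN_ENTITIES = new Map([
  ["&scaron;", "\u0161"], // š
  ["&Scaron;", "\u0160"], // Š
  ["&ccaron;", "\u010d"], // č
  ["&Ccaron;", "\u010c"], // Č
  ["&zcaron;", "\u017e"], // ž
  ["&Zcaron;", "\u017d"], // Ž
]);

// Replace known entities with their characters; leave anything else untouched.
function decodeCroatianEntities(text) {
  return text.replace(/&[A-Za-z]+;/g, (entity) =>
    CROATIAN_ENTITIES.has(entity) ? CROATIAN_ENTITIES.get(entity) : entity
  );
}

// "&scaron;kola" and "škola" (school) are the same word once decoded.
console.log(decodeCroatianEntities("&scaron;kola") === "\u0161kola"); // true
```

A page that writes the word with an entity and a query typed with the real character only compare as equal after this kind of decoding.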

Arabic

Arabic websites use a number of different encodings, including windows-1255, windows-1256 and UTF-8. Zoom will successfully index and search Arabic sites; however, there are a few known issues with some Arabic diacritic characters.

First, there is an option to "Strip Arabic diacritic marks from words", which helps when searching documents that contain Arabic diacritic characters that most users do not type in their queries. You can find this option on the "Languages" tab of the Configuration window, and we recommend enabling it for Arabic searches.

Another known issue is that the "Jump to highlighting" feature may also fail to work for words containing the aforementioned diacritic characters.
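
The idea behind stripping Arabic diacritic marks can be sketched in JavaScript as follows. This is not Zoom's implementation: the function name stripArabicDiacritics is hypothetical, and the range U+064B to U+0652 (the common short-vowel and related marks) is an assumption about which marks are removed.

```javascript
// Hypothetical helper: remove Arabic diacritic marks (assumed here to be
// the combining marks U+064B..U+0652) and keep the base letters.
function stripArabicDiacritics(text) {
  return text.replace(/[\u064B-\u0652]/g, "");
}

// A word written with diacritics, and the same word as users usually type it.
const withMarks = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F";
const withoutMarks = "\u0645\u062D\u0645\u062F";

console.log(stripArabicDiacritics(withMarks) === withoutMarks); // true
```

With the marks removed from both the indexed text and the query, a search typed without diacritics can match a document written with them.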

Setting the locale on your webserver

Whether or not you need to change the locale setting depends on your web server environment settings, so you should ask your web host for more information on using foreign languages on their installed platform if you are uncertain.

For PHP users

The following is an example of setting the Russian locale on a Windows-based PHP server:

if (setlocale(LC_ALL, "rus") === false) // for Russian; setlocale() returns false on failure
    print("Locale setting invalid");

To modify the "search.php" file permanently, click on "Templates"->"Modify search script source code" in the Zoom Indexer window. Alternatively, save a customized copy of the script somewhere and specify the path to this customized script on the "Advanced" tab of the Configuration window.

Locale names can be found here. Windows servers use different locale names, and their list can be found here. More information on setlocale(...) is available on php.net.

For ASP/IIS users

On some IIS servers, the locale and regional settings of the server may conflict with the execution of the ASP script. This will cause some characters to appear incorrectly on your search page, despite having the correct charset specified on your search template, and the corresponding encoding on the "Languages" tab of the Configuration window.

In such cases, you may need to add the following preprocessing directive to the ASP script. Note that this line must be added as the very first line of the "search.asp" file:

<%@ CODEPAGE=1252%>

To modify the "search.asp" file permanently, click on "Templates"->"Modify search script source code" in the Zoom Indexer window. Alternatively, save a customized copy of the script somewhere and specify the path to this customized script on the "Advanced" tab of the Configuration window.

Note: If you are using your own ASP search page, and you are embedding "search.asp" as described in this FAQ, then you will need to add the above line at the top of your custom ASP search page and NOT in "search.asp" (since "search.asp" will only be executed AFTER your page, and this line needs to be the first line of ASP executed).

Known Issues

  • When indexing international text in plugin-supported file formats such as PDF and XLS, it is recommended to use UTF-8 encoding.
  • There are known issues with highlighting words in some languages and encodings. This may cause highlighting to vary depending on case sensitivity and the context description available.
  • There are issues with indexing some international characters from DOC files. This is usually when the Language setting is not specified within the DOC file itself using MS Word. Currently we are only aware of the Polish language having such issues.
  • There are known issues with Czech characters when searching with the ASP version on IIS servers.
  • There are known issues with case sensitivity for some Polish and Greek characters in the ASP and CGI versions. We are currently investigating these issues for possible solutions.
  • There are known issues with case sensitivity for some Cyrillic characters in the ASP script: some words containing upper case characters cannot be found. This appears to be a bug in how the ASP scripting language converts Cyrillic text to lower case. Switching to the CGI option should help.
  • Zoom does not currently support indexing Shift-JIS. You will have to convert your website to UTF-8 if you wish to use it with Zoom.
  • When indexing pages that contain UTF-8 decomposed accents (sometimes called UTF-8-MAC) on Windows XP, some accented characters are not handled correctly. This is not an issue with normal pre-composed UTF-8 accented characters, nor is it an issue on Vista and Windows 7.
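
The difference between decomposed and pre-composed accents mentioned in the last item can be seen in a short JavaScript sketch. It only demonstrates the standard String.prototype.normalize() method and makes no claim about how Zoom handles such text; converting pages to the pre-composed form (NFC) before indexing is a possible workaround, not a documented one.

```javascript
// The same visible word, "café", in two different Unicode forms.
const decomposed = "cafe\u0301";  // "e" followed by U+0301 (combining acute accent)
const precomposed = "caf\u00e9";  // single pre-composed character U+00E9

console.log(decomposed === precomposed);                  // false
console.log(decomposed.length, precomposed.length);       // 5 4

// NFC normalization turns the decomposed form into the pre-composed one.
console.log(decomposed.normalize("NFC") === precomposed); // true
```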