Indexing Enormous Sites - Hints and Tips


Indexing huge amounts of data can be a big job.

This page covers a number of hints and tips that should make the job quicker and easier. It also includes a real-life case study of indexing all of Wikipedia and information about dedicated search appliances.

For the purposes of this article, an enormous site is one that has 500,000 or more pages, and/or 500 million or more words.

File and Folder Structure

Store source data in an organized manner

The content to be indexed should be organized into a practical, multi-tiered folder structure, with any individual folder containing no more than a few thousand subfolders or files. This organization is recommended because NTFS can become very slow, or even appear unresponsive, when a single folder contains too many files.
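If you need to reorganize a flat dump of files yourself, the sketch below shows one way to do it in Python. It is a minimal illustration, not part of Zoom: the paths are hypothetical, and the three-letter depth simply mirrors the .../z/o/m/ layout used by the Wikipedia dump in the case study below.

```python
import shutil
from pathlib import Path

def bucket_path(root: Path, filename: str, depth: int = 3) -> Path:
    """Build a nested path such as root/z/o/m from the first characters
    of the filename, falling back to '_' for short or non-alphanumeric names."""
    padded = (filename.lower() + "_" * depth)[:depth]
    parts = [c if c.isalnum() else "_" for c in padded]
    return root.joinpath(*parts)

def organise(flat_dir: Path, tiered_root: Path) -> None:
    """Move every file in a flat dump into the tiered layout, keeping each
    individual folder well below the few-thousand-entries guideline."""
    for f in flat_dir.iterdir():
        if f.is_file():
            target = bucket_path(tiered_root, f.name)
            target.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(target / f.name))

# Hypothetical paths:
# organise(Path(r"D:\dump\flat"), Path(r"D:\dump\tiered"))
```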

Disable 8.3 filenames

Another way of improving NTFS performance is to disable the creation of 8.3 filenames and directories. Although doing this can improve directory enumeration performance, some 16-bit applications may no longer be able to find files and directories with long filenames; this should not be an issue on a 64-bit OS, which cannot run 16-bit applications anyway. See Microsoft Knowledge Base article 121007 for help on how to disable this feature.
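If you prefer to script the change, the sketch below uses the standard Windows fsutil command (the same setting that KB121007 describes via the NtfsDisable8dot3NameCreation registry value). It must be run from an elevated prompt and only affects files created after the change; treat it as an illustration rather than a turnkey script.

```python
import subprocess

# Disable creation of 8.3 short filenames on NTFS. Requires an elevated
# (Administrator) prompt; existing short names are not removed, only the
# creation of new ones is stopped.
subprocess.run(["fsutil", "behavior", "set", "disable8dot3", "1"], check=True)

# Query the current setting to confirm the change took effect.
subprocess.run(["fsutil", "behavior", "query", "disable8dot3"], check=True)
```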

Disk and Storage Considerations

Store source data locally if possible

It is strongly recommended that you store the data to be indexed locally, on the same machine the Zoom Indexer runs on. Generating the indexes involves heavy disk read activity and some writes, so a local internal drive gives a significant performance boost over an external drive or remote location. Local data also removes the network connection as a possible point of failure during a long indexing session.

Use separate drives

When possible, use separate disk drives for the source data being indexed and for the index files output by the indexer. This allows the OS to read from one disk while writing to the other, slightly improving indexing performance.

Ensure adequate disk space

Before starting, make sure you have enough disk space to complete the indexing process. Index files produced from large amounts of data are inherently large themselves. The exact size varies greatly with the documents being indexed, but as an example, indexing 300GB of data might produce a set of index files around 50GB in size.
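As a rough planning sketch only (the ratio is highly content-dependent):

```python
def estimated_index_size_gb(source_gb: float, ratio: float = 50 / 300) -> float:
    """Very rough index size estimate. The default ratio (~17%) comes from
    the 300GB -> 50GB example above; the Wikipedia case study below came
    in lower, at 4-10% of source size, so treat this as an upper bound."""
    return source_gb * ratio

print(estimated_index_size_gb(300))   # ~50 GB
```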

Defragment drives

Another performance improvement can be gained by ensuring that the drive where the source data resides has been defragmented.

Split indexing process over multiple machines

Where practical, split the source files into categories and index smaller portions of the data on separate machines, greatly reducing the time required to index the complete data set. Wrensoft provides a free software tool called Zoom MasterNode that can be used as a front-end to these distributed index files so that they can be searched collectively. MasterNode takes each search request and transparently divides the work amongst its slave node machines (where the actual indexes are stored), which can result in better search performance and greater search capacity.
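How you split the source data is up to you; MasterNode only handles the search side. As one hypothetical approach, the sketch below shards files by total size rather than file count, so each machine gets a roughly equal indexing workload:

```python
from pathlib import Path

def shard_by_size(root: Path, machines: int) -> list[list[Path]]:
    """Greedily split all source files into one shard per indexing machine,
    balancing by total bytes so per-machine indexing time is comparable."""
    files = sorted((f for f in root.rglob("*") if f.is_file()),
                   key=lambda f: f.stat().st_size, reverse=True)
    shards: list[list[Path]] = [[] for _ in range(machines)]
    totals = [0] * machines
    for f in files:
        i = totals.index(min(totals))      # assign to the lightest shard so far
        shards[i].append(f)
        totals[i] += f.stat().st_size
    return shards

# e.g. shard_by_size(Path(r"D:\source_data"), machines=4)  # hypothetical path
```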

Avoid using encrypted or compressed folders

Where possible, avoid compressed or encrypted NTFS folders, as both can considerably slow down indexing.

OS and Software Considerations

Use NTFS and not FAT

When indexing millions of pages, the output index files can be very large. FAT32 unfortunately only supports files up to 4GB in size, whilst NTFS can support single files of up to 16 exabytes.

Use 64-bit Windows

32-bit Windows applications are limited to 2GB of usable RAM (with a few exceptions that are outside the scope of this document). With the 64-bit versions of Windows this limit is much higher, with some versions supporting up to 2TB of RAM.

Although no motherboard supports that much RAM (nor will for the foreseeable future), motherboards supporting 32GB or more are now common. As well as supporting larger amounts of physical memory, 64-bit OSes also allow much larger amounts of virtual memory, which is a huge advantage, and in practice a requirement, when continuously indexing large numbers of files over several days.

Use Zoom 64-bit

To obtain the full potential of a 64-bit OS, you need to use the 64-bit version of the Zoom Indexer (found in the Windows Start menu if you have purchased the Enterprise Edition). You should definitely use the 64-bit version if you meet either of the following conditions:

  • You are indexing an enormous number of web pages or websites, and your index files are exceeding 2GB in size. The normal 32-bit Zoom Indexer will warn you when this happens.
  • You are indexing an enormous number of web pages or websites, you require more than 2GB of memory, and you have more than 3GB of memory installed on your computer.

Disable any interfering software

Indexing gigabytes of documents takes time. Apart from the OS considerations above, it is highly recommended to turn off Windows Automatic Updates and to disable any antivirus scanning of the data set and index output folders. An automatic update may trigger a restart, and you do not want to lose days of indexing progress to something as simple as a reboot. Similarly, you do not want to suffer the performance hit of antivirus on-access scans of the data set or index folders.

Hardware Considerations

Use reliable hardware

The hardware chosen to perform the indexing, especially when indexing large numbers of files, should be stable and reliable. As indexing hundreds of gigabytes of data can easily take several days, you do not want to risk losing progress to a hardware fault or power failure.

Ideally, a burn-in test should be carried out on the designated hardware prior to starting the indexing. Tools like PassMark BurnInTest (a product from our parent company) can be used for this. If the indexing machine is located somewhere with intermittent power failures, you should definitely consider a backup power supply (UPS).

Use high-end hardware if possible

Apart from being stable and reliable, the hardware should also be relatively new and high performing. The faster the hardware, the less time the full index will take. The important elements, in order of importance for each indexing mode, are:

Spider mode (in order of importance):

  • Fast, stable network connection
  • Lots of RAM (at least 2GB + 1GB per 1 million files indexed)
  • Fast CPU (dual or quad core preferred, but more than 4 cores doesn't add much benefit)
  • Disk speed is not so critical

Offline mode (in order of importance):

  • Fast internal hard disks, preferably 2 or more (read speed is more critical than write speed)
  • Fast CPU (dual or quad core preferred, but more than 4 cores doesn't add much benefit)
  • Lots of RAM (at least 2GB + 512MB per 1 million files indexed)
  • Network connection is not critical

It should be noted that Zoom will not proceed with indexing if it estimates that you need over 135% of the total amount of RAM on your computer (this is total installed RAM, not just available RAM). If you get this warning, you will need to either install more RAM or run the indexing on a machine with more RAM.
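Zoom's internal estimate is not published, but as a rough planning sketch you can combine the rule-of-thumb RAM figures from the list above with the 135% cutoff:

```python
def estimated_ram_gb(million_files: float, spider_mode: bool) -> float:
    """Rule of thumb from the list above: a 2GB base, plus 1GB per million
    files in spider mode, or 512MB per million files in offline mode."""
    per_million_gb = 1.0 if spider_mode else 0.5
    return 2.0 + per_million_gb * million_files

def indexing_will_proceed(million_files: float, spider_mode: bool,
                          installed_ram_gb: float) -> bool:
    """Zoom refuses to start if its estimate exceeds 135% of installed RAM."""
    return estimated_ram_gb(million_files, spider_mode) <= 1.35 * installed_ram_gb

# 5 million files in spider mode on a 4GB machine: ~7GB needed vs a 5.4GB cap.
print(indexing_will_proceed(5, spider_mode=True, installed_ram_gb=4))   # False
```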

Index one site at a time when using an unreliable connection

When trying to index a number of sites over an unreliable connection, it is safer to index one domain at a time using Zoom's incremental indexing tool ("Index"->"Incremental indexing"->"Add start points or domains to existing index"), making sure to back up the index files after each domain is successfully indexed.

Configuring Zoom

Use the CGI option in Zoom

If your dataset is very large and consists of over 65,000 files, you will need to set the platform to CGI in Zoom. The CGI version can support sites of a million pages or more (depending on the content); the more RAM the indexing machine has, the greater the number of files supported.

The CGI version also provides high-performance support for very large sites. It is a binary application so it does not have any of the overhead issues of scripting languages such as PHP or ASP.

Use multiple threads

When using the spidering mode to index large sites, you should set the field "Multiple threads" (under "Configure"->"Spider options") to at least 2. This option tells Zoom to use multiple dedicated threads for downloading files, which will greatly increase the speed of the indexing. This option essentially allows Zoom to download multiple files in the background whilst indexing at the same time. The optimal setting depends on network bandwidth and latency, the speed of the site being indexed, and how much load you want to put on the site being indexed. Note that if you are indexing a site you don't own, then it is considered polite not to load the server too heavily (you can use "Spider throttling" to ensure this).
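As an illustration of why this helps (a generic sketch of the technique, not Zoom's internal code): downloading is network-bound, so a small pool of download threads keeps the CPU-bound indexing side continuously fed. The URLs below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    """Download one page; most of the elapsed time is spent waiting on the
    network rather than using the CPU."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()

# Hypothetical URL list for illustration only.
urls = [f"https://example.com/page{i}.html" for i in range(100)]

# Four download threads fetch pages in the background while the main
# program would be free to parse and index already-downloaded pages.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))
```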

Index more pages and fewer words

The field "Max unique words" has a considerable impact on the amount of RAM required by the Zoom Indexer. On a typical 32-bit machine with 2GB of RAM you will be limited to setting this field to around 7 million unique words, when indexing around 5 million files. In order to index a greater number of files before hitting the max unique words limit, you could use the "Limit words per file" field. This option essentially restricts the indexer from indexing more than the set number of words per file, meaning that any unique words further down in the file will not be indexed, and hence increasing the number of files that you could potentially index. The most interesting sections of a document are generally near the top of the document. So limiting the words indexed per file can often reduce the RAM usage without having a significant negative impact on the search accuracy.

Index multiple pages from one domain

When using Spider mode, it is considerably faster to index ten pages from a single site than to index one page from each of ten different sites. This is because, when indexing a single site, the indexer only needs to create one HTTP session instead of one for each of the ten sites. The multi-threading feature in Spider mode also works much better against a single site than against multiple sites.

Use conservative limits when using spider mode on 32-bit

When using Spider mode on a 32-bit machine, it is recommended that the "Estimated RAM required" figure on the Limits tab be no more than two thirds of your total RAM. Spidering consumes considerably more memory than offline mode, and on 32-bit Windows the virtual address space available to a process is limited to 2GB. For example, on a 32-bit Windows Vista machine with 2GB of RAM, max files to index set to 1 million, max unique words set to 2 million, and estimated RAM required at 1.9GB, virtual memory ran out after about 5 hours, causing the indexer to stop. If this happens, you can still add the remaining pages to the existing set of index files by restarting the machine and using the incremental indexing options.

Index in offline mode when possible

Depending on the site, indexing in Offline mode consumes around 50% less RAM than indexing in Spider mode, and can be almost 3 times faster. So if you are trying to index an extremely large website or group of websites and you have access to the web pages themselves, it is worth considering creating the indexes in Offline mode, especially if the network is slow or unreliable. It can turn a 3-day indexing job into a 1-day job.

Experiment with the optimization bar

If you are unhappy with the performance of the CGI, you can experiment with the Optimization slider bar, which trades search accuracy against speed: faster searches at the cost of accuracy (potentially omitting some results), or more accurate results at the expense of slower searching. The bar adjusts three settings: Max matches, Max context seeks, and Max search time. It matters most for exact phrase matching, for searches that may return a huge number of results (e.g. over 1,000), and for indexes of very large sites or datasets containing over 1 million files. It should allow you to find the right balance between search accuracy and search speed.

Ensure configuration file is correct. Start small and ramp up

When indexing large volumes of data, it is highly recommended to verify that your configuration is correct on a much smaller set of files before attempting to index the complete set. This avoids wasting days indexing the entire data set only to find afterwards that results are not displayed the way you want, or that the index includes pages and content you never wanted to be searchable.

Also, ensuring your configuration file is correct before indexing the complete dataset avoids problems when performing an incremental index. To perform an incremental index, you must NOT have modified your indexing configuration since the last index was made: the .ZCFG file must contain exactly the same settings, and the index files must still be in the specified output folder. This means that if you index your entire website using a particular configuration file and later decide to add more web pages, you must use the same configuration file as before. If you make any config changes for the new pages that were not in the original configuration file, the incremental index will fail and you will be forced to do a full index instead.
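Zoom will detect a mismatched configuration itself, but it is cheap to verify before committing to a long run. A sketch using a simple file hash (the filenames are hypothetical):

```python
import hashlib
from pathlib import Path

def config_fingerprint(zcfg: Path) -> str:
    """Hash the .ZCFG file so it can be compared byte-for-byte against
    the copy used for the original full index."""
    return hashlib.sha256(zcfg.read_bytes()).hexdigest()

# Hypothetical filenames: a saved copy from the full index vs the current file.
original = config_fingerprint(Path("site_full_index.zcfg"))
current = config_fingerprint(Path("site.zcfg"))
if original != current:
    raise SystemExit("Config has changed since the full index; "
                     "an incremental index will fail.")
```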

File Filtering and Skipping

Filter out unwanted data

When indexing very large amounts of data, it can be highly advantageous to configure the "Skip options" and "Filtering" options in the Zoom configuration. With large volumes of data there are often huge numbers of pages that you are not interested in and do not want included in the index. Using the Configure tab in the Zoom Indexer, under the appropriate views, you can skip this unwanted content.

Use the Skip option

The Skip option lets you skip pages and folders based on their file or folder name. Add an entry to the "Page and folder skip list" for every type of file or folder you want to skip (note that this field is case sensitive). Use the skip list where possible, as it is more efficient than the filtering option. See the User's Guide for additional details.

Use the Filter option

The Filtering view lets you skip pages based on their content rather than their filename: pages can be skipped depending on whether or not they contain a particular word. To use this option, check the "Enable content filtering" checkbox and specify a keyword preceded by the appropriate rule ("+" means a page must include the keyword, "-" means the page must not include it). See the User's Guide for additional details.
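The rule logic is straightforward; the sketch below mirrors the documented "+"/"-" behaviour (it is not Zoom's actual implementation):

```python
def passes_content_filter(page_text: str, rules: list[str]) -> bool:
    """Apply "+"/"-" content filter rules as described above: a page must
    contain every "+" keyword and must not contain any "-" keyword."""
    text = page_text.lower()
    for rule in rules:
        keyword = rule[1:].lower()
        if rule.startswith("+") and keyword not in text:
            return False
        if rule.startswith("-") and keyword in text:
            return False
    return True

# The case study below skips Wikipedia redirect stubs with a "-Redirecting" rule:
print(passes_content_filter("Redirecting to Main Page", ["-Redirecting"]))  # False
```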

Managing an Existing Index

Use incremental indexing to add new pages to large indexes

Zoom allows you to perform incremental indexing so that you can update, add pages to, and make changes to an existing set of index files without performing a full re-index. This is particularly useful for a large site (or an index of many sites) with minor incremental changes, where regularly re-indexing the entire site in Spider mode would be too time consuming (or bandwidth consuming). The "Update an existing index" option is only available in Spider mode, not Offline mode: offline indexing generates no Internet traffic and is many times faster than spidering, so incremental updates should not be necessary and we recommend a full re-index for offline mode users. See the User's Guide for additional details about incremental indexing.

Add pages in batches

When adding pages to an existing large set of index files, it is considerably more efficient to add a group of pages together than to add them one at a time. For example, with a set of index files covering 2.5 million pages, it took the indexer 6 minutes 14 seconds to add a single page, whereas adding a group of 5 pages took only 6 minutes 43 seconds (an average of about 1.3 minutes per page). This is because most of the time is spent loading the dictionary and data files into memory, not adding the new files. With very large index files, the time taken to actually add a new page is insignificant compared with the time spent loading the existing index files.
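The timings suggest a simple cost model: a large fixed cost to load the existing index, plus a small per-page cost. A sketch with illustrative constants fitted to the figures above:

```python
def minutes_to_add_pages(pages: int, load_minutes: float = 6.0,
                         per_page_minutes: float = 0.12) -> float:
    """Fixed cost to load the existing dictionary and data files into
    memory, plus a small per-page cost. Constants are illustrative values
    fitted to the timings quoted above, not measured parameters."""
    return load_minutes + per_page_minutes * pages

print(minutes_to_add_pages(1))   # ~6.1 minutes for one page
print(minutes_to_add_pages(5))   # ~6.6 minutes for five pages together
```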

Indexing Limitations

Although Zoom is capable of indexing a huge number of pages, it is worth noting that the 64-bit software has some theoretical limitations. These are as follows:

| Limit | Value |
|---|---|
| Maximum base unique words (a base word is the stemmed word, e.g. "run") | 16 million (16,777,216) |
| Maximum variant unique words (a variant word is an un-stemmed word, e.g. "running", "runs", "RUN") | 256 variant words per base word |
| Maximum total variant unique words | 4 billion (16 million × 256 = 4,294,967,296) |
| Maximum number of pages | 2 billion (2,147,483,647) |
| Maximum number of words (non-unique) | No fixed limit |

In the real world, however, practical limits on available RAM, indexing time, machine stability, and acceptable search times mean that these theoretical limits will probably never be hit.

In our real-life testing on a typical 32-bit Windows Vista machine with 2GB of RAM, the Zoom Indexer could support up to around 7 million unique words and 5 million indexed files in offline mode with the "Limit words per file" option set to 250 words.

On a 64-bit Windows machine with 4 GB of RAM, Zoom can support about 4 million files and 19 million unique words with no limit set on the number of words per page.

At this point the machines were running out of RAM, indexing times were measured in days, and some worst-case search times were becoming unacceptably long. So in general we suggest staying well below these limits until hardware and network performance improves in future years. See the example below.

Real Life Example (indexing Wikipedia)

The following test scenario was carried out in order to demonstrate Zoom's capabilities in indexing vast volumes of data, and to uncover any limitations that it may have.

Sample Data

For our tests we used a set of static HTML dumps of Wikipedia (http://static.wikipedia.org/) from the English June 2008 edition. There were 85,651 folders and 14,257,513 files, with a total uncompressed size of 325GB on disk, of which photos and images accounted for 93GB. The folder structure was .../<letter>/<letter>/<letter>/, with each directory holding the HTML pages appropriate to that prefix. For example, to find a page on zombies you would navigate to .../z/o/m/ and look for the zombies.html page.

Hardware Configuration

Two different machines were used to test the relative limits of indexing large volumes of data:

Windows 32-bit Machine:
Gigabyte GA-MA790XT-UD4P Motherboard
2.6 GHz AMD Phenom II X4 910 (4 cores)
2 GB DDR3 RAM
750GB Barracuda SATA hard drive (7200rpm and 32MB cache) for the output of index files
500GB Barracuda SATA hard drive (7200rpm and 32MB cache) to store the sample Wikipedia data
Windows Vista Ultimate 32-bit

Windows 64-bit Machine:
Intel DX58S0 Motherboard
2.67GHz Intel Core i7 920 (4 cores + hyperthreading)
4GB DDR3 RAM
500GB Western Digital SATA hard drive (7200rpm and 16MB cache) for the output of index files
500GB Hitachi SATA hard drive (7200rpm and 16MB cache) to store Wikipedia sample data
Windows 7 Ultimate 64-bit

The 32-bit version of the Zoom Indexer was run on the 32-bit machine in Offline mode, with the sample data accessed directly via the filesystem. On the 64-bit machine, the 64-bit version of Zoom was run in both Offline mode and Spider mode; for Spider mode, the target site was hosted locally on the machine via an Apache HTTP Server.

Zoom configuration file:
V6 build 1014 of Zoom was used.

| Configuration option | 32-bit | 64-bit | 64-bit (Spider) |
|---|---|---|---|
| Indexing mode | Offline mode | Offline mode | Spider mode |
| Platform | CGI/Win32 | CGI/Win32 | CGI/Win32 |
| Multiple threads | N/A | N/A | 10 |
| Max limit for files to index | 5,000,000 | 10,000,000 | 5,000,000 |
| Max limit for unique words | 7,250,000 | 19,500,000 | 14,000,000 |
| Limit words per file | 150 | None | 250 |
| Optimization slider | Max matches = 10000, Max context seeks = 5000, Max search time = 120 seconds (all three configurations) | | |
| Content filtering | -Redirecting, -Orbital period (all three configurations) | | |
| Page and folder skip list | Talk, talk, sandbox (all three configurations) | | |

In our real-life scenario we had a large number of Wikipedia "Talk" pages, separate from the main content, which we were generally not interested in. To configure the indexer to skip these pages and folders, we added the words "Talk" and "talk" (both cases were needed due to case sensitivity) to the "Page and folder skip list" in the Skip options view.

With this configuration the indexer checks each file and folder name against the skip list and skips them accordingly.

Likewise, we had a large number of redirect pages containing very little content apart from a redirect link. To skip these pages based on their content, we simply added "-Redirecting" to the filtering list.

The main reason for adding these configuration options was that our sample data set was extremely large (over 14 million files) and we only wanted to index the more relevant and interesting pages.

Screenshot of the 64-bit machine in Spider mode:

This screenshot shows 3.7 million pages and 891 million words indexed over a period of 3.8 days of continuous indexing.


Final metric results:

| Metric | Machine 1 (32-bit) | Machine 2 (64-bit, Spider mode) | Machine 2 (64-bit, Offline mode) |
|---|---|---|---|
| Files indexed | 2,752,682 | 3,715,537 | 4,450,797 |
| Files skipped | 5,562,179 | 23,681,737 | 7,741,278 |
| Files filtered | 1,514,815 | 1,005,619 | 2,488,281 |
| Files downloaded | N/A | 4,721,166 | N/A |
| Unique words found | 7,250,000 | 10,335,921 | 19,500,000 |
| Variant words found | 5,889,984 | 8,300,273 | 14,869,970 |
| Total words found | 432,244,142 | 891,722,867 | 2,515,481,986 |
| Avg. unique words per page | 2.63 | 2.78 | 4.38 |
| Avg. words per page | 157 | 239 | 565 |
| Elapsed index time (HH:MM) | 23:42 | 91:29 | 73:34 |
| Peak physical memory used during indexing (MB) | 789 | 2,717 | 2,555 |
| Peak virtual memory used during indexing (MB) | 1,712 | 6,386 | 4,749 |
| Size of index files (GB) | 4.85 | 10 | 22.9 |
| Size of index as a % of source data | 4% | 7% | 10% |
| Typical single word search times (Core i7 CPU, 64-bit, 4GB RAM) | 6 secs | 7 secs | 11 secs |
| Typical exact phrase search times (Core i7 CPU, 64-bit, 4GB RAM) | 5 to 7 secs | 10 to 30 secs | 15 to 50 secs |

Dedicated Search Appliances

Further significant performance improvements can be made by switching to a dedicated search appliance; we offer bespoke search appliances as a service. On a dedicated machine it is possible to switch from the CGI option to a related technology known as FastCGI. Example performance improvements are as follows:

| Search type | CGI search times | FastCGI search times |
|---|---|---|
| Single word | 4.4 sec | 0.05 sec |
| Multi-word | 4.4 sec | 0.08 sec |
| Wildcard search | 4.6 sec | 0.09 sec |
| Exact phrase | 5.6 sec | 0.44 sec |

Times were measured on old Pentium 4 hardware with a million-page index.

Advantages of FastCGI

  • Massive increases in search speed, as a result of advanced caching and the removal of disk access.

Disadvantages of FastCGI

  • Significantly more RAM is required in the server, and the RAM is held in use for longer periods.
  • Initial setup is more complex than CGI (which can already be complex on some servers).
  • Slightly more ongoing maintenance is required.
  • Updating the index can be more complex.

So FastCGI only really makes sense on dedicated servers. If you are interested, contact us for more information about FastCGI and dedicated appliances.