Principles of Solr application design – part 2 of 2

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! Here’s the second part, you can also read the first part.

8. Have enough RAM

The single biggest performance bottleneck in most search installations is lack of RAM. Search is an I/O-intensive process, and the more that disk reads can be cached in memory, the better performance will be. As a rough guideline, your available RAM should be at least 50% the total size of your Solr index files. For demanding applications, up to 100% of the index size may be necessary.

I/O caching is incremental rather than immediate, and some minutes of searches under load may be required to warm them. Don’t expect high performance until the caches are thoroughly warmed up.

An increasingly popular alternative is to use solid state disks (SSDs) instead of traditional hard disks. These are hundreds of times faster, and mean that cold searches should be reasonably fast. They also reduce the amount of RAM required to perhaps as little as 10% of the index size (although as always, this will require testing for the application in question).

9. Use a dedicated machine or VM

Don’t share your Solr servers with any other demanding processes such as SQL databases. For dependable performance, Solr should not have to compete with other processes for resources. VMs are an effective way of ring-fencing resources.

10. Use MMapDirectory and 64-bit systems

By default, Solr on 64-bit systems will open indexes with Lucene’s MMapDirectory, which memory-maps files rather opening them for read/write/seek. Don’t change this! MMapDirectory allows for the most effective use of resources, in particular RAM (which as already described is a crucial resource for search performance).

11. Tune the Solr caches

The OS disk cache improves performance at the low level. At the higher level, Solr has a number of built-in caches which are stored in the JVM heap, and which can improve performance still further. These include the filter cache, the field value cache, the query result cache and the document cache. The filter cache is probably the most important to tune if you are using filtered queries extensively or faceting with the enum method – each entry in the filter cache takes up ( number of docs on shard / 8 ) bytes of space, so if you’ve got a cache limit of 4,000 then you’ll require (numDocs * 500) bytes to hold all of them. However, tuning all of these caches has the potential to improve performance.

To tune the caches, you should allow Solr to run for a while with real or simulated search activity. Then go to the Plugin/Stats page in the admin web interface. The first important number in the cache statistics is ‘hitratio’. This should ideally be as close to 1.0 as possible, indicating that most lookups are being serviced by the cache. Then, ‘evictions’ indicates how many items have been removed from the cache due to limited space. This should ideally be as close to zero as possible, or at least much smaller than ‘lookups’.

If ‘evictions’ is high and ‘hitratio’ low, you should increase the maximum cache size in solrconfig.xml. It is impossible to say what a good starting point for a specific application is, but we often pick 4000.

If the cache is performing well, it may be worth reducing the maximum size and re-testing. The purpose of the maximum size is to prevent the cache growing without limit and filling the JVM heap, which links to point 12 below.

See here more information on Solr caches.

12. Minimise JVM heap space

Once you have tuned your Solr caches, try to reduce the maximum JVM heap (set with -Xmx) to a reasonably small size – big enough to hold the caches and all the other data required for searching and indexing, but not much bigger. There is a graphical depiction of the JVM heap in the Solr admin dashboard which allows a quick overview for rough tuning. For a better picture, it may be worth using a tool like JConsole to monitor the heap as the application is used.

The reason to reduce the heap size is to free RAM for the OS disk cache, as described in point 8.

Garbage collection (GC) can be a problem if the heap size is large. See here for information on GC tuning in Solr and other performance issues.

13. Handle multiple languages with multiple fields

Some search applications need to be able to support documents of different languages within the same index. This may conflict with the use of stemming, stopwords and synonyms to improve search accuracy. Furthermore, languages like Japanese are not tokenised by Solr in the same way as European languages, due to different conventions on word boundaries. One effective method for supporting mutiple languages in an index with per-language term processing is outlined as follows. Note that this depends on knowing in advance what language a section of text is in.

First, create a variant of each text field in the index schema for each language to be supported. The schema.xml supplied with Solr has example fieldtypes for a wide range of languages which may be adapted as necessary. For example:

˂field name="content_en" type="text_en" indexed="true" stored="true"/ ˃
˂field name="content_fr" type="text_fr" indexed="true" stored="true"/ ˃
˂field name="content_jp" type="text_jp" indexed="true" stored="true"/ ˃

Note the use of language codes to distinguish the names of the fields and fieldtypes. Then, when indexing each document, send each section of text to the appropriate field. E.g., if the document is entirely in English, send the whole thing to content_en. If it has sections in English, French and Japanese, send them to content_en, content_fr and content_jp respectively. This ensures that text is tokenised and normalised appropriately for its language.

Finally for searching, use the eDisMax query parser, and include all the language fields in the qf parameter (and pf, if using). E.g., in solrconfig.xml:

˂requestHandler name="/search" class="solr.SearchHandler"˃
˂lst name="defaults"˃
˂str name="qf"˃content_en content_fr content_jp˂/str˃
˂str name="pf"˃content_en content_fr content_jp˂/str˃
...

When a search is executed with this handler, subqueries will be generated for each language with the appropriate term processing, and searched against each language text field. This approach should give the best precision and recall in a multi-language application.

Leave a Reply

Your email address will not be published. Required fields are marked *