We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing it as a two-part series. If you have any feedback, we’d welcome comments! So without further ado, here’s the first part:
1. Use the latest release of Solr
Unless there are compelling reasons not to, such as reliance on a discontinued feature (which is rare), it is best to use the latest release of Solr, downloaded from http://lucene.apache.org/solr/ . Every minor release in the 4.x series has brought both functional and performance enhancements, and revision releases have fixed known bugs. Since the API (as a rule) remains backwards compatible, the potential gains in performance and utility should outweigh the minor inconvenience of the upgrade.
2. Use SolrCloud for scaling and robustness
Before the Solr 4 release, support for sharding (distributing a single search over many Solr instances) and replication (for robustness and scaling search load) involved a significant amount of manual configuration and development. The introduction of SolrCloud means that sharding and replication are now built into the core product, and can be used with simple configuration and no extra coding.
For trivial applications, SolrCloud may not be required, but it is the simplest way to build in robustness and scalability. There’s more about SolrCloud here.
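As a rough illustration of how little is needed, the sketch below builds a request to Solr's Collections API asking for a sharded, replicated collection. The host, port and collection name are assumptions made for the example. It also assumes a SolrCloud cluster is already running with a suitable configuration set available in ZooKeeper. The URL is only constructed here, not sent.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build (but do not send) a Collections API CREATE request URL."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


# Hypothetical host and collection name, for illustration only.
print(create_collection_url("http://localhost:8983/solr", "articles", 2, 2))
```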
3. Don’t expose the Solr API
Although Solr is not inherently insecure, neither is it designed to be exposed to end-users (and emphatically not to the internet at large). Anyone with access to the root Solr endpoint would be able to delete indexes and modify or insert items at will. Restricting access to search handlers (e.g. /solr/select) avoids this possibility, but is nonetheless a bad idea since it may allow users to construct arbitrary queries which could degrade performance or provide access to unauthorised data. Furthermore, there remains the slim possibility of security holes in the Solr API.
For these reasons, any external access to search should be through a proxy interface which is restricted to the functionality required by the application. Access to the Solr API should be restricted by network design and/or firewalls. This applies equally to AJAX UIs, which should talk to Solr via an intermediary web application rather than directly.
The intermediary code should perform at least some basic validation of parameters before sending to Solr, for example checking their type and ensuring that query strings are under a certain length (depending on the search interface). This allows attempts at compromising the system to be detected at an early stage and blocked.
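A minimal sketch of such validation is shown below. Everything in it is an assumption made for illustration: the parameter names (q, page, sort), the length and paging limits, and the whitelist of sort orders would all depend on the search interface in question. The point is that the proxy builds the Solr parameters itself from checked values rather than passing user input through.

```python
MAX_QUERY_LENGTH = 200   # assumed limit; tune to the search interface
MAX_PAGE = 100           # assumed limit on deep paging
ROWS_PER_PAGE = 10
ALLOWED_SORTS = {"score desc", "date desc"}  # hypothetical sort options


class InvalidSearchRequest(ValueError):
    """Raised when a request should be rejected before reaching Solr."""


def validate_search_params(raw):
    """Check user-supplied parameters and return the parameters to send to Solr."""
    query = raw.get("q", "")
    if not isinstance(query, str) or not query.strip():
        raise InvalidSearchRequest("q must be a non-empty string")
    if len(query) > MAX_QUERY_LENGTH:
        raise InvalidSearchRequest("q is too long")

    try:
        page = int(raw.get("page", 1))
    except (TypeError, ValueError):
        raise InvalidSearchRequest("page must be an integer") from None
    if not 1 <= page <= MAX_PAGE:
        raise InvalidSearchRequest("page is out of range")

    sort = raw.get("sort", "score desc")
    if sort not in ALLOWED_SORTS:
        raise InvalidSearchRequest("unsupported sort order")

    # Only whitelisted, checked values are passed on; anything else is dropped.
    return {
        "q": query.strip(),
        "start": (page - 1) * ROWS_PER_PAGE,
        "rows": ROWS_PER_PAGE,
        "sort": sort,
    }
```

Rejected requests can be logged by the intermediary application, which gives an early signal of anyone probing the system.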
4. Don’t use third-party Solr client libraries
The problem with third-party client libraries is that they create a tight coupling between the application and Solr. The Solr XML and JSON APIs are simple, and general-purpose libraries for handling these formats are readily available for most programming languages. Third-party libraries are an unnecessary additional dependency and a potential source of bugs and unexpected behaviour. Another risk is that development may be discontinued for various reasons, meaning that future Solr features are not easily accessible.
The one exception to this rule is the SolrJ Java client library, since it is part of the general Solr release and is therefore fully compliant with and tested against the corresponding version of Solr.
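For languages other than Java, a few lines of standard-library code are usually enough. The sketch below, in Python, queries a select handler over HTTP and reads the JSON response. The base URL and the collection name "articles" are assumptions for the example, and error handling is left out for brevity.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, collection, params):
    """Build a select URL that asks Solr for a JSON response."""
    query = dict(params, wt="json")
    return "{}/{}/select?{}".format(base_url.rstrip("/"), collection, urlencode(query))


def search(base_url, collection, params, timeout=5):
    """Send the query to Solr and return the decoded JSON response."""
    url = build_select_url(base_url, collection, params)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_docs(response):
    """Return the list of matching documents from a select response."""
    return response.get("response", {}).get("docs", [])


if __name__ == "__main__":
    # Requires a running Solr instance; host and collection are hypothetical.
    result = search("http://localhost:8983/solr", "articles", {"q": "title:solr", "rows": 5})
    for doc in extract_docs(result):
        print(doc)
```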
5. Specify interfaces
All interfaces between components in the application must be agreed between sys ops and developers before development is started. Interfaces should be treated as contracts which software components adhere to. Early documentation of interfaces will reduce the risk of unexpected dependencies leading to problems in deployment.
As far as possible, interfaces should be RESTful web APIs and use standard formats such as JSON and XML. This creates loose coupling between components and also makes it easy to test functionality from the command line or a browser.
6. Put apps live early, on isolated systems
Development should be iterative, with short development cycles (no more than a few weeks). Code should be tested and deployed at the end of each cycle. By using isolated systems, fake data and/or limiting access to authorised testers, functionality and performance may be tested as soon as possible on a ‘live’ system, avoiding the risk of unexpected problems if deployment is postponed until the end of the development cycle.
7. Do realistic performance tests early and often
Except for very small indexes, search performance is often unpredictable, particularly under load. To ensure that performance meets requirements, testing a full index under load with realistic queries should be scheduled as early as possible in development. If you don’t have the data available to create a full index, simulate it (e.g. using freely available text such as Wikipedia).
As new functions (e.g. facets) are added, performance characteristics may change significantly, so it is important that performance tests are part of every development cycle. JMeter is a popular tool for load testing; alternatively, test scripts can easily be written in a language like Python.
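As a sketch of what such a script might look like, the Python below runs a list of queries concurrently and reports simple latency statistics. The run_query argument is a placeholder: it is assumed to be any callable that sends one query to a test Solr instance. The concurrency level and the choice of statistics are also assumptions rather than recommendations.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_call(run_query, query):
    """Return the wall-clock time in seconds taken by one query."""
    start = time.perf_counter()
    run_query(query)
    return time.perf_counter() - start


def percentile(sorted_values, fraction):
    """Pick the value at the given fraction of an ascending, non-empty list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def load_test(run_query, queries, concurrency=8):
    """Run all queries using `concurrency` worker threads and summarise latencies."""
    queries = list(queries)
    if not queries:
        raise ValueError("at least one query is required")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda q: time_call(run_query, q), queries))
    return {
        "count": len(latencies),
        "median": percentile(latencies, 0.5),
        "p95": percentile(latencies, 0.95),
        "max": latencies[-1],
    }
```

Feeding this with queries sampled from real search logs, where available, gives a more realistic picture than repeating a handful of hand-written queries, since repeated queries are likely to be served from Solr's caches.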
More to come next week!