A search-based suggester for Elasticsearch with security filters
http://www.flax.co.uk/blog/2017/11/16/search-based-suggester-elasticsearch-security-filters/
Thu, 16 Nov 2017 14:30:12 +0000

Both Solr and Elasticsearch include suggester components, which can be used to provide search engine users with suggested completions of queries as they type:

Query autocomplete has become an expected part of the search experience. Its benefits to the user include less typing, speed, spelling correction, and cognitive assistance.

A challenge we have encountered with a few customers is autocomplete for search applications which include user-based access control (i.e. certain documents or classes of document are hidden from certain users or classes of user). In general, it is desirable not to suggest query completions which only match documents the user does not have access to. For one thing, if the system suggests a query which then returns no results, it confounds the user’s expectation and makes it look like the system is in error. For another, suggestions may “leak” information from the system that the administrators would rather remain hidden (e.g. an intranet user could type “dev” into a search box and get “developer redundancies” as a suggestion.)

Access control logic is often implemented as a Boolean filter query. Although both the Solr and Elasticsearch suggesters have simple “context” filtering, they do not allow arbitrary Boolean filters. This is because the suggesters are not implemented as search components, for reasons of performance.

To be useful, suggesters must be fast, they must provide suggestions which make intuitive sense to the user and which, if followed, lead to search results, and they must be reasonably comprehensive (they should take account of all the content which the user potentially has access to.) For these reasons, it is impractical in most cases to obtain suggestions directly from the main index using a search-based method.

However, an alternative is to create an auxiliary index consisting of suggestion phrases, and retrieve suggestions using normal queries. The source of the suggestion index can be anything you like: hand-curated suggestions and logged user queries are two possibilities.

To demonstrate this I have written a small proof-of-concept system for a search-based suggester where the suggestions are generated directly from the main documents. Since any access control metadata is also available from the documents, we can use it to exclude suggestions based on the current user. A document in the suggester index looks something like this:

suggestion: "secret report"
freq: 16
meta:
  - include_groups: [ "directors" ]
    exclude_people: [ "Bob", "Lauren" ]
  - include_groups: [ "financial", "IT" ]
    exclude_people: [ "Max" ]

In this case, the phrase “secret report” has been extracted from one or more documents which are visible to the group “directors” (excluding Bob and Lauren) and one or more documents visible to groups “financial” and “IT” (excluding Max.) Thus, “secret report” can be suggested only to those people who have access to the source documents (if filtering is included in the suggestion query).
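
For illustration, here is what a filtered suggestion query might look like for a hypothetical user Alice who belongs to the groups “financial” and “IT”. The field names follow the document sketch above, and the sketch assumes meta is mapped as a nested field; the actual queries used in the proof of concept may differ (an edge-ngram analysed suggestion field would be a more typical choice than match_phrase_prefix, for instance).

{
  "query": {
    "bool": {
      "must": {
        "match_phrase_prefix": { "suggestion": "secret rep" }
      },
      "filter": {
        "nested": {
          "path": "meta",
          "query": {
            "bool": {
              "must": { "terms": { "meta.include_groups": [ "financial", "IT" ] } },
              "must_not": { "terms": { "meta.exclude_people": [ "Alice" ] } }
            }
          }
        }
      }
    }
  },
  "sort": [ { "freq": "desc" } ]
}

Only suggestions whose access metadata includes at least one of the user’s groups, and which do not explicitly exclude the user, are returned, ranked by how often the phrase occurred in the source documents.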

The proof of concept uses Elasticsearch, and includes Python code to create the main and the suggestion indexes, and a script to demonstrate filtered suggesting. The repository is here.

If you would like Flax to help build suggesters for your search application, do get in touch!

Better performance with the Logstash DNS filter
http://www.flax.co.uk/blog/2017/08/17/better-performance-logstash-dns-filter/
Thu, 17 Aug 2017 15:45:58 +0000

We’ve been working on a project for a customer which uses Logstash to read messages from Kafka and write them to Elasticsearch. It also parses the messages into fields, and depending on the content type does DNS lookups (both forward and reverse.)
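
The pipeline has roughly the shape shown below. This is a simplified sketch rather than the customer’s actual configuration: the kafka input, message parsing and elasticsearch output settings are omitted, since they depend on the Logstash and Kafka versions in use.

input {
  kafka {
    # broker and topic settings omitted
  }
}

filter {
  # parse the raw message into fields here (e.g. with grok), then:
  dns {
    resolve => [ "Source_IP" ]
    action => "replace"
  }
}

output {
  elasticsearch {
    # hosts and index settings omitted
  }
}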

While performance testing, I noticed that adding caching to the Logstash DNS filter actually reduced performance, contrary to expectations. With four filter worker threads and the following configuration:

dns { 
  resolve => [ "Source_IP" ] 
  action => "replace" 
  hit_cache_size => 8000 
  hit_cache_ttl => 300 
  failed_cache_size => 1000 
  failed_cache_ttl => 10
}

the maximum throughput was only 600 messages/s, as opposed to 1000 messages/s with no caching (4000/s with no DNS lookup at all).

This was very odd, so I looked at the source code. Here is the DNS lookup when a cache is configured:

address = @hitcache.getset(raw) { retriable_getaddress(raw) }

This executes retriable_getaddress(raw) inside the getset() cache method, which is synchronised. Therefore, concurrent DNS lookups are impossible when a cache is used.

To see if this was the problem, I created a fork of the dns filter which does not synchronise the retriable_getaddress() call.

address = @hit_cache[raw]
if address.nil?
  # cache miss: do the DNS lookup outside any lock, so lookups can run concurrently
  address = retriable_getaddress(raw)
  # only cache successful lookups
  unless address.nil?
    @hit_cache[raw] = address
  end
end

Tests on the same data revealed a throughput of nearly 2000 messages/s with four worker threads (and 2600 with eight threads), which is a significant improvement.

This filter has the disadvantage that it might redundantly look up the same address multiple times, if the same domain name/IP address turns up in several worker threads simultaneously (but the risk of this is probably pretty low, depending on the input data, and in any case it’s harmless.)

I have released a gem of the plugin if you want to try it. Comments appreciated.

Elasticsearch, Kibana and duplicate keys in JSON
http://www.flax.co.uk/blog/2017/08/03/inconsistent-json-semantics-headache/
Thu, 03 Aug 2017 11:05:14 +0000

JSON has been the lingua franca of data exchange for many years. It’s human-readable, lightweight and widely supported. However, the JSON spec does not define what parsers should do when they encounter a duplicate key in an object, e.g.:

{
  "foo": "spam",
  "foo": "eggs",
  ...
}

Implementations are free to interpret this how they like. When different systems have different interpretations this can cause problems.

We recently encountered this in an Elasticsearch project. The customer reported unusual search behaviour around a boolean field called draft. In particular, documents which were thought to contain a true value for draft were being excluded by the query clause

{
  "query": {
    "bool": {
      "must_not": {
        "term": { "draft": false }
      },
      ...

The version of Elasticsearch was 2.4.5 and we examined the index with Sense on Kibana 4.6.3. The documents in question did indeed appear to have the value

{
  "draft": true,
  ...
}

and therefore should not have been excluded by the must_not query clause.

To get to the bottom of it, we used Marple to examine the terms in the index. Under the bonnet, the boolean type is indexed as the term “T” for true and “F” for false. The documents which were behaving oddly had both “T” and “F” terms for the draft field, and were therefore being excluded by the must_not clause. But how did the extra “F” term get in there?

After some more experimentation we tracked it down to a bug in our indexer application, which under certain conditions was creating documents with duplicate draft keys:

{
  "draft": false,
  "draft": true,
  ...
}

So why was this not appearing in the Sense output? It turns out that Elasticsearch and Sense/Kibana interpret duplicate keys in different ways. When we used curl instead of Sense we could see both draft items in the _source field. Elasticsearch was behaving consistently, storing and indexing both draft fields. However, Sense/Kibana was quietly dropping the first instance of the field and displaying only the second, true, value.
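
For example, fetching one of the affected documents directly (the index, type and id here are invented) returned the stored source verbatim, duplicate key and all:

curl 'http://localhost:9200/myindex/mytype/1?pretty'

{
  ...
  "_source": {
    "draft": false,
    "draft": true,
    ...
  }
}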

I’ve not looked at the Sense/Kibana source code, but I imagine this is just a consequence of being implemented in Javascript. I tested this in Chrome (59.0.3071.115 on macOS) with the following script:

<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <script>
      var o = {
        s: "this is some text",
        b: true,
        b: false
      };

      console.log("value of o.b", o.b);
      console.log("value of o", JSON.stringify(o, "", 2));
    </script>
  </body>
</html>

which output (with no warnings)

value of o.b true
test.html:13 value of o {
 "s": "this is some text",
 "b": true
}

(in fact it turns out that order of b doesn’t matter, true always overrides false.)

Ultimately this wasn’t caused by any bugs in Elasticsearch, Kibana, Sense or Javascript, but the different ways that duplicate JSON keys were being handled made finding the root cause of the problem harder than it needed to be. If you are using the Kibana console (or Sense with older versions) for Elasticsearch development then this might be a useful thing to be aware of.

I haven’t tested Solr’s handling of duplicate JSON keys yet but that would probably be an interesting exercise.

Release 1.0 of Marple, a Lucene index detective
http://www.flax.co.uk/blog/2017/02/24/release-1-0-marple-lucene-index-detective/
Fri, 24 Feb 2017 14:34:05 +0000

Back in October at our London Lucene Hackday Flax’s Alan Woodward started to write Marple, a new open source tool for inspecting Lucene indexes. Since then we have made nearly 240 commits to the Marple GitHub repository, and are now happy to announce its first release.

Marple was envisaged as an alternative to Luke, a GUI tool for introspecting Lucene indexes. Luke is a powerful tool but its Java GUI has not aged well, and development is not as active as it once was. Whereas Luke uses Java widgets, Marple achieves platform independence by using the browser as the UI platform. It has been developed as two loosely-coupled components: a Java web service built on Dropwizard, exposing a REST/JSON API, and a UI implemented in React.js. This approach should make development simpler and faster, especially as there are (arguably) many more React experts around these days than native Java UI developers, and will also allow Marple’s index inspection functionality to be easily added to other applications.

Marple is, of course, named in honour of the famous fictional detective created by Agatha Christie.

What is Marple for? We have two broad use cases in mind: the first is as an aid for solving problems with Lucene indexes. With Marple, you can quickly examine fields, terms, doc values, etc. and check whether the index is being created as you expect, and that your search signals are valid. The other main area of use we imagine is as an educational tool. We have made an effort to make the API and UI designs reflect the underlying Lucene APIs and data structures as far as is practical. I have certainly learned a lot more about Lucene from developing Marple, and we hope that other people will benefit similarly.

The current release of Marple is not complete. It omits points entirely, and has only a simple UI for viewing documents (stored fields). However, there is a reasonably complete handling of terms and doc values. We’ll continue to develop Marple but of course any contributions are welcome.

You can download this first release of Marple here together with a small Lucene index of Project Gutenberg to inspect. Details of how to run Marple (you’ll need Java) are available in the README. Do let us know what you think – bug reports or feature requests can be submitted via Github. We’ll also be demonstrating Marple in London on March 23rd 2017 at the next London Lucene/Solr Meetup.

Simple Solr connector for React.js
http://www.flax.co.uk/blog/2016/06/29/simple-solr-connector-react-js/
Wed, 29 Jun 2016 13:24:12 +0000

We’ve just published a simple (60 lines of code) React.js component to npm which makes it easy to perform searches on a Solr 6 instance and get the data into the app to display. Unlike Twigkit or Searchkit this is not a UI library – it is just a connector. If you use it you will have to implement all the UI components yourself. The npm package name is react-solr-connector and the code is available on GitHub (Apache 2.0 licence).

react-solr-connector provides a React component to contain your app. This component injects a prop into the app which includes a callback for performing a search. When the search results are available, the prop is updated so that the app can re-render. For full details, see README.md.

This package currently provides the bare minimum functionality in order to be useable, and requires knowledge of the Solr API in order to implement a working search app. I intend to start adding helper functions to the package to make this easier.

react-solr-connector is not the right choice if you are using Redux or another state management library. I may implement a Redux reducer for Solr at some point (although it appears that there is at least one other person working on this).

New:  A simple facetted search app with highlighting which uses the React Solr connector.

Running out of disk space with Elasticsearch and Solr: a solution
http://www.flax.co.uk/blog/2016/04/21/running-disk-space-elasticsearch-solr/
Thu, 21 Apr 2016 09:44:35 +0000

We recently did a proof-of-concept project for a customer which ingested log events from various sources into a Kafka – Logstash – Elasticsearch – Kibana stack. This was configured with Ansible and hosted on about a dozen VMs inside the customer’s main network.

For various reasons resources were tight. One problem which we ran into several times was running out of disk space on the Elasticsearch nodes (this was despite setting up Curator to delete older indexes, and increasing the available storage as much as possible). Like most software, Elasticsearch does not always handle this situation gracefully, and we often had to ssh in and manually delete index files to get the system working again.

As a result of this experience, we have written a simple proxy server which can detect when an Elasticsearch or Solr cluster is close to running out of storage, and reject any further updates with a configurable error (503 Service Unavailable would seem to be the most appropriate) until enough space is freed up for indexing to continue. We call this Hara Hachi Bu, after the Confucian teaching to only eat until you are 80% full. It is available to download on GitHub under the Apache 2.0 licence. This is a very early release and we would welcome feedback or contributions. Although we have tested it with Elasticsearch and Solr, it should be adaptable to any data store with a RESTful API.

Technical stuff

The server is implemented using Dropwizard (version 0.9.2), a framework we’ve used a lot for its ease of use and configurability. It is intended to sit between an indexer and your search engine (or a similar disk-based data store), and will check that disk space is available when certain endpoints are requested. If the disk space is less than a configured threshold value, the request will be rejected with a configurable HTTP status code.

There are disk space checkers for Elasticsearch (using the /_cluster/stats endpoint), a local Solr installation, or a cluster of hosts. If using a cluster, each machine is required to regularly post its disk space to the application. Custom implementations can also be added, by implementing the DiskSpaceChecker interface.
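
For reference, the Elasticsearch checker only needs the cluster-wide filesystem statistics, which appear in the /_cluster/stats response in roughly the following form (abridged, with made-up numbers; the exact field names can vary between Elasticsearch versions):

GET /_cluster/stats

{
  ...
  "nodes": {
    ...
    "fs": {
      "total_in_bytes": 429496729600,
      "free_in_bytes": 64424509440,
      "available_in_bytes": 53687091200
    }
  }
}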

The trickiest part of the implementation was to allow Dropwizard’s own endpoints through without them being proxied. We did this by implementing both a filter and a servlet – the filter looks out for locally known endpoints and passes them straight through, while unknown endpoints have a /proxy prefix added to the URL path and are then caught by the proxy servlet. The filter also carries out the disk space check on URLs in the check list, allowing them to be rejected before reaching the servlet. (If you’ve come up with a different solution to this problem, we’d be interested to hear about it.)

The proxy was implemented by extending the Jetty ProxyServlet (http://www.eclipse.org/jetty/documentation/current/proxy-servlet.html) – this allowed us to override a single method in order to implement our proxy, stripping off the /proxy prefix and redirecting the request to the configured host and port.

Internally, the application will build the DiskSpaceChecker defined in the configuration. Dropwizard resources (or endpoints) and health checks are added depending on the implementation, with a default, generic health check which simply checks whether or not disk space is currently available. The /setSpace resource is only available when using the clustered configuration, for example.

Elasticsearch vs. Solr: performance improvements
http://www.flax.co.uk/blog/2015/12/18/elasticsearch-vs-solr-performance-improvements/
Fri, 18 Dec 2015 17:04:20 +0000

I had been planning not to continue with these posts, but after Matt Weber pointed out the GitHub pull requests he’d made to address some methodological flaws (which to my embarrassment I’d not even noticed), another attempt was the least I could do.

For Solr there was a slight reduction in mean search time, from 39ms (for my original, suboptimal query structure) to 34ms and median search time from 27ms to 25ms – see figure 1. Elasticsearch, on the other hand, showed a bigger improvement – see figure 2. Mean search time went down from 39ms to 27ms and median from 36ms to 24ms.

Comparing Solr with Elasticsearch using Matt’s changes, we get figure 3. The medians are close, at 25ms vs 24ms, but Elasticsearch has a significantly lower mean, at 27ms vs 34ms. The difference is even greater at the 99th percentile, at 57ms vs 126ms.

These results seem to confirm that Elasticsearch still has the edge over Solr. However, the QPS measurement (figure 4) now gives the advantage to Solr, at nearly 80 QPS, vs 60 QPS for Elasticsearch. The latter has actually decreased since making Matt’s changes. This last result is very unexpected, so I will be trying to reproduce both figures as soon as I have the chance (as well as Matt’s suggestion of trying the new-ish filter() operator in Solr).

Our sincere thanks to Matt for his valuable input.

Figure 1: Solr search times, no indexing, original vs Matt

Figure 2: ES search times, no indexing, original vs Matt

Figure 3: ES vs Solr search times, no indexing, Matt’s changes

Figure 4: QPS, original vs Matt’s changes

Elasticsearch vs. Solr performance: round 2.1
http://www.flax.co.uk/blog/2015/12/11/elasticsearch-vs-solr-performance-round-2-1/
Fri, 11 Dec 2015 16:01:54 +0000

Last week’s post on my performance comparison tests stimulated quite a lot of discussion on the blog and Twitter, not least about the large disparity in index sizes (and many thanks to everyone who contributed to this!) The Elasticsearch index was apparently nearly twice the size of the Solr index (the performance was also roughly double). In the end, it seems that the most likely reason for the apparent size difference was that I caught ES in the middle of a Lucene merge operation (this was suggested by my colleague @romseygeek). Unfortunately I’d deleted the EBS volume by this point, so it was impossible to confirm.

I thought it might be interesting to compare the performance of just a single shard on a single node, with 2M documents. DocValues were enabled for the non-analysed Solr field. The index sizes were indeed close to identical, at 3.6GB (and the indexing process took 18m47s for ES, 18m05s for Solr). The mean search times were also the same, at 0.06s with concurrent indexing and 0.04s without. However, the distribution of search times was quite different (see figures 1 and 2). ES had a flatter distribution, which meant that 99% of searches took less than 0.08s compared with 0.15s for Solr (without concurrent indexing).

The most dramatic difference was in QPS. As in the previous study, ES supported approximately twice the QPS of Solr (see figure 3). So, although the change in index size and node topology has altered the results, the performance advantage still seems to be with ES.

I’m not sure there is much point continuing these tests, as they don’t have particularly significant implications for real world applications. It was quite fun to do, though, and it’s clear that ES has greatly improved its performance relative to Solr compared with our 2014 study.

Figure 1: Search speed, no indexing load

Figure 2: Search speed, with indexing load

Figure 3: QPS, with and without concurrent indexing

Elasticsearch vs. Solr performance: round 2
http://www.flax.co.uk/blog/2015/12/02/elasticsearch-vs-solr-performance-round-2/
Wed, 02 Dec 2015 18:32:15 +0000

About a year ago we carried out some performance comparison tests of Solr (version 4.10) and Elasticsearch (version 1.2) and presented our results at search meetups. Our conclusion was that there was not a great deal of difference. Both search engines had more than adequate performance for the vast majority of applications, although Solr performed rather better with complex filter queries and supported a significantly higher QPS.

Since then, both Solr and Elasticsearch have released new major versions, so it seemed like a good time to do another comparison. This time, under our present test conditions, the results were reversed. Elasticsearch 2.0.0 was substantially faster than Solr 5.3.1 for searching, and could maintain more than twice the QPS under indexing load. Solr, on the other hand, supported much faster indexing and used less disk space for its indexes.

For various reasons, it was not practical to exactly duplicate the original tests. We used a single Amazon EC2 r3.4xlarge instance backed by 400GB of EBS storage. Both the Elasticsearch or Solr nodes and the test scripts were run on the same machine. The configuration of both search engines was as follows:

  • 4 nodes
  • 4 index shards
  • no replication
  • 16GB per node

Since the EC2 instance had 122GB of memory, this left 58GB for disk cache (minus whatever was used by the OS and the test scripts). The built-in instance of ZooKeeper was used for managing SolrCloud.

We created 20 million random documents, using (as before) a Markov chain trained on a document on the philosophy of Stoicism downloaded from gutenberg.org. Random integers were also generated for use as filters and facets. In both search engines, the text was indexed but not stored. Other fields were both indexed and stored (in Elasticsearch, _source and _all were disabled).
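
For reference, disabling _source and _all in an Elasticsearch 2.x mapping looks something like the following (the index, type and field names are illustrative, not the actual test schema):

PUT /testindex
{
  "mappings": {
    "doc": {
      "_source": { "enabled": false },
      "_all": { "enabled": false },
      "properties": {
        "text": { "type": "string", "store": false },
        "filter1": { "type": "integer", "store": true }
      }
    }
  }
}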

We indexed the documents into Elasticsearch or Solr using two concurrent instances of a Python script, with a batch size of 1000 documents. Solr was configured to do a soft commit every 1s, to be consistent with the default Elasticsearch behaviour. The elapsed indexing time for Solr was 66m 52s, while Elasticsearch took more than twice as long, at 142m 2s. The total index sizes were 38GB and 79GB respectively (update: see comments, this may be a mistake).

After indexing, we carried out search time tests under conditions of indexing load (2 processes as before) or no load. We also performed QPS tests in the loaded condition. In all cases, queries were composed of 3 OR terms, with three filters, and facets generated from the numeric terms. 5000 searches were run before all tests, to warm caches.
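
As an illustration only (the actual queries were generated randomly by the test scripts, and the field names below are invented), each search had roughly the following shape in Elasticsearch’s query DSL, with the equivalent parameters used for Solr:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "text": "virtue" } },
        { "match": { "text": "reason" } },
        { "match": { "text": "nature" } }
      ],
      "minimum_should_match": 1,
      "filter": [
        { "term": { "num1": 7 } },
        { "term": { "num2": 42 } },
        { "term": { "num3": 3 } }
      ]
    }
  },
  "aggs": {
    "num1_facet": { "terms": { "field": "num1" } },
    "num2_facet": { "terms": { "field": "num2" } }
  }
}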

With no concurrent indexing load, Elasticsearch had a mean search time of 0.10s, with 99% of searches under 0.22s. For Solr, this was 0.12s and 0.54s (see figure 1). With an indexing load, Elasticsearch had a mean search time of 0.14s, with 99% below 0.34s. The same figures for Solr were 0.24s and 0.68s (figure 2). QPS tests were carried out with 1, 2, 4, 8, 16 and 32 concurrent search clients. Elasticsearch approached a maximum of 30 QPS, while Solr approached 15 (figure 3).

Figure 1: search time, no indexing load

Figure 2: search time, with indexing load

Figure 3: QPS with indexing load

Thus, under indexing load, Elasticsearch appeared to have approximately twice the search performance of Solr. It wasn’t clear why this might be the case. One idea which occurred to us was that the query execution changes announced for Elasticsearch 2 might be responsible. To test this, we also compared Elasticsearch 1.7.3 against the current version. The older version was slightly slower (99% of searches under 0.50s as opposed to 0.34s, figure 4) but this was a smaller difference than for Solr. The QPS test was inconclusive (figure 5).

Figure 4: Search time, ES 1.7 vs 2.0

Figure 5: QPS, ES 1.7 vs 2.0

These results must be interpreted with caution. The first caveat is that we only tested a narrow range of either engine’s functionality, in one specific configuration. Other functional areas may perform very differently. Second, the runtime environment was fairly unrealistic. In practice, network latency and bandwidth are likely to have an effect on performance. Third, both engines were used more or less “out of the box”, with minimal effort put into tuning the performance of either.

There are also many factors other than raw performance to be taken into consideration when choosing a search engine. We are not saying that either choice is “better” than the other in all circumstances. However, if search performance is a critical factor in your system design, then it would pay to try both Solr and Elasticsearch, to see which would work better within your parameters.

The difference in performance in this study is interesting, but the reasons remain unclear, and we’d welcome any suggestions to refine our methodology, or similar studies.

 

Elasticsearch Percolator & Luwak: a performance comparison of streamed search implementations
http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/
Mon, 27 Jul 2015 07:04:25 +0000

Most search applications work by indexing a relatively stable collection of documents and then allowing users to perform ad-hoc searches to retrieve relevant documents. However, in some cases it is useful to turn this model on its head, and match individual documents against a collection of saved queries. I shall refer to this model as “streamed search”.

One example of streamed search is in media monitoring. The monitoring agency’s client’s interests are represented by stored queries. Incoming documents are matched against the stored queries, and hits are returned for further processing before being reported to the client. Another example is financial news monitoring, to predict share price movements.

In both these examples, queries may be extremely complex (in order to improve the accuracy of hits). There may be hundreds of thousands of stored queries, and documents to be matched may arrive at a rate of hundreds or thousands per second. Not surprisingly, streamed search can be a demanding task, and the computing resources required to support it can be a significant expense. There is therefore a need for the software to be as performant and efficient as possible.

The two leading open source streamed search implementations are Elasticsearch Percolator and Luwak. Both depend on the Lucene search engine. As the developers of Luwak, we have an interest in how its performance compares with Percolator. We therefore carried out some preliminary testing.

Ideally, we would have used real media monitoring queries and documents. However, these are typically protected by copyright, and the queries represent a fundamental asset of monitoring companies. In order to make the tests distributable, we chose to use freely downloadable documents from Wikipedia, and to generate random queries. These queries were much simpler in structure than the often deeply nested queries from real applications, but we believe that they still provide a useful comparison.

The tests were carried out on an Amazon EC2 r3.large VM running Ubuntu. We wrote a Python script to download, parse and store random Wikipedia articles, and another to generate random queries from the text. The query generator was designed to be somewhat “realistic”, in that each query should match more than zero documents. For Elasticsearch, we wrote scripts to index queries into the Percolator and then run documents through it. Since Luwak has a Java API (rather than Elasticsearch’s RESTful API), we wrote a minimal Java app to do the same.

10,000 documents were downloaded from Wikipedia, and 100,000 queries generated for each test. We generated four types of query:

  • Boolean with 10 required terms and 2 excluded terms
  • Boolean with 100 required terms and 20 excluded terms
  • 20 required wildcard terms, with a prefix of 4 characters
  • 2-term phrase query with slop of 5
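
In Lucene query syntax, these shapes look something like the following (illustrative examples rather than actual generated queries): a Boolean query with required and excluded terms (the 100-term variant has the same shape, just longer), a wildcard query (only the first few of the 20 terms shown), and a sloppy phrase query.

+marcus +aurelius +stoic +virtue +reason +nature +soul +mind +death +fear -epicurus -pleasure
+phil* +reas* +virt* +natu* +deat*
"virtue reason"~5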

We ran the tests independently, giving Luwak and Elasticsearch a JVM heap size of 8GB, and doing an initial pre-run in order to warm the OS cache (this did not actually have a noticeable effect). For sanity, we checked that each document matched the same queries in both Luwak and Percolator.

The results are shown in the graphs below, where the y-axis represents average documents processed per second.

Results 1

Results 2

Luwak was consistently faster than Percolator, ranging from a factor of 6 (for the phrase query type) to 40 (for the large Boolean queries).

The reason for this is almost certainly due to Luwak’s presearcher. When a query is added to Luwak, the library generates terms to index the query. For each incoming document, a secondary query is constructed and run against the query index, which returns a subset of the entire query set. Each of these is then run against the document in order to generate the final results. The effect of this is to reduce the number of primary queries which have to be executed against the document, often by a considerable factor (at a relatively small cost of executing the secondary query). Percolator does not have this feature, and by default matches every primary query against every document (it would be possible, but not straightforward, for an application to implement a presearch phase in Percolator). Supporting this analysis, when the Luwak presearcher was disabled its performance dropped to about the same level as Percolator.

These results must be treated with a degree of caution, for several reasons. As already explained, the queries used were randomly generated, and far simpler in structure than typical hand-crafted monitoring queries. Furthermore, the tests were single-threaded and single-sharded, whereas a multithreaded, multi-shard, distributed architecture would be typical for a real-world system. Finally, Elasticsearch Percolator is a service providing a high-level, RESTful API, while Luwak is much lower level, and would require significantly more application-level code to be implemented for a real installation.

However, since both Luwak and Percolator use the same underlying search technology, it is reasonable to conclude that the Luwak presearcher can give it a considerable performance advantage over Percolator.

If you are already using Percolator, should you change? If performance is not a problem now and is unlikely to become a problem, then the effort required is unlikely to be worth it. Luwak is not a drop-in replacement for Percolator. However, if you are planning a data-intensive streaming search system, it would be worth comparing the two. Luwak works well with existing high-performance distributed computing frameworks, which would enable applications using it to scale to very large query sets and document streams.

Our test scripts are available here. We would welcome any attempts to replicate, extend or contest our results.
