Boosts Considered Harmful – adventures with badly configured search

Charlie Hull — Fri, 19 Aug 2016 13:10:10 +0000

During a recent client visit we encountered a common problem in search – over-application of ‘boosts’, which can be used to weight the influence of matches in one particular field. For example, you might sensibly use this to make results that match a query on their title field come higher in search results. However, in this case we saw huge boost values used (numbers in the hundreds) which were probably swamping everything else – and it wasn’t at all clear where the values had come from, be it experimentation or simply wild guesses. As you might expect, the search engine wasn’t performing well.
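To see why a boost in the hundreds can swamp everything else, here is a deliberately simplified sketch. It assumes a toy model in which a document’s score is just the sum of each field’s match score multiplied by that field’s boost; this is not Lucene’s actual scoring formula, and the field scores and boost values are invented for illustration.

```python
def combined_score(field_scores, boosts):
    """Toy model: sum of per-field match scores, each multiplied by its boost.

    Fields without an explicit boost get a neutral boost of 1.0.
    """
    return sum(score * boosts.get(field, 1.0)
               for field, score in field_scores.items())


# Hypothetical per-field match scores for one query against two documents.
DOCS = {
    "weak_title_hit": {"title": 0.2, "body": 0.1},   # barely matches, but in the title
    "strong_body_hit": {"title": 0.0, "body": 3.0},  # matches the body very well
}


def ranking(boosts):
    """Return document names ordered best-first under the given boosts."""
    return sorted(DOCS, key=lambda name: combined_score(DOCS[name], boosts),
                  reverse=True)


print(ranking({"title": 2.0}))    # ['strong_body_hit', 'weak_title_hit']
print(ranking({"title": 300.0}))  # ['weak_title_hit', 'strong_body_hit']
```

With a modest title boost the strongly matching document still wins; with a boost of 300 the faintest title match outranks it, whatever the rest of the scoring says.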

A problem with Solr, Elasticsearch and other search engines is that so many factors can affect the ordering of results – the underlying relevance algorithms, how source data is processed before it is indexed, how queries are parsed, boosts, sorting, fuzzy search, wildcards… It’s very easy to end up with a confusing picture and configuration files full of conflicting settings. Often these settings are left over from example files or previous configurations or experiments, without any real idea of why they were used. There are so many dials to adjust and switches to flick, many of which are unnecessary. The problem is compounded when the search engine is embedded within another system (e.g. a content management platform or e-commerce engine), where it can be hard to see which control panel or file controls the configuration. Generally, this embedding has not been done by those with deep experience of search engines, so the defaults chosen are often wrong.

The balance of relevance versus recency is another setting which is often difficult to get right. At a news site we were asked to bias the order of results heavily in favour of recency (as the saying goes, yesterday’s newspaper is today’s chip wrapper) – the result being, as we had warned, that whatever the query, today’s news would appear highest – even if it wasn’t relevant! Luckily, by working with the client we managed to achieve a sensible balance before the site was launched.
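The trade-off can be sketched with a small example. It assumes an additive blend of a relevance score and a freshness term that halves every few days; the function name, the half-life and the weights are all hypothetical, and this is not the configuration used on the news site described above.

```python
def blended_score(relevance, age_days, recency_weight, half_life_days=7.0):
    """Add a decaying freshness bonus to a relevance score.

    freshness is 1.0 for a document published today and halves every
    half_life_days; recency_weight controls how much it counts.
    """
    freshness = 0.5 ** (age_days / half_life_days)
    return relevance + recency_weight * freshness


# Hypothetical (relevance, age in days) pairs for a single query.
ARTICLES = {
    "todays_unrelated_story": (0.1, 0),
    "last_months_on_topic_story": (5.0, 30),
}


def top_result(recency_weight):
    """Return the name of the best-scoring article for the given weight."""
    return max(ARTICLES,
               key=lambda name: blended_score(*ARTICLES[name], recency_weight))


print(top_result(1.0))    # last_months_on_topic_story
print(top_result(100.0))  # todays_unrelated_story
```

A small recency weight nudges fresh stories upwards without overriding relevance; a very large one means today’s news wins for any query at all.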

Our approach is to strip back the configuration to a very basic one and to build on this, but only with good reason. Take out all the boosts and clever features and see how good the results are with the underlying algorithms (which have been developed based on decades of academic research – so don’t just break them with over-boosting). Create a process of test-based relevancy tuning where you can clearly relate a configuration setting to improving the result of a defined test. Be clear about which part of your system influences a setting and whose responsibility it is to change it, and record the changes in source control.
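A minimal sketch of what test-based relevancy tuning could look like follows. It assumes a small set of hand-made relevance judgements and uses precision at k as the metric; the queries, document IDs and canned result lists are invented, and in practice `search` would be a call to your own engine with a particular configuration.

```python
def precision_at_k(results, relevant, k=3):
    """Fraction of the top k results that are judged relevant."""
    return sum(1 for doc_id in results[:k] if doc_id in relevant) / k


def evaluate(search, judgements, k=3):
    """Mean precision@k over every judged query.

    search is any callable that takes a query string and returns a ranked
    list of document IDs.
    """
    scores = [precision_at_k(search(query), relevant, k)
              for query, relevant in judgements.items()]
    return sum(scores) / len(scores)


# Hypothetical judgements: for each test query, the IDs that should rank well.
judgements = {"solr boosts": {"d1", "d2"}, "recency bias": {"d7"}}

# Canned result lists standing in for two engine configurations.
baseline_results = {"solr boosts": ["d1", "d9", "d2"],
                    "recency bias": ["d7", "d3", "d4"]}
boosted_results = {"solr boosts": ["d9", "d8", "d1"],
                   "recency bias": ["d3", "d4", "d5"]}

baseline = evaluate(baseline_results.get, judgements)
candidate = evaluate(boosted_results.get, judgements)

# Only keep a configuration change that improves the defined test.
keep_change = candidate > baseline
print(round(baseline, 3), round(candidate, 3), keep_change)  # 0.5 0.167 False
```

The point of the design is that every boost or setting has to earn its place by improving a number you can re-run, and the judgements file can live in source control alongside the configuration.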

Boosts are a powerful tool – when used correctly – but you should start by turning them off, as they may well be doing more harm than good. Let us know if you’d like us to help tune your search!

The post Boosts Considered Harmful – adventures with badly configured search appeared first on Flax.

Can you make a contribution to Apache Solr core development?

Charlie Hull — Tue, 26 Apr 2016 11:02:48 +0000

As any regular reader of this blog will be aware, we use almost exclusively open source software on customer projects. To meet their requirements, we often have to extend the functionality of the software (e.g. XJOIN in Solr). As far as possible, with the agreement of the customer, we like to then contribute these changes back to the community. This is one of the great positive-sum strengths of open source: the customer has their problem solved, the community benefits from an improved product, and the developers get paid.

However, these cases usually involve extending the functionality of, e.g., Apache Solr, in a specific way. In contrast, there are several known areas of the Solr codebase which we believe could do with some attention, but which do not directly concern functionality. These areas affect code quality and maintainability, and fixing them would potentially benefit future releases of Solr. There are five areas which we have identified as particularly important. We’d love to know the views of the wider Solr community of course – please do comment! We’ve added at the end of each item some idea of the complexity of the task (in our view).

1. Move all SolrCloud unit tests over to use the SOLR-8758 patch, which will make testing faster and more likely to catch problems – easy, but time-consuming. We have a sponsor! Huge thanks to Invotra.

2. Make DirectoryFactory implementations take a Path at construction time to give them their root filesystem. This is useful for a couple of reasons: it means that tests can use the various Lucene mock filesystems to check for error handling, and it helps prevent OS-specific bugs that can appear when path-resolution logic is scattered around the place. This is described in SOLR-8282 – hard.

3. Factor out HDFS classes into a contrib module. Currently these are bundled with stock Solr, which means that downloads are much larger than they need to be in the vast majority of cases – medium-hard, we’re currently working on this.

4. Improve SpanQuery scoring by adding explicit support for this to the Lucene Similarity classes. This would reduce API complexity and make ranking for complex proximity queries more accurate – hard.

5. Improve the ValueSource API by making it type-safe – not too hard, we’re working on this gradually.

We are currently working on these issues when possible, but client projects take priority for obvious reasons, and there is a risk that this work will continue to slip indefinitely. We would therefore be interested to find out if there are any corporate users of Solr who would like to contribute to the cost of development of any of the above issues. By doing so, you would be improving Solr both for yourself and for the communities of users and developers. And of course, we will give you full credit here, in the Lucene/Solr JIRA system, on social media and at our London Lucene/Solr Meetups.

If you’re interested in helping, please contact us.

The post Can you make a contribution to Apache Solr core development? appeared first on Flax.
