As any regular reader of this blog will be aware, we use open source software almost exclusively on customer projects. To meet their requirements, we often have to extend the functionality of the software (e.g. XJOIN in Solr). As far as possible, and with the agreement of the customer, we then like to contribute these changes back to the community. This is one of the great positive-sum strengths of open source: the customer has their problem solved, the community benefits from an improved product, and the developers get paid.
However, these cases usually involve extending the functionality of a product such as Apache Solr in a specific way. In contrast, there are several known areas of the Solr codebase which we believe could do with some attention, but which do not directly concern functionality. These areas affect code quality and maintainability, and fixing them would potentially benefit future releases of Solr. There are five areas which we have identified as particularly important. We’d love to know the views of the wider Solr community, of course, so please do comment! At the end of each item we’ve added some idea of the complexity of the task (in our view).
1. Move all SolrCloud unit tests over to use the SOLR-8758 patch, which will make testing faster and more likely to catch problems – easy, but time-consuming. We have a sponsor! Huge thanks to Invotra.
2. Make DirectoryFactory implementations take a Path at construction time to give them their root filesystem. This is useful for a couple of reasons: it means that tests can use the various Lucene mock filesystems to check for error handling, and it helps prevent OS-specific bugs that can appear when path-resolution logic is scattered around the place. This is described in SOLR-8282 – hard.
3. Factor out HDFS classes into a contrib module. Currently these are bundled with stock Solr, which means that downloads are much larger than they need to be in the vast majority of cases – medium-hard; we’re currently working on this.
4. Improve SpanQuery scoring by adding explicit support for this to the Lucene Similarity classes. This would reduce API complexity and make ranking for complex proximity queries more accurate – hard.
5. Improve the ValueSource API by making it type-safe – not too hard; we’re working on this gradually.
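To make the idea behind item 2 a little more concrete, here is a minimal sketch of a factory that is handed its root as a Path at construction time and resolves everything against it. The class and method names (PathRootedFactorySketch, RootedDirectoryFactory, resolveIndexDir, createIndexDir) are hypothetical and chosen purely for illustration; this is not Solr's DirectoryFactory API, nor the patch attached to SOLR-8282. It uses only java.nio.file from the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PathRootedFactorySketch {

    /** Hypothetical stand-in for a directory factory; NOT Solr's actual API. */
    static final class RootedDirectoryFactory {
        private final Path root;

        // The root is injected once, so a test could pass in a Path that
        // belongs to a mock filesystem instead of the real disk.
        RootedDirectoryFactory(Path root) {
            this.root = root.toAbsolutePath().normalize();
        }

        // All path resolution happens here, in one place, using the
        // Path API rather than string concatenation with separators.
        Path resolveIndexDir(String relative) {
            Path resolved = root.resolve(relative).normalize();
            if (!resolved.startsWith(root)) {
                throw new IllegalArgumentException(
                    "Path escapes factory root: " + relative);
            }
            return resolved;
        }

        // Creates the directory (and any missing parents) under the root.
        Path createIndexDir(String relative) throws IOException {
            return Files.createDirectories(resolveIndexDir(relative));
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory plays the role of the configured root.
        Path tmp = Files.createTempDirectory("factory-sketch");
        RootedDirectoryFactory factory = new RootedDirectoryFactory(tmp);

        Path indexDir = factory.createIndexDir("core1/data/index");
        System.out.println("Created: " + indexDir);

        try {
            factory.resolveIndexDir("../outside");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because the factory never builds paths from strings or calls out to a default filesystem itself, the same code should behave consistently whichever filesystem the root Path comes from, which is the property that makes mock-filesystem testing possible.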
We are currently working on these issues when possible, but client projects take priority for obvious reasons, and there is a risk that this work will continue to slip indefinitely. We would therefore be interested to find out whether there are any corporate users of Solr who would like to contribute to the cost of developing any of the above. By doing so, you would be improving Solr both for yourself and for the communities of users and developers. And of course, we will give you full credit here, in the Lucene/Solr JIRA system, on social media and at our London Lucene/Solr Meetups.
If you’re interested in helping, please contact us.