Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other. Read about the first day here.
The second day started with Eric Pugh’s second talk on The (Unofficial) State of Elasticsearch, bringing us all up to date on the meteoric rise of this technology and the opportunities it opens up, especially in analytics and visualisation. Eric foresees Elasticsearch continuing to specialise in this area, with Solr sticking closer to its roots in information retrieval. Giovanni Tumarello followed with a fast-paced demonstration of Kibi, a platform built on Elasticsearch and Kibana. Kibi allows one to very quickly join, visualise and explore different data sets, and I was impressed with the range of potential applications, including in the life sciences.
Evan Bolton of the US-based NCBI was next, talking about the massive PubChem dataset (80 million unique chemical structures, 200 million chemical substance descriptions, and 230 million biological activities, all heavily crosslinked). Both Solr and CLucene had been considered, but his team eventually settled on the Sphinx engine with its great support for SQL queries and JOINs, although Evan admitted this was not a cloud-friendly solution. His team are now considering knowledge graphs and how to present up to 100 billion RDF triples. Andrea Pierleoni of the Centre for Therapeutic Target Validation then talked about an Elasticsearch cluster he has developed to index ‘evidence strings’ (which relate targets to diseases using evidence). This is a relatively small collection of 2.1 million association objects, pre-processed using Python and stored in Redis before indexing.
Next up was Nikos Marinos from the EBI Literature Services team, talking about their recent migration from Lucene to Solr. As he explained, most of this was a straightforward task, with one wrinkle being the use of DIH Transformers where array data was used. Rafael Jimenez then talked about projects he has worked on using both Elasticsearch and Solr, and stressed the importance of adhering to open standards and re-using software where possible, key strengths of open source of course. Michal Nowotka then talked about a proposed system to replace the current ChEMBL search using Solr and django-haystack (the latter allows one to use a variety of underlying search engines from Django). Finally, Nicola Buso talked about EBISearch, based on Lucene.
We then concluded with another hands-on session, aimed more at Elasticsearch this time. As you can probably tell, we had been shown a huge variety of different search needs and solutions using a range of technologies over the two days, and it was clear to me that the BioSolr project is only a small first step towards improving the software available. We have applied for further funding and we hope to have good news soon! Working with life science data, often at significant scale, has been fascinating.
Most of the presentations are now available for download. Thanks to all the presenters (especially those who travelled from abroad), to the EBI for kindly hosting the event, and in particular to Dr Sameer Velankar, who has been the driving force behind this project.