Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others

Charlie Hull — Mon, 15 Feb 2016 11:32:13 +0000

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other.Read about the first day here.

The second day started with Eric Pugh’s second talk on The (Unofficial) State of Elasticsearch, bringing us all up to date on the meteoric rise of this technology and the opportunities it opens up especially in analytics and visualisation. Eric foresees Elastisearch continuing to specialise in this area, with Solr sticking closer to its roots in information retrieval. Giovanni Tumarello followed with a fast-paced demonstration of Kibi, a platform built on Elasticsearch and Kibana. Kibi allows one to very quickly join, visualise and explore different data sets and I was impressed with the range of potential applications including in the life sciences.

Evan Bolton of the US-based NCBI was next, talking about the massive PubChem dataset (80 million unique chemical structures, 200 million chemical substance descriptions, and 230 million biological activities, all heavily crosslinked). Although both Solr and CLucene had been considered, they eventually settled on the Sphinx engine with its great support for SQL queries and JOINs, although Evan admitted this was not a cloud-friendly solution. His team are now considering knowledge graphs and how to present up to 100 billion RDF triples. Andrea Pierleoni of the Centre for Therapeutic Target Validation then talked about an Elasticsearch cluster he has developed to index ‘evidence strings’ (which relate targets to diseases using evidence). This is a relatively small collection of 2.1 million association objects, pre-processed using Python and stored in Redis before indexing.

Next up was Nikos Marinos from the EBI Literature Services team talking about their recent migration from Lucene to Solr. As he explained most of this was a straightforward task, with one wrinkle being the use of DIH Transformers where array data was used. Rafael Jimenez then talked about projects he has worked on using both Elasticsearch and Solr, and stressed the importance of adhering to open standards and re-use of software where possible – key strengths of open source of course. Michal Nowotka then talked about a proposed system to replace the current ChEMBL search using Solr and django-haystack (the latter allows one to use a variety of underlying search engines from Django). Finally, Nicola Buso talked about EBISearch, based on Lucene.

We then concluded with another hands-on session, more aimed at Elasticsearch this time. As you can probably tell we had been shown a huge variety of different search needs and solutions using a range of technologies over the two days and it was clear to me that the BioSolr project is only a small first step towards improving the software available – we have applied for further funding and we hope to have good news soon! Working with life science data, often at significant scale, has been fascinating.

Most of the presentations are now available for download. Thanks to all the presenters (especially those who travelled from abroad), the EBI for kindly hosting the event and in particular to Dr Sameer Velankar who has been the driving force behind this project.

The post Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others appeared first on Flax.

Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr

Charlie Hull — Wed, 10 Feb 2016 10:26:00 +0000

The day started with a quick recap of the project from myself and Dr. Sameer Valenkar of the EBI. Eric Pugh, founder of Flax’s US partners Open Source Connections, followed with his Unofficial State of Solr, detailing the history of the project, recent innovations and what might happen in the future, including some very interesting new features allowing for parallel SQL queries. We then heard from Flax team members Tom Winch and Matt Pearce on how they have built faceting improvements, a new XJoin between Solr and external systems, researched federated search and developed ontology indexers (note that all of the software they’ve built is available as open source, and Tom has recently written extensively about XJoin).

After lunch we heard from Peter Meric of the NCBI (the US equivalent of the EBI) on a Solr-based system for searching gene data, to supplement the NCBI’s homegrown Entrez system. This is very much a filtered search rather than a text search and indexes around 330m records. He also talked about a High Availability prototype of a replacement for the very high traffic PubMed service built on Amazon Web Services. Each Solr, MongoDB or Zookeeper node ‘announces’ itself using a monitor service and then replicates data from a master node. Although it is not yet available as open source I think this project may be of great interest to the wider Solr community and I hope we hear more of it soon.

Next up was a brief talk by Dan Bolser of the EBI on an ‘old school’ scheme for sharding plant phenotype data – I’d seen part of this presentation before and it’s linked to our own ideas on federating search across bioinformatics data. Dan was followed by Lewis Geer of NCBI talking about the SEQR protein similarity search engine built on Solr. Although somewhat complex for us non-biologists to understand, this very clever system relies on experimental results to suggest which of the possible variants of a protein system are likely, and adds these to the Solr index – it reminded me of a similar approach we’ve used to store possible OCR errors when working with scanned newsprint. His team’s code is available. Dan Stainer of the Ensembl project was next discussing how his team are indexing tens of thousands of genomes from thousands of species, currently on a MySQL backend with a REST API and a lot of Perl. He discussed how they have been experimenting with Elasticsearch to index around 3.2bn items, creating a 782GB index which builds in around 5-6 hours, to provide new capabilities such as structured queries for their genome browser tools.

We then held an interactive hands-on session, covering subjects such as ‘getting started with Solr’ and exploring some of the code we’ve built such as XJoin, followed by a conference dinner in Hinxton Hall. It was clear that there is a huge range of use cases for search technology in the life sciences community and almost as many different ways to address them, and the after-dinner conversation was lively and highly interesting!

Most of the presentations are now available for download and we’ve also written about the second day of the event, where we shifted focus onto Elasticsearch and other technologies.

The post Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr appeared first on Flax.

biology – Flax

Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others

Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr