bioinformatics – Flax

Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others

Charlie Hull — Mon, 15 Feb 2016 11:32:13 +0000

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other. Read about the first day here.

The second day started with Eric Pugh’s second talk on The (Unofficial) State of Elasticsearch, bringing us all up to date on the meteoric rise of this technology and the opportunities it opens up, especially in analytics and visualisation. Eric foresees Elasticsearch continuing to specialise in this area, with Solr sticking closer to its roots in information retrieval. Giovanni Tumarello followed with a fast-paced demonstration of Kibi, a platform built on Elasticsearch and Kibana. Kibi allows one to very quickly join, visualise and explore different data sets and I was impressed with the range of potential applications including in the life sciences.

Evan Bolton of the US-based NCBI was next, talking about the massive PubChem dataset (80 million unique chemical structures, 200 million chemical substance descriptions, and 230 million biological activities, all heavily crosslinked). Although both Solr and CLucene had been considered, they eventually settled on the Sphinx engine with its great support for SQL queries and JOINs, although Evan admitted this was not a cloud-friendly solution. His team are now considering knowledge graphs and how to present up to 100 billion RDF triples. Andrea Pierleoni of the Centre for Therapeutic Target Validation then talked about an Elasticsearch cluster he has developed to index ‘evidence strings’ (which relate targets to diseases using evidence). This is a relatively small collection of 2.1 million association objects, pre-processed using Python and stored in Redis before indexing.

Next up was Nikos Marinos from the EBI Literature Services team talking about their recent migration from Lucene to Solr. As he explained, most of this was a straightforward task, with one wrinkle being the use of DIH Transformers where array data was used. Rafael Jimenez then talked about projects he has worked on using both Elasticsearch and Solr, and stressed the importance of adhering to open standards and re-use of software where possible – key strengths of open source of course. Michal Nowotka then talked about a proposed system to replace the current ChEMBL search using Solr and django-haystack (the latter allows one to use a variety of underlying search engines from Django). Finally, Nicola Buso talked about EBISearch, based on Lucene.

We then concluded with another hands-on session, more aimed at Elasticsearch this time. As you can probably tell we had been shown a huge variety of different search needs and solutions using a range of technologies over the two days and it was clear to me that the BioSolr project is only a small first step towards improving the software available – we have applied for further funding and we hope to have good news soon! Working with life science data, often at significant scale, has been fascinating.

Most of the presentations are now available for download. Thanks to all the presenters (especially those who travelled from abroad), the EBI for kindly hosting the event and in particular to Dr Sameer Velankar who has been the driving force behind this project.

The post Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others appeared first on Flax.

Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr

Charlie Hull — Wed, 10 Feb 2016 10:26:00 +0000

The day started with a quick recap of the project from myself and Dr Sameer Velankar of the EBI. Eric Pugh, founder of Flax’s US partners Open Source Connections, followed with his Unofficial State of Solr, detailing the history of the project, recent innovations and what might happen in the future, including some very interesting new features allowing for parallel SQL queries. We then heard from Flax team members Tom Winch and Matt Pearce on how they have built faceting improvements, a new XJoin between Solr and external systems, researched federated search and developed ontology indexers (note that all of the software they’ve built is available as open source, and Tom has recently written extensively about XJoin).

After lunch we heard from Peter Meric of the NCBI (the US equivalent of the EBI) on a Solr-based system for searching gene data, to supplement the NCBI’s homegrown Entrez system. This is very much a filtered search rather than a text search and indexes around 330m records. He also talked about a High Availability prototype of a replacement for the very high traffic PubMed service built on Amazon Web Services. Each Solr, MongoDB or Zookeeper node ‘announces’ itself using a monitor service and then replicates data from a master node. Although it is not yet available as open source I think this project may be of great interest to the wider Solr community and I hope we hear more of it soon.

Next up was a brief talk by Dan Bolser of the EBI on an ‘old school’ scheme for sharding plant phenotype data – I’d seen part of this presentation before and it’s linked to our own ideas on federating search across bioinformatics data. Dan was followed by Lewis Geer of NCBI talking about the SEQR protein similarity search engine built on Solr. Although somewhat complex for us non-biologists to understand, this very clever system relies on experimental results to suggest which of the possible variants of a protein system are likely, and adds these to the Solr index – it reminded me of a similar approach we’ve used to store possible OCR errors when working with scanned newsprint. His team’s code is available. Dan Stainer of the Ensembl project was next discussing how his team are indexing tens of thousands of genomes from thousands of species, currently on a MySQL backend with a REST API and a lot of Perl. He discussed how they have been experimenting with Elasticsearch to index around 3.2bn items, creating a 782GB index which builds in around 5-6 hours, to provide new capabilities such as structured queries for their genome browser tools.

We then held an interactive hands-on session, covering subjects such as ‘getting started with Solr’ and exploring some of the code we’ve built such as XJoin, followed by a conference dinner in Hinxton Hall. It was clear that there is a huge range of use cases for search technology in the life sciences community and almost as many different ways to address them, and the after-dinner conversation was lively and highly interesting!

Most of the presentations are now available for download and we’ve also written about the second day of the event, where we shifted focus onto Elasticsearch and other technologies.


The fun and frustration of writing a plugin for Elasticsearch for ontology indexing

Matt Pearce — Wed, 27 Jan 2016 10:15:11 +0000

As part of our work on the BioSolr project, I have been continuing to work on the various Elasticsearch ontology annotation plugins (note that even though the project started with a focus on Solr – thus the name – we have also been developing some features for Elasticsearch). These are now largely working, with some quirks which will be mentioned below (they may not even be quirks, but they seem non-intuitive to me, so deserve a mention). It’s been a slightly painful process, as you may infer from the use of italics below, and we hope this post will illustrate some of the differences between writing plugins for Solr and Elasticsearch.

It’s probably worth noting that at least some of this write-up is speculative. I’m not privy to the internals of Elasticsearch, and have been building the plugin through a combination of looking at the Elasticsearch source code (as advised by the documentation) and running the same integration test over and over again for each of the various versions, and checking what was returned in the search response. There is very little in the way of documentation, and the 1.x versions of Elasticsearch have almost no comments or Javadoc in the code. It has been interesting and fun, and not at all exasperating or frustrating.

The code

The plugin code can be broken down into three broad sections:

- A core module, containing code shared between the Elasticsearch and Solr versions of the plugin. Anything in this module should be search engine agnostic, and is dedicated to accessing and pulling data from ontologies, either via the OLS service (provided by the European Bioinformatics Institute, our partners in the BioSolr project) or more generally OWL files, and returning a structure which can be used by the plugins.
- The es-ontology-annotator-core module, which is shared between all versions of the plugin, and contains Elasticsearch-specific code to build the helper classes required to access the ontology data.
- The es-ontology-annotator-esx.x modules, which are specific to the various versions of Elasticsearch. So far, there are six of these (one of the more challenging aspects of this work has been that the Elasticsearch mapper structure has been evolving through the versions, as has some of the internal infrastructure supporting them):
  - 1.3 – for ES 1.3
  - 1.4 – for ES 1.4
  - 1.5 – for ES 1.5 – 1.7
  - 2.0 – for ES 2.0
  - 2.1 – for ES 2.1.1
  - 2.2 – for ES 2.2

I haven’t tried the plugin with any versions of ES earlier than 1.3. There was a change to the internal mapping classes between 1.4 and 1.5 (UpdateInPlaceHashMap was removed and replaced with CopyOnWriteHashMap), presumably for a Very Good Reason. Versions since 1.5 seem to be forward compatible with later 1.x versions.
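The version list above can be read as a simple lookup. The sketch below is only an illustration: the helper name is made up, it assumes the module suffix follows the es-ontology-annotator-esX.Y pattern, and it matches on major.minor only (the 2.1 module was built against ES 2.1.1 specifically, so other 2.1.x releases are an untested assumption).

```python
# Hypothetical helper, not part of the plugin: pick the plugin module
# for a given Elasticsearch version, mirroring the list in this post.
MODULE_FOR_MINOR = {
    (1, 3): "1.3",
    (1, 4): "1.4",
    (1, 5): "1.5",
    (1, 6): "1.5",  # the 1.5 module is used for ES 1.5 - 1.7
    (1, 7): "1.5",
    (2, 0): "2.0",
    (2, 1): "2.1",
    (2, 2): "2.2",
}

def plugin_module(es_version):
    """Return the module name for a version string such as '1.6.2'."""
    major, minor = (int(part) for part in es_version.split(".")[:2])
    try:
        suffix = MODULE_FOR_MINOR[(major, minor)]
    except KeyError:
        raise ValueError("no plugin module listed for ES " + es_version)
    return "es-ontology-annotator-es" + suffix

print(plugin_module("1.6.2"))
```

Versions outside the list raise an error rather than falling back to a neighbouring module, since the internal mapper classes changed between releases.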

The quirks

All of the versions of the plugin work in the same way. You specify in your mapping that a particular field has the type “ontology”. There are various additional properties that can be set, depending on whether you’re using an OWL file or OLS as your ontology data source (specified in the README). When the data is indexed, any information in that field is assumed to be an IRI referring to an ontology record, and will be used to fetch as much data as required/possible for that ontology record. The data will then be added as sub-fields to the ontology fields.

The new data is not added to the _source field, which is the easy way of seeing what data is in a stored record. In order to retrieve the new data, you have two options:

- Grab the mapping for your index, and look through it for the sub-fields of your annotation field. Use as many of these as you need to populate the fields property in your search request, making sure you name them fully (i.e. annotation.uri, annotation.label, annotation.child_uris).
- Add all of the fields to the fields property in your search request (i.e. "fields": [ "*" ]).

What you cannot do is add “annotation.*” to your search request to get all of the annotation subfields. At this stage, this doesn’t work. I’m still working out whether this is possible or not.
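As a rough illustration of the first option, the request body can be assembled from the fully qualified sub-field names. This is a minimal sketch: the helper name is hypothetical, the field is assumed to be called annotation, the sub-field names are the ones used as examples above, and the match_all query is just a placeholder.

```python
import json

def search_request(query, annotation_field, subfields):
    """Build a search body that names each annotation sub-field in full."""
    return {
        "query": query,
        # "annotation.*" does not work here, so every sub-field is spelled out
        "fields": [annotation_field + "." + name for name in subfields],
    }

body = search_request(
    {"match_all": {}},                 # placeholder query
    "annotation",                      # assumed name of the ontology field
    ["uri", "label", "child_uris"],    # sub-field names taken from the mapping
)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of a search request; the sub-field list would normally come from inspecting the mapping for your index.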

How it works

All of the versions work in a broadly similar fashion: the OntologyMapper class extends AbstractFieldMapper (Elasticsearch 1.x) or FieldMapper (Elasticsearch 2.x). The Mapper classes all have two internal classes:

- a TypeParser, which reads the mapper’s configuration from the mapping details (as initially specified by the user, and as also returned from the Mapper.toXContent method), and returns…
- a Builder, which constructs the mappers for the known sub-fields and ultimately builds the Mapper class. The sub-field mappers are all for string fields, with mappers for URI fields having tokenisation disabled, while the other fields have it enabled. All are both indexed and stored.

The Mapper parses the content of the initial field (the IRI for the ontology record), and adds the sub-fields to the record, as part of the Mapper.parse method call (this is the most significant part of the Mapper code). There are at least two ways of doing this, and the Elasticsearch source code has both depending on which Mapper class you look at. There is no indication in the source why you would use one method over the other. This helps with clarity, especially when things aren’t working as they should.

What makes life more interesting for the OntologyMapper class is that not all of the sub-fields are known at start time. If the user wishes to index additional relationships between nodes (“participates in”, “has disease location”, etc.), these are generated on the fly, and the sub-fields need to be added to the mapping. Figuring out how to do this, and also how to make sure those fields are returned when the user requests the mapping for the index, has been a particular challenge.

The TypeParser is called more than once during the indexing process. My initial assumption was that once the mapping details had been read from the user’s specification, the parser was “fixed,” and so you had to keep track of the sub-field mappers yourself. This is not the case. As noted above, the TypeParser can also be fed from the Mapper’s toXContent method (which generates the mapping seen when you call the _mapping endpoint). Elasticsearch versions 1.x didn’t seem to care particularly what toXContent returned, so long as it could be parsed without throwing a NullPointerException, but Elasticsearch versions 2.x actually check that all of the mapping configuration has been dealt with. This actually makes life easier internally – after the mapper has processed a record, at least some of the dynamic field mappings are known, so you can build the sub-field mappers in the Builder rather than having to build them on the fly during the Mapper.parse process.

The other non-trivial Mapper methods are:

- toXContent, as mentioned several times already. This generates the mapping output (i.e. the definition of the field as seen when you look via the _mapping endpoint).
- merge, which seems to do a compatibility check between an incoming instance of the mapper and the current instance. I’ve added some checks to this, but no significant code. Several of the implementations of this method in the Elasticsearch source code simply contain comments to the effect of “will return to this later”, so it seems I’m not the only person who doesn’t understand how merge works, or why it is called.
- traverse (Elasticsearch 1.x) and iterator (Elasticsearch 2.x), which seem to do similar things – namely providing a means to iterate through the sub-field mappers. In Elasticsearch 1.x, the traverse method is explicitly called as part of the process to add the new (dynamic) mappers to the mapping, but this isn’t a requirement for Elasticsearch 2.x. Elasticsearch 1.x distinguished between ObjectMappers and FieldMappers, which doesn’t seem to be a distinction in Elasticsearch 2.x.

Comparisons with the Solr plugin

The Solr plugin works somewhat differently to the Elasticsearch one. The Solr plugin is implemented as an UpdateRequestProcessor, and adds new fields directly to the incoming record (it doesn’t add sub-fields). This makes the returned data less tidy, but also easier to handle, since all of the new fields have the same prefix and can therefore be handled directly. You don’t need to explicitly tell Solr to return the new fields – because they are all stored, they are all returned by default.

On the other hand, you still have to jump through some hoops to work out which fields are dynamically generated, if you need to do that (i.e. to add checkboxes to a form to search “has disease location” or other relationships) – you need to call Solr to retrieve the schema, and use that as the basis for working out which are the new fields. For Elasticsearch, you have to request the mapping for your index, and use that in a similar way.
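Once you have the list of field descriptions back from Solr, picking out the generated ones is mostly a matter of filtering on the shared prefix. The sketch below assumes a list of dictionaries with a name key and an annotation_ prefix; the helper name and the field names are made up for illustration and are not taken from the plugin.

```python
def generated_fields(schema_fields, prefix, known=()):
    """Return the names that start with prefix and are not already known."""
    names = [field["name"] for field in schema_fields]
    return sorted(n for n in names if n.startswith(prefix) and n not in known)

# Hypothetical field descriptions, as they might come back from Solr
schema_fields = [
    {"name": "id"},
    {"name": "annotation_label"},
    {"name": "annotation_has_disease_location"},
    {"name": "annotation_participates_in"},
]

# Fields we already know about are excluded, leaving the dynamic relationships
dynamic = generated_fields(schema_fields, "annotation_",
                           known={"annotation_label"})
print(dynamic)
```

The same filtering idea applies to Elasticsearch, with the field names read from the mapping for the index instead.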

Configuration in Solr requires modifying the solrconfig.xml, once the plugin JAR file is in place, but doesn’t require any changes to the schema. All of the Elasticsearch configuration happens in the mapping definition. This reflects the different ways the plugin is implemented for the two engines: as an update processor in Solr, and as a field type in Elasticsearch. I don’t have a particular feeling for whether it would have been better to implement the Solr plugin as a new field type – I did investigate, and it seemed much harder to do this, but it might be worth re-visiting if there is time available.

The Solr plugin was much easier to write, simply because the documentation is better. The Solr wiki has a very useful base page for writing a new UpdateRequestProcessor, and the source code has plenty of comments and Javadoc (although it’s not perfect in this respect – SolrCoreAware has no documentation at all, has been present since Solr 1.3, and was a requirement for keeping track of the Ontology helper threads).

I will most likely update this post as I become aware of things I have done which are wrong, or any misinformation it contains. We’ll also be talking further about the BioSolr project at a workshop event on February 3rd/4th 2016. We welcome feedback and comments, of course – especially from the wider Elasticsearch developer community.


XJoin for Solr, part 1: filtering using price discount data

Tom Winch — Mon, 25 Jan 2016 10:04:28 +0000

In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I’ll show how we can use this to solve a common problem in e-commerce – how to use price discount data, provided by an external web API, to either filter the results of a product search or boost scores. A further post will show another example, using click-through data to influence the score of subsequent searches.

What is XJoin?

The XJoin component can be used when you want values from some source external to Solr to filter or influence the score of hits in your Solr result set. It is currently available as a Solr patch on the XJoin JIRA ticket SOLR-7341, so to use it, you’ll need to check out a version of Apache Lucene/Solr using Subversion, then patch and build it (see below for details).

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.) We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

Patching SOLR

I’m going to be using Solr version 5.3 for this blog. If you’re following along, check out a clean copy using Subversion:

$ svn co https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3

Download the XJoin patch (find the one corresponding to this version of Solr on the JIRA ticket) into the newly checked-out directory, and apply it:

lucene_solr_5_3$ svn patch SOLR-7341.patch-5_3

And then build Solr from the solr sub-directory:

lucene_solr_5_3/solr$ ant server

We should now be able to start the patched Solr server:

lucene_solr_5_3/solr$ bin/solr start

Indexing a sample product data set

I’ll be using a sample Google product feed, GoogleProducts.csv, which I got from here. Create a new directory called blog (mine has the same parent as my Solr check-out) and download the sample into it. It’s in CSV format, with columns for product id, name, description, manufacturer and price. Indexing this will be a piece of cake!

We’ll begin with a copy of the sample Solr config directory:

blog$ cp -r ../lucene_solr_5_3/solr/server/solr/configsets/basic_configs/conf .

Modify conf/schema.xml so that our Solr documents have fields corresponding to those in the CSV file:

Naturally, the product id will serve as the Solr unique key:

<uniqueKey>id</uniqueKey>

We can use the sample solrconfig.xml as is for now. Add a core called products using the Solr core admin UI (as you started a Solr server above, this should be available at http://localhost:8983/solr/#/~cores). The values for instanceDir and dataDir will both be the full path of the blog directory.

I’ll be using Python to index the product data. The code is written for Python 3, and won’t work in Python 2.x because of character encoding issues in the csv module, but you can fix it by using a UTF8Recoder as described in the module documentation. Here’s my indexing script (note that all the code written for this example is also available in the BioSolr GitHub repository):

import sys
import csv
import json
import requests

def value(k, v):
    return k, v.strip() if k != 'price' else float(v.split()[0])

def read(path):
    with open(path, encoding='iso-8859-1') as f:
        reader = csv.DictReader(f)
        for doc in reader:
            yield dict(value(k, v) for k, v in doc.items()
                       if len(v.strip()) > 0)

def index(url, docs):
    print("Sending {0} documents to {1}".format(len(docs), url))
    data = json.dumps(docs)
    headers = { 'content-type': 'application/json' }
    r = requests.post(url, data=data, headers=headers)
    if r.status_code != 200:
        raise IOError("Bad SOLR update")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {0} <url> <csv-file>".format(sys.argv[0]))
        sys.exit(1)

    docs = list(read(sys.argv[2]))
    index(sys.argv[1], docs)

The script tidies up the prices because they aren’t consistently formatted, converting them to float values. Save the script in index.py and use it to index the Google product data into Solr (let’s force commits, just to be sure):

blog$ python3 index.py 'http://localhost:8983/solr/products/update?commit=true' GoogleProducts.csv
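To see what the tidy-up does to a single row, here is a small sketch that reuses the value() function from index.py. The clean() helper and the sample row are made up for illustration; the row is not taken from the product feed.

```python
def value(k, v):
    # Same logic as index.py: strip text fields, and for the price keep
    # only the leading number (dropping anything after the first space)
    return k, v.strip() if k != 'price' else float(v.split()[0])

def clean(row):
    """Apply value() to every non-empty column of a CSV row."""
    return dict(value(k, v) for k, v in row.items()
                if len(v.strip()) > 0)

# Hypothetical row: padded id, empty manufacturer, price with a trailing unit
row = {'id': ' 42 ', 'name': 'example product',
       'manufacturer': '', 'price': '19.99 gbp'}
print(clean(row))
```

Empty columns are dropped altogether rather than being indexed as empty strings, which is why the offers API later checks for the presence of the manufacturer key.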

And, lo and behold, we can see our data in Solr using cURL (I like to pipe the output through jq to get nicely formatted JSON):

curl 'localhost:8983/solr/products/select?wt=json&q=*' | jq .

So, using Solr we’ve now built a full text product search in only a few minutes, with potentially all the add-ons Solr provides out of the box. However, suppose there is supplementary information about the products, available from an external source (which might not be under our control).

I will now demonstrate how to configure Solr so that during a product search, the external source is also queried (either with the same user query or something different) and the resulting external data used to influence the result set. Each external result is ‘joined’ against a Solr document via a ‘join field’ or ‘join id’, which doesn’t have to be the Solr unique id (in the examples below I use the product id and manufacturer as the join fields). To get an ‘inner join’ I will use the XJoinQParserPlugin to turn the external ids into a filter query, but it’s also possible to build boost queries or use the XJoinValueSourceParser to use external values in a boost function. You can see all this implemented below.
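Conceptually, the ‘inner join’ amounts to keeping only those documents whose join field value also appears in the external results. The following is a plain Python sketch of that idea, with made-up sample data and a hypothetical helper name; it is not how XJoin is implemented inside Solr, where the filtering is done by a filter query.

```python
def inner_join(docs, external, join_field):
    """Keep docs whose join_field value occurs in the external results,
    attaching the matching external result to each one."""
    by_id = {result[join_field]: result for result in external}
    joined = []
    for doc in docs:
        match = by_id.get(doc.get(join_field))
        if match is not None:
            joined.append(dict(doc, external=match))
    return joined

# Hypothetical Solr documents and external offers
docs = [
    {"id": "p1", "name": "first product"},
    {"id": "p2", "name": "second product"},
    {"id": "p3", "name": "third product"},
]
offers = [
    {"id": "p3", "discountPct": 20},
    {"id": "p1", "discountPct": 5},
]

for doc in inner_join(docs, offers, "id"):
    print(doc["id"], doc["external"]["discountPct"])
```

The join field is a parameter because it need not be the unique id; joining on manufacturer works the same way, except that several documents can match one external result.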

Product discount offers example

In the first of my examples, I’ll set up filtering and score boosting based on discount offers, the external source for which is going to be a web service, which I’m going to make available locally at http://localhost:8000 (with /products and /manufacturers endpoints). Again, I’ll implement this in Python, using the popular Flask web server micro-framework and the module requests. Install both of these using pip (I need sudo, but you might not):

blog$ sudo pip install flask requests

Creating the external source

Here’s my code for the product offers web API:

from flask import Flask
from index import read
import json
import random
import sys

app = Flask(__name__)

@app.route('/')
def main():
    return json.dumps({ 'info': 'product offers API' })

@app.route('/products')
def products():
    offer = lambda doc: {
                'id': doc['id'],
                'discountPct': random.randint(1, 80)
            }
    return json.dumps([offer(doc) for doc
                       in random.sample(app.docs, 64)])

@app.route('/manufacturers')
def manufacturer():
  manufacturers = set(doc['manufacturer'] for doc in app.docs
                      if 'manufacturer' in doc)
  deal = lambda m: {
             'manufacturer': m,
             'discountPct': random.randint(1, 10) * 5
         }
  # random.sample() needs a sequence (sets are rejected from Python 3.11)
  return json.dumps([deal(m) for m
                     in random.sample(sorted(manufacturers), 3)])

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print("Usage: {0} <csv-file>".format(sys.argv[0]))
    sys.exit(1)

  app.docs = list(read(sys.argv[1]))
  app.run(port=8000, debug=True)

The code generates discounts for a random selection of products and manufacturers. Save it to blog/offer.py and start the server, supplying the Google products CSV file on the command line:

blog$ python3 offer.py GoogleProducts.csv

Now, test it out using cURL (again, I like to pipe through jq to get nicely formatted JSON):

$ curl -s localhost:8000/products | jq .

You should see a list of objects, each with a product id and a discount percentage, something like:

[
  {
    "discountPct": 41,
    "id": "http://www.google.com/base/feeds/snippets/18100341066456401733"
  },
  {
    "discountPct": 63,
    "id": "http://www.google.com/base/feeds/snippets/16969493842479402672"
  },
  {
    "discountPct": 13,
    "id": "http://www.google.com/base/feeds/snippets/10357785197400989441"
  },
  {
    "discountPct": 35,
    "id": "http://www.google.com/base/feeds/snippets/2813321165033737171"
  },
  {
    "discountPct": 27,
    "id": "http://www.google.com/base/feeds/snippets/15203735208016659510"
  },
  ...
]

You get similar output if you use the /manufacturers endpoint:

$ curl -s localhost:8000/manufacturers | jq .

This time, we get a shorter list of manufacturers, each with a discount percentage, for example:

[
  {
    "discountPct": 15,
    "manufacturer": "freeverse software"
  },
  {
    "discountPct": 5,
    "manufacturer": "pinnacle systems"
  },
  {
    "discountPct": 50,
    "manufacturer": "destineer inc"
  }
]

Creating XJoin glue code

To bridge the gap between Solr and our external data source, XJoin requires some glue code, written in Java, to query the source and return the results. First, I’ll create a quick utility class to help with HTTP connections:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.json.Json;
import javax.json.JsonReader;
import javax.json.JsonStructure;

public class HttpConnection implements AutoCloseable {
  private HttpURLConnection http;
  
  public HttpConnection(String url) throws IOException {
    http = (HttpURLConnection)new URL(url).openConnection();
  }
  
  public JsonStructure getJson() throws IOException {
    http.setRequestMethod("GET");
    http.setRequestProperty("Accept", "application/json");
    try (InputStream in = http.getInputStream();
         JsonReader reader = Json.createReader(in)) {
      return reader.read();
    }
  }
  
  @Override
  public void close() {
    http.disconnect();
  }
}

Save this as blog/src/java/uk/co/flax/examples/xjoin/HttpConnection.java. The glue code we need is fairly simple, and can be written as a single class, implementing the XJoinResultsFactory interface:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class OfferXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  private String field;
  private String discountField;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
    field = (String)args.get("field");
    discountField = (String)args.get("discountField");
  }

  /**
   * Use 'offers' REST API to fetch current offer data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    try (HttpConnection http = new HttpConnection(url)) {
      JsonArray offers = (JsonArray)http.getJson();
      return new OfferResults(offers);
    }
  }
   
  /**
   * Results of the external search - methods like getXXX() are used
   * to expose the property XXX in the SOLR results.
   */
  public class OfferResults implements XJoinResults {
    private JsonArray offers;
    
    public OfferResults(JsonArray offers) {
      this.offers = offers;
    }
    
    public int getCount() {
      return offers.size();
    }
    
    @Override
    public Iterable<String> getJoinIds() {
      List<String> ids = new ArrayList<>();
      for (JsonValue offer : offers) {
        ids.add(((JsonObject)offer).getString(field));
      }
      return ids;
    }

    @Override
    public Object getResult(String joinIdStr) {
      for (JsonValue offer : offers) {
        String id = ((JsonObject)offer).getString(field);
        if (id.equals(joinIdStr)) {
          return new Offer(offer);
        }
      }
      return null;
    }
  }
  
  /**
   * A discount offer - methods like getXXX() are used to expose
   * properties that can be joined with each Solr result via the join
   * id field.
   */
  public class Offer {
    private JsonValue offer;
    
    public Offer(JsonValue offer) {
      this.offer = offer;
    }
    
    public double getDiscount() {
      return ((JsonObject)offer).getInt(discountField) * 0.01d;
    }
  }
}

Here, the init() method initialises the URL for the external API and the names of the values we want to pick out from the external data. The getResults() method connects to the external API – since in this example, the discounts do not depend on the user’s query, we don’t use the SolrParams argument at all. It returns an implementation of XJoinResults, which must be able to return a collection of join ids (so, the value of the join id field for each external result), and also be able to return an external result object given a join id. Together, the XJoinResults object and each external result object contain the results of the external search, exposed via getXXX() methods (which are mapped to properties called XXX) and (once everything is plumbed in) available to Solr for filtering, affecting the scores of documents, or for inclusion in the results set.
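The contract described here (return the join ids, and return one external result for a given join id) can be modelled in a few lines of Python. This is only a sketch of the shape of the data, not a replacement for the Java glue code: the class and method names are illustrative, and it indexes the offers by join id up front instead of scanning the list on every lookup as the Java example does.

```python
class OfferResults:
    """Python model of the results contract: join ids plus per-id lookup."""

    def __init__(self, offers, field, discount_field):
        # Index the external results by join id once
        self._by_id = {offer[field]: offer for offer in offers}
        self._discount_field = discount_field

    def get_count(self):
        return len(self._by_id)

    def get_join_ids(self):
        return list(self._by_id)

    def get_result(self, join_id):
        offer = self._by_id.get(join_id)
        if offer is None:
            return None
        # Mirrors Offer.getDiscount(): a percentage turned into a fraction
        return {"discount": offer[self._discount_field] * 0.01}

# Hypothetical offers, shaped like the /products output
results = OfferResults(
    [{"id": "p1", "discountPct": 41}, {"id": "p2", "discountPct": 63}],
    field="id",
    discount_field="discountPct",
)
print(results.get_join_ids(), results.get_result("p1"))
```

For the 64 offers in this example the difference is negligible, but a dictionary keeps lookups cheap if the external source returns many results.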

Save the above as blog/src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java. You’ll also need javax.json-1.0.4.jar, which you can download from here if you don’t already have it – place it in the blog directory. Compile the two Java source files, and create a JAR to contain the resulting .class files:

blog$ mkdir bin
blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java
blog$ jar cvf offer.jar -C bin .

Configuring XJoin

So now – at last! – I’ll configure a Solr query handler that uses the XJoin Solr plugin components to add filters and boost queries based on the external data.

I’ll be working with blog/conf/solrconfig.xml now. The first thing to do is include the contrib JARs for XJoin and our glue code JAR (offer.jar) in <lib> directives near the top of the config file. To do that, add in the following snippet just under the directive:

Here, you need to substitute /XXX with the full path to the parent of the blog directory. (We need to include javax.json-1.0.4.jar because it’s a dependency of our offer.jar.) Now for the request handler config – I’ll include everything we’re going to need even though it won’t all be used straightaway:




  discount
  0.0



  uk.co.flax.examples.xjoin.OfferXJoinResultsFactory
  id
  
    http://localhost:8000/products
    id
    discountPct
  



  uk.co.flax.examples.xjoin.OfferXJoinResultsFactory
  manufacturer
  
    http://localhost:8000/manufacturers
    manufacturer
    discountPct
  



  
    json
    all
    edismax
    description
    *

    false
    count
    *

    false
    count
    *
  
  
    x_product_offers
    x_manufacturer_offers
  
  
    x_product_offers
    x_manufacturer_offers

Insert this request handler config somewhere near the bottom of solrconfig.xml.

Using XJoin in a query

Let’s quickly get a query working, then I’ll explain what all the components that I’ve included do. Try this (remembering to escape curly brackets on the command line):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&fq=\{!xjoin\}x_product_offers&fl=id,name&rows=4' | jq .

You should see output like this (I’ve edited responseHeader.params for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 22,
    "params": {
      "x_product_offers": "true", 
      "x_product_offers.results": "count",
      "x_product_offers.fl": "*", 
      "q": "*", 
      "fq": "{!xjoin}x_product_offers", 
      "fl": "id,name",
      "rows": "4"
    }
  },
  "response": {
    "numFound": 64,
    "start": 0,
    "docs": [
      {
        "name": "did0480p-m311 plasmon additional maintenance 24x7 - plasmon diamond technical support - consul",
        "id": "http://www.google.com/base/feeds/snippets/13522752516373728128"
      },
      {
        "name": "apple ilife '06 family pack",
        "id": "http://www.google.com/base/feeds/snippets/10939909441298262260"
      },
      {
        "name": "adobe cs3 web standard upsell",
        "id": "http://www.google.com/base/feeds/snippets/8042583218932085904"
      },
      {
        "name": "the richard friedman trio motown hits - *(for the tg-100)*",
        "id": "http://www.google.com/base/feeds/snippets/17853905518738313346"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13522752516373728128",
        "doc": {
          "discount": 0.11
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/10939909441298262260",
        "doc": {
          "discount": 0.76
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/8042583218932085904",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17853905518738313346",
        "doc": {
          "discount": 0.05
        }
      }
    ]
  }
}

Here you can see the usual Solr output with our product documents in the response.docs array. Notice the value of response.numFound is only 64 out of a possible 3226. Additionally, we have an extra top-level section, x_product_offers, that gives us results from the external offers API – count tells us the total number of external results found, and there is an external result object with a join id matching each hit in the Solr results.

The query we made to get these results is a combination of the parameters in the request handler, and those in the URL’s query string – I’ve left the pertinent ones in responseHeader.params. The first parameter, x_product_offers=true, turns on the XJoin component that talks to the offers API, so that at query time, it will make a connection and retrieve external results (note that in this case, no parameters are passed to the external API – the following blog post will demonstrate this). The following two parameters control which fields are output from the external results – the .results option is a field list which controls the fields returned from the OfferResults object (that’s our implementation of XJoinResults – see the code above – there is one OfferResults object per external request and it acts as a collection of the returned external results). Then the .fl option is another field list which controls the fields returned for each external result object – these values can be used for filtering, boosting, and so on (for more on which, see below).

The parameters q=*, fl=id,name and rows=4 have their usual effects. The really interesting parameter is the filter query:

fq={!xjoin}x_product_offers

This uses Solr local parameters “short-form” syntax to reference the XJoinQParserPlugin that was set up in solrconfig.xml (it doesn’t take any initialisation parameters). This component uses the join ids from the referenced XJoin component to create a query that ORs together terms like join_field:join_id (one for each external result). It is based on the Solr built-in TermsQParserPlugin and supports the same method parameter (but this can usually be omitted). So, here, it makes a filter based on the join ids returned by the offers API – thus, only the products which have a current offer are returned.
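To make that concrete, the filter is roughly what you would get by writing out the OR query by hand. The sketch below builds such a query string in plain Java. The class and method names are invented for illustration, and this is not how the plugin itself is implemented; it only shows the shape of the resulting filter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoinFilterSketch {

    // Quote a join id so that characters such as ':' and '/' in the URL-style
    // ids are not read as query syntax; backslashes and quotes are escaped.
    static String quote(String joinId) {
        return "\"" + joinId.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // OR together one join_field:join_id clause per external result.
    // Note: an empty list yields an empty string, which a caller would have
    // to treat as "match nothing" (an assumption of this sketch).
    static String orFilter(String joinField, List<String> joinIds) {
        return joinIds.stream()
                .map(id -> joinField + ":" + quote(id))
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // Two of the join ids seen in the sample output above.
        List<String> joinIds = Arrays.asList(
                "http://www.google.com/base/feeds/snippets/13522752516373728128",
                "http://www.google.com/base/feeds/snippets/10939909441298262260");
        System.out.println(orFilter("id", joinIds));
    }
}
```

With 64 external results the hand-written query would have 64 clauses, which is why it is convenient to have the parser generate it from the XJoin component.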

Note that we could have used the same syntax in just the q parameter to achieve the same effect, but it’s more usual that a user full text query is specified in q and a ‘join’ created using a filter query.

Using the XJoinValueSourceParser

The XJoinValueSourceParser component that we have configured in solrconfig.xml provides us with a function, discount, that we can use in a function query. I configured the component to extract the value of discount from external results, and we supply an XJoin component name as the argument – this is a reference to a set of external results.

This opens up lots of possibilities, for example, a search in which each product’s score is boosted by a reciprocal function of the price including discount (so cheaper products, after discounting, are boosted higher):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2&fl=id,price,score&rows=4' | jq .

which results in a response something like (again, with responseHeader.params edited for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 55,
    "params": {
       "x_product_offers": "true", 
       "x_product_offers.results": "count", 
       "x_product_offers.fl": "*", 
       "q": "*", 
       "bf": "recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2",
       "fl": "id,price,score",
       "rows": "4"
     }
   },
  "response": {
    "numFound": 3226,
    "start": 0,
    "maxScore": 1.3371909,
    "docs": [
      {
        "id": "http://www.google.com/base/feeds/snippets/549551716004314019",
        "price": 0.5,
        "score": 1.3371909
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "price": 8.49,
        "score": 1.325241
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "price": 9.9,
        "score": 1.3166784
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/18427513736767114578",
        "price": 2.99,
        "score": 1.3156738
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "doc": {
          "discount": 0.71
        }
      }
    ]
  }
}

This time, because we haven’t applied a filter based on the external join ids, we still have the full set of documents in the results set (3226 in total). Note that although there are 4 results in response.docs (as requested by rows=4), there are only 2 external results in x_product_offers.external – this is because only 2 of those 4 Solr documents have matching external results (in that they have the same value of join id in the join field, which in this case is the product id). In other words, only 2 out of the 4 products returned have discounts offered.
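The matching step can be pictured as a simple lookup by join id. The following is a minimal sketch with invented class and method names, using a plain map in place of the real external results object; the discount values are illustrative and not taken from a specific run.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExternalMatchSketch {

    // For each document on the current page, look up an external result by
    // join id; documents without one contribute nothing to the output.
    static Map<String, Double> matchPage(List<String> pageJoinIds,
                                         Map<String, Double> discountByJoinId) {
        Map<String, Double> matched = new LinkedHashMap<>();
        for (String joinId : pageJoinIds) {
            Double discount = discountByJoinId.get(joinId);
            if (discount != null) {
                matched.put(joinId, discount);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical external results (the real API returned 64 of them).
        Map<String, Double> external = new LinkedHashMap<>();
        external.put("13704505045182265069", 0.78);
        external.put("17894887781222328015", 0.71);
        external.put("8042583218932085904", 0.30);

        // The four documents returned by rows=4 (ids abbreviated to their tails).
        List<String> page = Arrays.asList(
                "549551716004314019", "13704505045182265069",
                "17894887781222328015", "18427513736767114578");

        // Only the two documents with an offer appear in the result.
        System.out.println(matchPage(page, external));
    }
}
```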

To achieve the price boost, instead of a filter query, we have a boost function:

bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2

For each Solr document in the results set, the value of the expression discount(x_product_offers) is found by calling getDiscount() on the matching external result in the x_product_offers XJoin search component. When there is no matching external result, the default value 0.0 is used, as configured for the value source parser in solrconfig.xml, which is equivalent to a 0% discount.
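As a worked example of the arithmetic, Solr’s recip(x,m,a,b) function computes a/(m*x+b), so the expression above evaluates to 100/(discounted price + 100). The sketch below reproduces that calculation in plain Java; the class and method names are invented, a null discount stands in for "no matching external result", and the trailing ^2 weight on the boost function is left out.

```java
class DiscountBoostSketch {

    // Solr's recip(x, m, a, b) function: a / (m * x + b).
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    // Value of recip(product(price, sub(1, discount)), 1, 100, 100).
    // A null discount falls back to 0.0, mirroring the default value
    // configured for the value source parser.
    static double boost(double price, Double discount) {
        double d = (discount == null) ? 0.0 : discount;
        return recip(price * (1.0 - d), 1, 100, 100);
    }

    public static void main(String[] args) {
        // Same price with and without a 78% discount: the discounted
        // product gets the larger boost.
        System.out.println(boost(8.49, 0.78));
        System.out.println(boost(8.49, null));
    }
}
```

Because the function is 100/(x+100), the boost approaches 1 as the discounted price approaches zero and falls to 0.5 at a discounted price of 100.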

Of course, instead of the match-all q=* query, we can do an actual product search with our price boost, for example, q=apple. To be more sophisticated, we can also use the edismax parameter qf to query across both the name and description fields and weight them as we desire, for example, qf=name^4 description^2 or similar.

Joining on a field other than the unique id field

The join field does not have to correspond to the Solr unique id field. As seen above, the offers web API also returns discounts based on manufacturer (using the /manufacturers end-point). I configured another XJoin search component in solrconfig.xml called x_manufacturer_offers, the only differences from x_product_offers being the join field, which is now manufacturer, and the field which is taken from the external results to be the join value, which is of course the same, manufacturer.

So, now for example we can do a weighted query for “software”, but restricting to products that have a manufacturer discount of at least 20%:

blog$ curl 'localhost:8983/solr/products/xjoin?q=software&qf=name^4+description^2&x_manufacturer_offers=true&fq=\{!frange+l=0.2\}discount(x_manufacturer_offers)&fl=*&rows=4' | jq .

See FunctionRangeQParserPlugin for details of the filter query used in this search. This gives something like (responseHeader.params omitted this time):

{
  "responseHeader": {
    "status": 0,
    "QTime": 4
  },
  "response": {
    "numFound": 25,
    "start": 0,
    "maxScore": 1.1224447,
    "docs": [
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/7436299398173390476",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17001745805951209994",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/10584509515076384561",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329559531500
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17283219592038470822",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329681166300
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "freeverse software",
        "doc": {
          "discount": 0.2
        }
      }
    ]
  }
}

In this case, there was only one manufacturer represented in the requested top 4 rows of the Solr results set.
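The effect of the frange filter used in this query can be sketched as follows: each product picks up the discount of its manufacturer (or the default 0.0 when the manufacturer has no offer), and only products whose value is at least the lower bound are kept (frange bounds are inclusive by default). The class, method and sample data below are hypothetical, apart from the "freeverse software" discount of 0.2 shown in the output above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ManufacturerDiscountFilterSketch {

    // No matching external result means a discount of 0.0, as configured
    // for the value source parser.
    static final double DEFAULT_DISCOUNT = 0.0;

    // Keep the ids of products whose manufacturer discount is >= lower.
    static List<String> withDiscountAtLeast(double lower,
            Map<String, String> manufacturerByProductId,
            Map<String, Double> discountByManufacturer) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : manufacturerByProductId.entrySet()) {
            double discount =
                    discountByManufacturer.getOrDefault(e.getValue(), DEFAULT_DISCOUNT);
            if (discount >= lower) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> manufacturerByProductId = new LinkedHashMap<>();
        manufacturerByProductId.put("solace", "freeverse software");
        manufacturerByProductId.put("northland", "freeverse software");
        manufacturerByProductId.put("widget", "example corp");   // hypothetical
        manufacturerByProductId.put("gadget", "no offers ltd");  // hypothetical

        Map<String, Double> discountByManufacturer = new LinkedHashMap<>();
        discountByManufacturer.put("freeverse software", 0.2);
        discountByManufacturer.put("example corp", 0.1);         // hypothetical

        System.out.println(
                withDiscountAtLeast(0.2, manufacturerByProductId, discountByManufacturer));
    }
}
```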

Using two XJoin components in the same query

It’s worth noting that you can use more than one XJoin component in the same query. You can come up with more complicated examples, but this one shows how to query for all products that have a manufacturer discount as well as a product discount:

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&x_manufacturer_offers=true&fq=\{!xjoin\}x_product_offers&fq=\{!xjoin\}x_manufacturer_offers&fl=id,name,manufacturer&rows=4&wt=json' | jq .

You might have to try again a few times before you get a non-empty result set – here’s one I got:

{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "name": "apple software m8789z/a webobjects 5.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/4776201646741876078"
      },
      {
        "name": "apple software m9301z/b soundtrack v1.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/16537637847870148950"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/4776201646741876078",
        "doc": {
          "discount": 0.59
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/16537637847870148950",
        "doc": {
          "discount": 0.22
        }
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "apple software",
        "doc": {
          "discount": 0.3
        }
      }
    ]
  }
}

So you can see that there are two external results sections, one for product offers and one for manufacturer offers, and how the offers are matched to the products by the join ids (which are either the product id or the manufacturer).
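Since Solr intersects multiple fq parameters, the two xjoin filters together keep only products that appear in both sets of join ids. A minimal sketch of that logic, with invented class names and abbreviated sample data:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TwoJoinsSketch {

    static final class Product {
        final String id;
        final String manufacturer;

        Product(String id, String manufacturer) {
            this.id = id;
            this.manufacturer = manufacturer;
        }
    }

    // A product passes only if its id has a product offer AND its
    // manufacturer has a manufacturer offer (the two filters are ANDed).
    static List<String> inBothJoins(List<Product> products,
                                    Set<String> productJoinIds,
                                    Set<String> manufacturerJoinIds) {
        List<String> ids = new ArrayList<>();
        for (Product p : products) {
            if (productJoinIds.contains(p.id)
                    && manufacturerJoinIds.contains(p.manufacturer)) {
                ids.add(p.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Product> products = Arrays.asList(
                new Product("webobjects", "apple software"),
                new Product("soundtrack", "apple software"),
                new Product("solace", "freeverse software"),
                new Product("ilife", "apple"));

        Set<String> productOffers =
                new HashSet<>(Arrays.asList("webobjects", "soundtrack", "solace"));
        Set<String> manufacturerOffers =
                new HashSet<>(Arrays.asList("apple software"));

        System.out.println(inBothJoins(products, productOffers, manufacturerOffers));
    }
}
```

This also suggests why the result set is often empty: with only 64 product offers and 3 manufacturer offers, the overlap between the two is small.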

Next time…

In my next blog post, I’ll dive into another demonstration of XJoin, in which I show how to use click-through data to influence the score of subsequent searches.

The post XJoin for Solr, part 1: filtering using price discount data appeared first on Flax.

Lucene/Solr Revolution 2015: BioSolr – Searching the stuff of life

Charlie Hull — Fri, 16 Oct 2015 13:17:50 +0000

BioSolr – Searching the stuff of life – Lucene/Solr Revolution 2015 from Charlie Hull

The post Lucene/Solr Revolution 2015: BioSolr – Searching the stuff of life appeared first on Flax.

BioSolr at BOSC 2015 – open source search for bioinformatics

Charlie Hull — Mon, 13 Jul 2015 08:31:30 +0000

Matt Pearce writes:

I spent most of last Friday at the Bioinformatics Open Source Conference (BOSC) Special Interest Group meeting in Dublin, as part of this year’s ISMB/ECCB conference. Tony Burdett from EMBL-EBI was giving a quick talk about the BioSolr project, and I went along to speak to people at the poster session afterwards about what we are doing, and how other teams could get involved.

Unfortunately, I missed the first half of Holly Bik’s keynote (registration seemed to take forever, hindered by dubious wifi and a printer that refused to cooperate), which used the vintage Oregon Trail game as a great analogy for biologists getting into bioinformatics – there are many, frequently intimidating, options when choosing how to analyse data, and picking the right one can be scary (this is something that definitely applies to the areas we work in as well).

There was a new approach to the traditional Q&A session afterwards as well, with questions being submitted on cards around the room, and via a Twitter hashtag. This worked pretty well, although Twitter latency did slow things down a couple of times, and there were a few shouted-out questions from the floor, but certainly better than having volunteers with microphones trying to reach the questioner across rows of people.

The morning session was on Data Science, and while a number of the talks went over my head somewhat, it was interesting to see how tools like Hadoop are being used in Bioinformatics. It was good to see the spirit of collaboration in action too, with Sebastian Schoenherr’s talk about CloudGene, a project implementing a graphical front end for Hadoop that came about following an earlier BOSC. Tony’s talk about BioSolr went down well – the show of hands for people in the room using Lucene, Solr and/or Elasticsearch indicated that around 75% of those present were using search engines in some form. This backs up our earlier experience at the EBI, where the first BioSolr workshop was attended by teams from all over the campus, using Lucene or Solr in various versions to store and search their data.

Crossing over with lunch was the poster session, where Tony and I spoke to people about BioSolr. The Jalview team seemed especially interested in potential cross-over with their project, and there was plenty of interest generally in how the various extensions we have worked on (X-Join, hierarchical faceting) could be fitted into other projects.

The afternoon session was on the subject of Standards and Interoperability, starting with a great talk from Michael Crusoe about the Common Workflow Language, which started life at the BOSC 2014 codefest. There were several talks about Galaxy, a cloud-based platform for sharing data analyses, linking many other tools to allow workflows to be reproduced. Bruno Vieira’s talk about BioNode was also very interesting, and I made notes to check out oSwitch when time is available.

I had to leave before the afternoon’s panel took place, but all in all it was a very interesting day learning how open source software is being used outside of the areas I usually work in.

The post BioSolr at BOSC 2015 – open source search for bioinformatics appeared first on Flax.

Lucene/Solr London Meetup – BioSolr and Query Deep Dive

Charlie Hull — Fri, 24 Apr 2015 10:26:34 +0000

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Data Bank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets and a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterize queries so that a Solr endpoint can be customized for a particular purpose and how to combine different query parsers. His slides are available here.

Thanks to all our speakers, to Barclays for providing the venue and for some very tasty food, and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

The post Lucene/Solr London Meetup – BioSolr and Query Deep Dive appeared first on Flax.

Solr Superclusters for improved federated search

Charlie Hull — Tue, 20 Jan 2015 10:24:18 +0000

As part of our BioSolr project, we’ve been discussing how best to create a federated search over several Apache Solr instances. In this case various research institutions across the world are annotating data objects representing proteins and it would be useful to search not just the original protein data, but what others have added to the body of knowledge. If an institution wants to use the annotations, the usual approach is to download the extra data regularly and add it into a local Solr index.

Luckily Solr is widely used in the bioinformatics community, so we have commonality in the query API. The question is whether it would be possible to use some of the distributed querying capabilities of SolrCloud to search not just the shards of a single index, but a group of Solr/SolrCloud indices – a supercluster.

This is a bit like a standard federated search, where queries are farmed out to various disparate search engines and the results then combined and displayed in some fashion. However, since we are sharing a single technology, powerful features such as result grouping would be possible.

For this to work at all, there would need to be some agreed standard between the various Solr systems: a globally unique record identifier for example (possibly implemented with a prefix unique to each institution). Any data that was required for result grouping would have to share a schema across the entire supercluster – let’s call this the primary schema – but basic searching and faceting could still be carried out over data with a differing, secondary schema. Solr dynamic fields might be useful for this secondary schema.
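As a rough illustration of the prefixed identifier idea, here is a minimal sketch. Everything in it is an assumption made for the example: the ':' separator, the class and method names, and the institution prefixes are all hypothetical, and nothing here is part of Solr or BioSolr.

```java
class SuperclusterIdSketch {

    // Assumed separator between institution prefix and local record id.
    static final char SEPARATOR = ':';

    // Build a globally unique id from an institution prefix and a local id.
    static String globalId(String institutionPrefix, String localId) {
        if (institutionPrefix.isEmpty() || institutionPrefix.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("bad prefix: " + institutionPrefix);
        }
        return institutionPrefix + SEPARATOR + localId;
    }

    // Recover the institution prefix (everything before the first separator).
    static String institutionOf(String globalId) {
        int i = globalId.indexOf(SEPARATOR);
        if (i <= 0) {
            throw new IllegalArgumentException("no prefix in: " + globalId);
        }
        return globalId.substring(0, i);
    }

    // Recover the local id (everything after the first separator).
    static String localIdOf(String globalId) {
        int i = globalId.indexOf(SEPARATOR);
        if (i <= 0) {
            throw new IllegalArgumentException("no prefix in: " + globalId);
        }
        return globalId.substring(i + 1);
    }

    public static void main(String[] args) {
        // The same local id at two hypothetical institutions stays distinct.
        String a = globalId("inst-a", "protein-42");
        String b = globalId("inst-b", "protein-42");
        System.out.println(a + " vs " + b);
        System.out.println(institutionOf(a) + " / " + localIdOf(a));
    }
}
```

Splitting on the first separator only means local ids may themselves contain ':' without breaking the round trip.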

Luckily, research institutions are used to working as part of a consortium, and one of the conditions for joining would be agreeing to some common standards. A single Solr query API would then be available to all members of the consortium, to search not just their own data but everything available from their partners, without the slow and error-prone process of copying the data for local indexing.

We’re currently evaluating the feasibility of this idea and would welcome input from others – let us know what you think in the comments!

The post Solr Superclusters for improved federated search appeared first on Flax.

Autumn events roundup – ESS DC, Solr vs Elasticsearch & a new Meetup

Charlie Hull — Mon, 27 Oct 2014 16:05:24 +0000

It’s looking like a busy Autumn for search events – first, I’m presenting at Enterprise Search & Discovery 2014 in Washington DC on November 5th, talking about ‘Turning Search Upside Down with open source software’. I’ll be describing how we’ve replaced various underperforming, big name closed source search engines with faster & more scalable open source technology, including our own Luwak stored query engine. Do let me know if you’re in DC, I’d be very happy to meet up. The week after this is Lucene Revolution, which sadly we won’t be attending this year, but it is recommended if you’re interested in Lucene and Solr.

Towards the end of November there’s Search Solutions, a great day of presentations about all aspects of search held at the British Computer Society in Covent Garden. This year Tom Mortimer from Flax will be presenting some research we’ve done into performance comparisons between Lucene/Solr and Elasticsearch, and there are also presentations from Thomson Reuters, the British Library, Microsoft, Yahoo! and Google. I highly recommend this event, it’s always worth attending.

We’re also starting a new Meetup in London, a group for users of Apache Lucene/Solr (there’s an Elasticsearch London user group but strangely no equivalent for the other popular stack). Our first event is on November 28th, kindly hosted by Bloomberg (who are no strangers to Lucene/Solr themselves) and featuring Shalin Mangar, a Lucene/Solr committer from Lucidworks who is visiting Europe that week. We’re hoping that we can run these events every few months, but we need help from the community, so if you could talk, sponsor or host the Meetups do let us know.

In December we’ll be holding another Cambridge Search Meetup and will be talking about our work with the European Bioinformatics Institute on the BioSolr project – the date to be confirmed. Busy times!

The post Autumn events roundup – ESS DC, Solr vs Elasticsearch & a new Meetup appeared first on Flax.

Cambridge Search Meetup – Elasticsearch Hackday

Charlie Hull — Fri, 03 Oct 2014 12:32:00 +0000

Last Friday we hosted a hackday featuring Elasticsearch in Cambridge, following a similar event last year focused on Apache Lucene/Solr. Around 20 people attended from organisations working in sectors including analytics, digital music, bioinformatics and e-commerce, and all the Flax team were there as well.

We started with a brief presentation on Elasticsearch and asked around the room for any data collections we might be able to use. Lee from Elasticsearch (the company) had brought collections of UK crime data and the complete works of Shakespeare; we also had several million rows of digital music metadata, Wikipedia edit data for all UK MPs (to follow last year’s theme!) and several years of data describing Premier League football. Unlike our Solr hackday where each team worked on the same general task, this time we split into four different teams who worked on all of the above except the Wikipedia edits. We’d also been provided with a very high-performance Elasticsearch cluster by BigStep for our use, which meant it was very quick to index the above data and start working with it.

By lunchtime (the food was sponsored by Elasticsearch, who also provided stickers, plush ELKs and lollypops – thanks guys!) we had some very basic information about the various datasets – such as which scene in which Shakespeare play has the most characters on stage (the answer is 21 in Richard III), and which football teams seemed to gain the most advantage from playing at home. Note that we had already moved beyond basic search functionality to use Elasticsearch as an analytic platform, answering particular questions, using features such as aggregations.

We continued during the afternoon to develop the various applications and finished with a ‘show and tell’. Some of the teams had managed to develop user interfaces for Elasticsearch, the most polished being a clickable Google Map that would show you which types of crime were significantly above and below the national average for the area you selected – unsurprisingly in Cambridge, stolen bicycles were very common! By the end of the day, everyone had gained experience of Elasticsearch, some for the first time. We finished the day, as is traditional, with a swift pint and further networking.

Thanks to Cambridge Business Lounge (a highly recommended co-working space) for the venue, BigStep for hosting and Elasticsearch for sponsoring lunch and providing the swag, and of course to all who attended. We’ll return with a further Cambridge Search Meetup soon!

The post Cambridge Search Meetup – Elasticsearch Hackday appeared first on Flax.