Flax – The Open Source Search Specialists (http://www.flax.co.uk)

Search Insights 2018 – a free, independent report on search
http://www.flax.co.uk/blog/2018/03/26/search-insights-2018-free-independent-report-search/
Mon, 26 Mar 2018

Over the last 17 years of running Flax I’ve met many people who loudly profess to be experts in various aspects of the search business. Some have a new product or service to sell, that promises to change the game forever; quite often this turns out to be snake oil or simply a new name for an old solution. Others seem to have arrived suddenly, fully-fledged, enthusiastic to convince us old hands that everything will be different now if we all sign up to their new idea.

There’s also a small group of people who tend to be quieter about their expertise, perhaps because as independent practitioners or small business owners they’re not supported by the marketing budgets of large companies. These people survive on their reputation, which has been built steadily on a record of solid advice, honesty and neutrality. I’m now lucky enough to be part of this group – an informal network of experts in subjects as diverse as search for Sharepoint, intranet strategy and taxonomy management. Occasionally we collaborate on projects, often we recommend each other to our clients and it’s always hugely enjoyable to meet in person and discuss the latest trends and industry landscape. This informal network means Flax can offer more services to our clients – and if we can’t help, we probably know someone we trust who can.

So I’m very proud to announce that this group – the Search Network – are releasing a joint publication, Search Insights 2018. In this 70-page collection of essays you can learn how to research, procure, choose, budget, plan and run a search project in the best way for your business and your users.

Unlike some other industry reports, we’re not charging for this report, you won’t have to register or give us your email address, and it’s Creative Commons licensed so you can even redistribute it if you like (with attribution). There’s no sponsorship, no plotting of vendors on confusing trend diagrams, no marketing buzzwords or direct recommendations – after all, we’re independent. We welcome any feedback you have of course.

My personal thanks to Martin White who has led this effort and who has also written about the Network and the report.

XJoin for Solr, part 2: a click-through example
http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/
Fri, 29 Jan 2016

In my last blog post, I demonstrated how to set up and configure Solr to use the new XJoin search components we’ve developed for the BioSolr project, using an example from an e-commerce setting. This time, I’ll show how to use XJoin to make use of user click-through data to influence the score of products in searches.

I’ll step through things a bit quicker this time around and I’ll be using code from the last post so reading that first is highly recommended. I’ll assume that the prerequisites from last time have been installed and set up in the same directories.

The design

Suppose we have a web page for searching a collection of products, and when a user clicks on product listing in the result set (or perhaps, when they subsequently go on to buy that product – or both) we insert a record in an SQL database, storing the product id, the query terms they used, and an arbitrary weight value (which will depend on whether they merely clicked on a result, or if they went on to buy it, or some other behaviour such as mouse pointer tracking). We then want to use the click-through data stored in that database to boost products in searches that use those query terms again.

We could use the sum of the weights of all occurrences of a product id/query term combination as the product score boost, but then we might start to worry about a feedback process occurring. Alternatively, we might take the maximum or average weight across the occurrences. In the code below, we’ll use the maximum.
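As a rough sketch of that choice (assuming an SQLite build with FTS4, which the API code below also relies on, and a made-up product id), the three aggregation strategies are just different SQL functions over the same click table:

import sqlite3

# Minimal sketch using the same fts4 click table as the API code below,
# here created in an in-memory database with two clicks for one product.
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute("CREATE VIRTUAL TABLE click USING fts4 (id VARCHAR(256), q VARCHAR(256), weight FLOAT)")
c.executemany("INSERT INTO click (id, q, weight) VALUES (?, ?, ?)",
              [('product-1', 'excel', 1.0), ('product-1', 'excel', 3.0)])

for agg in ('SUM', 'MAX', 'AVG'):
    c.execute("SELECT id, %s(weight) FROM click WHERE q MATCH ? GROUP BY id" % agg, ('excel',))
    print(agg, c.fetchall())  # compare the boost each strategy would produce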

The advantage of this design over storing the click-through information in Solr is that you don’t have to update the Solr index every time there is user activity, which could become costly. An SQL database is much more suited to this task.

The external click-through API

Again, we’ll be using Python 3 (using the flask and sqlite3 modules) to implement the external API. I’ll be using this API to update the click-through database (by hand, for this example) as well as having Solr query it using XJoin. Here’s the code (partly based on code taken from here for caching the database connection in the Flask application context, and see here if you’re interested in more details about sqlite3’s support for full text search). Again, all the code written for this example is also available in the BioSolr GitHub repository:

from flask import Flask, request, g
import json
import sqlite3 as sql

# flask application context attribute for caching database connection
DB_APP_KEY = '_database'

# default weight for storing against queries
DEFAULT_WEIGHT = 1.0

app = Flask(__name__)

def get_db():
  """ Obtain a (cached) DB connection and return a cursor for it.
  """
  db = getattr(g, DB_APP_KEY, None)
  if db is None:
    db = sql.connect('click.db')
    setattr(g, DB_APP_KEY, db)
    c = db.cursor()
    c.execute("CREATE VIRTUAL TABLE IF NOT EXISTS click USING fts4 ("
                "id VARCHAR(256),"
                "q VARCHAR(256),"
                "weight FLOAT"
              ")")
    c.close()
  return db

@app.teardown_appcontext
def teardown_db(exception):
  db = getattr(g, DB_APP_KEY, None)
  if db is not None:
    db.close()

@app.route('/')
def main():
  return 'click-through API'

@app.route('/click/<path:id>', methods=["PUT"])
def click(id):
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  try:
    w = float(request.args.get('weight', DEFAULT_WEIGHT))
  except ValueError:
    return 'Could not parse weight', 400

  # do the DB update
  db = get_db()
  try:
    c = db.cursor()
    c.execute("INSERT INTO click (id, q, weight) VALUES (?, ?, ?)", (id, q, w))
    db.commit()
    return 'OK'
  finally:
    c.close()

@app.route('/ids')
def ids():
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  
  # do the DB lookup
  try:
    c = get_db().cursor()
    c.execute("SELECT id, MAX(weight) FROM click WHERE q MATCH ? GROUP BY id", (q, ))
    return json.dumps([{ 'id': id, 'weight': w } for id, w in c])
  finally:
    c.close()

if __name__ == "__main__":
  app.run(port=8001, debug=True)

This web API exposes two end-points. First we have PUT /click/[id] which is used when we want to update the SQL database after a user click. For the purposes of this demonstration, we’ll be hitting this end-point by hand using curl to avoid having to write a web UI. The other end-point, GET /ids?[query terms], is used by our XJoin component and returns a JSON-formatted array of id/weight objects where the query terms from the database match those given in the query string.
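If you prefer not to use curl, here's a minimal sketch of calling both end-points from Python with just the standard library. It assumes the API is running on localhost:8001, as in the code above; the example values are the ones used later in this post:

import json
import urllib.parse
import urllib.request

BASE = 'http://localhost:8001'

def record_click(product_id, query, weight=None):
    # PUT /click/<id>?q=<query>[&weight=<weight>]
    params = {'q': query}
    if weight is not None:
        params['weight'] = weight
    url = '%s/click/%s?%s' % (BASE, product_id, urllib.parse.urlencode(params))
    return urllib.request.urlopen(urllib.request.Request(url, method='PUT')).read().decode()

def lookup_ids(query):
    # GET /ids?q=<query> returns a JSON array of id/weight objects
    url = '%s/ids?%s' % (BASE, urllib.parse.urlencode({'q': query}))
    return json.loads(urllib.request.urlopen(url).read().decode())

if __name__ == '__main__':
    print(record_click('http://www.google.com/base/feeds/snippets/9200068133591804002', 'excel', 3))
    print(lookup_ids('excel'))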

Java glue code

Now we just need the Java glue code that sits between the XJoin component and our external API. Here’s an implementation of XJoinResultsFactory that does what we need:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class ClickXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
  }

  /**
   * Use 'click' REST API to fetch current click data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    String q = URLEncoder.encode(params.get("q"), "UTF-8");
    String apiUrl = url + "?q=" + q;
    try (HttpConnection http = new HttpConnection(apiUrl)) {
      JsonArray products = (JsonArray)http.getJson();
      return new ClickResults(products);
    }
  }
    
  public class ClickResults implements XJoinResults {
    private Map<String, Click> clickMap;
    
    public ClickResults(JsonArray products) {
      clickMap = new HashMap<>();
      for (JsonValue product : products) {
        JsonObject object = (JsonObject)product;
        String id = object.getString("id");
        double weight = object.getJsonNumber("weight").doubleValue();
        clickMap.put(id, new Click(id, weight));
      }
    }
    
    public int getCount() {
      return clickMap.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      return clickMap.keySet();
    }

    @Override
    public Object getResult(String id) {
      return clickMap.get(id);
    }      
  }
  
  public class Click {
    
    private String id;
    private double weight;
    
    public Click(String id, double weight) {
      this.id = id;
      this.weight = weight;
    }
    
    public String getId() {
      return id;
    }
    
    public double getWeight() {
      return weight;
    } 
  }
}

Unlike the previous example, this time getResults() does depend on the SolrParams argument, so that the user’s query, q, is passed to the external API. Store this Java source in blog/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java and compile into a JAR (again, we also need the HttpConnection class from the last blog post as well as javax.json-1.0.4.jar):

blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java
blog$ jar cvf click.jar -C bin .

Solr configuration

Starting with a fresh version of solrconfig.xml, insert these lines near the start to import the XJoin and user JARs (substitute /XXX with the full path to the parent of the blog directory):

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/click.jar" />

And the query parser, value source parser, search component and request handler configuration:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="weight" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">weight</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_click" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.ClickXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8001/ids</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_click">false</bool>
    <str name="x_click.results">count</str>
    <str name="x_click.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_click</str>
  </arr>
  <arr name="last-components">
    <str>x_click</str>
  </arr>
</requestHandler>

Reload the Solr core (products) to get the new config in place.

Putting the pieces together

The following query will verify our Solr setup (remembering to escape curly brackets):

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&fl=id,name,score&rows=4' | jq .

I’ve used Solr parameter substitution with the q/qq parameters which will simplify later queries (this has been in Solr since 5.1). This query returns:

{
  "responseHeader": {
    "status": 0,
    "QTime": 25
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 2.9939778,
    "docs": [
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.8712361
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 1.8712361
      }
    ]
  }
}

Some repeat products in the data, but so far, so good. Next, get the click-through API running:

blog$ python3 click.py

And check it’s working (this should return [] whatever query is chosen because the click-through database is empty):

curl localhost:8001/ids?q=software | jq .

Now, let’s populate the click-through database by simulating user activity. Suppose, given the above product results, the user goes on to click through to the fourth product (or even buy it). Then, the UI would update the click web API to indicate this has happened. Let’s do this by hand, specifying the product id, the user’s query, and a weight score (here, I’ll use the value 3, supposing the user bought the product in the end):

curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/9200068133591804002?q=excel&weight=3'

Now, we can check the output that XJoin will see when using the click-through API:

blog$ curl localhost:8001/ids?q=excel | jq .

giving:

[
  {
    "weight": 3,
    "id": "http://www.google.com/base/feeds/snippets/9200068133591804002"
  }
]

Using the bf edismax parameter and the weight function set up in solrconfig.xml to extract the weight value from the external results stored in the x_click XJoin search component, we can boost product scores when they appear in the click-through database for the user’s query:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

which gives:

{
  "responseHeader": {
    "status": 0,
    "QTime": 13
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 3.2224145,
    "docs": [
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 3.2224145
      },
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.5559989
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
          "weight": 3
        }
      }
    ]
  }
}

Lo and behold, the product the user clicked on now appears at the top of the Solr results for that query. Have a play with the API, generate some more user activity and see how this affects subsequent queries. It will cope fine with multiple-word queries; for example, suppose a user searches for ‘games software’:

curl 'http://localhost:8983/solr/products/xjoin?qq=games+software&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

There being no relevant queries in the click-through database, this has the same results as for a query without the XJoin, and as we can see, the value of response.x_click.count is 0:

{
  "responseHeader": {
    "status": 0,
    "QTime": 15
  },
  "response": {
    "numFound": 1158,
    "start": 0,
    "maxScore": 0.91356516,
    "docs": [
      {
        "name": "encore software 10568 - encore hoyle puzzle & board games 2005 - complete product - puzzle game - 1 user - complete product - standard - pc",
        "id": "http://www.google.com/base/feeds/snippets/4998847858583359731",
        "score": 0.91356516
      },
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 0.8699497
      },
      {
        "name": "encore software 10027 - hoyle board games (win 98 me 2000 xp)",
        "id": "http://www.google.com/base/feeds/snippets/8664755713112971171",
        "score": 0.85982025
      },
      {
        "name": "encore software 11253 - brain food games: cranium collection 2006 sb cs by encore",
        "id": "http://www.google.com/base/feeds/snippets/15401280256033043239",
        "score": 0.78744644
      }
    ]
  },
  "x_click": {
    "count": 0,
    "external": []
  }
}

Now let’s simulate the same user clicking on the second product (with default weight):

blog$ curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/826668451451666270?q=games+software'

Next, suppose another user then searches for just ‘games’:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=games&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

In the results, we see the ‘wild games’ product boosted to the top:

{
  "responseHeader": {
    "status": 0,
    "QTime": 60
  },
  "response": {
    "numFound": 212,
    "start": 0,
    "maxScore": 1.3652229,
    "docs": [
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 1.3652229
      },
      {
        "name": "xbox 360: ddr universe",
        "id": "http://www.google.com/base/feeds/snippets/16659259513615352372",
        "score": 0.95894843
      },
      {
        "name": "south park chef's luv shack",
        "id": "http://www.google.com/base/feeds/snippets/11648097795915093399",
        "score": 0.95894843
      },
      {
        "name": "egames. inc casual games pack",
        "id": "http://www.google.com/base/feeds/snippets/16700933768709687512",
        "score": 0.89483213
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
          "weight": 1
        }
      }
    ]
  }
}

Extensions

Of course, this approach can be extended to add in more sophisticated weighting and boosting strategies, or include more data about the user activity than just a simple weight score, which could be used to augment the display of the product in the UI (for example, “ten customers in the UK bought this product in the last month”).
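For instance, one simple extension (not part of the example code above) would be to store a timestamp with each click and decay the weight of older activity when it is aggregated. A sketch of the decay calculation, assuming Unix timestamps:

import math
import time

HALF_LIFE_DAYS = 30.0

def decayed_weight(weight, clicked_at, now=None):
    # Halve the contribution of a click for every HALF_LIFE_DAYS that has passed
    now = now or time.time()
    age_days = (now - clicked_at) / 86400.0
    return weight * math.pow(0.5, age_days / HALF_LIFE_DAYS)

# A weight-3 purchase from 60 days ago now contributes only 0.75
print(decayed_weight(3.0, time.time() - 60 * 86400))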

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.) We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

The fun and frustration of writing a plugin for Elasticsearch for ontology indexing
http://www.flax.co.uk/blog/2016/01/27/fun-frustration-writing-plugin-elasticsearch-ontology-indexing/
Wed, 27 Jan 2016

As part of our work on the BioSolr project, I have been continuing to work on the various Elasticsearch ontology annotation plugins (note that even though the project started with a focus on Solr – thus the name – we have also been developing some features for Elasticsearch). These are now largely working, with some quirks which will be mentioned below (they may not even be quirks, but they seem non-intuitive to me, so deserve a mention). It’s been a slightly painful process, as you may infer from the use of italics below, and we hope this post will illustrate some of the differences between writing plugins for Solr and Elasticsearch.

It’s probably worth noting that at least some of this write-up is speculative. I’m not privy to the internals of Elasticsearch, and have been building the plugin through a combination of looking at the Elasticsearch source code (as advised by the documentation) and running the same integration test over and over again for each of the various versions, and checking what was returned in the search response. There is very little in the way of documentation, and the 1.x versions of Elasticsearch have almost no comments or Javadoc in the code. It has been interesting and fun, and not at all exasperating or frustrating.

The code

The plugin code can be broken down into three broad sections:

  • A core module, containing code shared between the Elasticsearch and Solr versions of the plugin. Anything in this module should be search engine agnostic, and is dedicated to accessing and pulling data from ontologies, either via the OLS service (provided by the European Bioinformatics Institute, our partners in the BioSolr project) or more generally OWL files, and returning a structure which can be used by the plugins.
  • The es-ontology-annotator-core module, which is shared between all versions of the plugin, and contains Elasticsearch-specific code to build the helper classes required to access the ontology data.
  • The es-ontology-annotator-esx.x modules, which are specific to the various versions of Elasticsearch. So far, there are six of these (one of the more challenging aspects of this work has been that the Elasticsearch mapper structure has been evolving through the versions, as has some of the internal infrastructure supporting them):
    • 1.3 – for ES 1.3
    • 1.4 – for ES 1.4
    • 1.5 – for ES 1.5 – 1.7
    • 2.0 – for ES 2.0
    • 2.1 – for ES 2.1.1
    • 2.2 – for ES 2.2

I haven’t tried the plugin with any versions of ES earlier than 1.3. There was a change to the internal mapping classes between 1.4 and 1.5 (UpdateInPlaceHashMap was removed and replaced with CopyOnWriteHashMap), presumably for a Very Good Reason. Versions since 1.5 seem to be forward compatible with later 1.x versions.

The quirks

All of the versions of the plugin work in the same way. You specify in your mapping that a particular field has the type “ontology”. There are various additional properties that can be set, depending on whether you’re using an OWL file or OLS as your ontology data source (specified in the README). When the data is indexed, any information in that field is assumed to be an IRI referring to an ontology record, and will be used to fetch as much data as required/possible for that ontology record. The data will then be added as sub-fields to the ontology fields.
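For illustration only, the relevant part of such a mapping might look something like the sketch below; the field name 'annotation' just follows the examples later in this post, and the ontology data source settings (OWL file or OLS) documented in the README are omitted:

import json

# A sketch, not a definitive configuration: only the "ontology" field type is
# taken from the text above; the OWL/OLS settings from the README would sit
# alongside "type".
annotation_mapping = {
    "properties": {
        "annotation": {
            "type": "ontology"
            # ... plus ontology data source settings from the README
        }
    }
}

print(json.dumps(annotation_mapping, indent=2))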

The new data is not added to the _source field, which is the easy way of seeing what data is in a stored record. In order to retrieve the new data, you have two options:

  • Grab the mapping for your index, and look through it for the sub-fields of your annotation field. Use as many of these as you need to populate the fields property in your search request, making sure you name them fully (ie. annotation.uri, annotation.label, annotation.child_uris).
  • Add all of the fields to the fields property in your search request (ie. "fields": [ "*" ]).

What you cannot do is add “annotation.*” to your search request to get all of the annotation subfields. At this stage, this doesn’t work. I’m still working out whether this is possible or not.
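To make the second option concrete, here is a sketch of a search request that pulls back the annotation sub-fields explicitly; the index name is made up, and the sub-field names follow the examples above:

import json
import urllib.request

# Hypothetical index name; adjust to your own setup
url = 'http://localhost:9200/biosolr/_search'
body = {
    "query": {"match_all": {}},
    "fields": ["annotation.uri", "annotation.label", "annotation.child_uris"]
    # or "fields": ["*"] - but note that "annotation.*" does not work here
}
request = urllib.request.Request(url, data=json.dumps(body).encode(),
                                 headers={'Content-Type': 'application/json'})
print(json.loads(urllib.request.urlopen(request).read().decode()))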

How it works

All of the versions work in a broadly similar fashion: the OntologyMapper class extends AbstractFieldMapper (Elasticsearch 1.x) or FieldMapper (Elasticsearch 2.x). The Mapper classes all have two internal classes:

  • a TypeParser, which reads the mapper’s configuration from the mapping details (as initially specified by the user, and as also returned from the Mapper.toXContent method), and returns…
  • a Builder, which constructs the mappers for the known sub-fields and ultimately builds the Mapper class. The sub-field mappers are all for string fields, with mappers for URI fields having tokenisation disabled, while the other fields have it enabled. All are both indexed and stored.

The Mapper parses the content of the initial field (the IRI for the ontology record), and adds the sub-fields to the record, as part of the Mapper.parse method call (this is the most significant part of the Mapper code). There are at least two ways of doing this, and the Elasticsearch source code has both depending on which Mapper class you look at. There is no indication in the source why you would use one method over the other. This helps with clarity, especially when things aren’t working as they should.

What makes life more interesting for the OntologyMapper class is that not all of the sub-fields are known at start time. If the user wishes to index additional relationships between nodes (“participates in”, “has disease location”, etc.), these are generated on the fly, and the sub-fields need to be added to the mapping. Figuring out how to do this, and also how to make sure those fields are returned when the use requests the mapping for the index, has been a particular challenge.

The TypeParser is called more than once during the indexing process. My initial assumption was that once the mapping details had been read from the user’s specification, the parser was “fixed,” and so you had to keep track of the sub-field mappers yourself. This is not the case. As noted above, the TypeParser can also be fed from the Mapper’s toXContent method (which generates the mapping seen when you call the _mapping endpoint). Elasticsearch versions 1.x didn’t seem to care particularly what toXContent returned, so long as it could be parsed without throwing a NullPointerException, but Elasticsearch versions 2.x actually check that all of the mapping configuration has been dealt with. This actually makes life easier internally – after the mapper has processed a record, at least some of the dynamic field mappings are known, so you can build the sub-field mappers in the Builder rather than having to build them on the fly during the Mapper.parse process.

The other non-trivial Mapper methods are:

  • toXContent, as mentioned several times already. This generates the mapping output (ie. the definition of the field as seen when you look via the _mapping endpoint).
  • merge, which seems to do a compatibility check between an incoming instance of the mapper and the current instance. I’ve added some checks to this, but no significant code. Several of the implementations of this method in the Elasticsearch source code simply contain comments to the effect of “will return to this later”, so it seems I’m not the only person who doesn’t understand how merge works, or why it is called.
  • traverse (Elasticsearch 1.x) and iterator (Elasticsearch 2.x), which seem to do similar things – namely providing a means to iterate through the sub-field mappers. In Elasticsearch 1.x, the traverse method is explicitly called as part of the process to add the new (dynamic) mappers to the mapping, but this isn’t a requirement for Elasticsearch 2.x. Elasticsearch 1.x distinguished between ObjectMappers and FieldMappers, which doesn’t seem to be a distinction in Elasticsearch 2.x.

Comparisons with the Solr plugin

The Solr plugin works somewhat differently to the Elasticsearch one. The Solr plugin is implemented as an UpdateRequestProcessor, and adds new fields directly to the incoming record (it doesn’t add sub-fields). This makes the returned data less tidy, but also easier to handle, since all of the new fields have the same prefix and can therefore be handled directly. You don’t need to explicitly tell Solr to return the new fields – because they are all stored, they are all returned by default.

On the other hand, you still have to jump through some hoops to work out which fields are dynamically generated, if you need to do that (i.e. to add checkboxes to a form to search “has disease location” or other relationships) – you need to call Solr to retrieve the schema, and use that as the basis for working out which are the new fields. For Elasticsearch, you have to request the mapping for your index, and use that in a similar way.
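A sketch of the Elasticsearch side of that, using only the standard library; the index name is an assumption, and because the exact nesting of the sub-fields in the mapping varies between versions, the helper searches for the annotation field rather than hard-coding a path:

import json
import urllib.request

url = 'http://localhost:9200/biosolr/_mapping'   # hypothetical index name
mapping = json.loads(urllib.request.urlopen(url).read().decode())

def find_field(node, name):
    # Recursively look for the named field's definition anywhere in the mapping
    if isinstance(node, dict):
        if name in node and isinstance(node[name], dict):
            return node[name]
        for value in node.values():
            found = find_field(value, name)
            if found is not None:
                return found
    return None

# Inspect this for the current sub-field names (annotation.uri, annotation.label, ...)
print(json.dumps(find_field(mapping, 'annotation'), indent=2))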

Configuration in Solr requires modifying the solrconfig.xml, once the plugin JAR file is in place, but doesn’t require any changes to the schema. All of the Elasticsearch configuration happens in the mapping definition. This reflects the different ways of implementing the plugin for Solr. I don’t have a particular feeling for whether it would have been better to implement the Solr plugin as a new field type – I did investigate, and it seemed much harder to do this, but it might be worth re-visiting if there is time available.

The Solr plugin was much easier to write, simply because the documentation is better. The Solr wiki has a very useful base page for writing a new UpdateRequestProcessor, and the source code has plenty of comments and Javadoc (although it’s not perfect in this respect – SolrCoreAware has no documentation at all, has been present since Solr 1.3, and was a requirement for keeping track of the Ontology helper threads).

I will most likely update this post as I become aware of things I have done which are wrong, or any misinformation it contains. We’ll also be talking further about the BioSolr project at a workshop event on February 3rd/4th 2016. We welcome feedback and comments, of course – especially from the wider Elasticsearch developer community.

Principles of Solr application design – part 2 of 2
http://www.flax.co.uk/blog/2013/12/17/principles-of-solr-application-design-part-2-of-2/
Tue, 17 Dec 2013

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! Here’s the second part, you can also read the first part.

8. Have enough RAM

The single biggest performance bottleneck in most search installations is lack of RAM. Search is an I/O-intensive process, and the more that disk reads can be cached in memory, the better performance will be. As a rough guideline, your available RAM should be at least 50% of the total size of your Solr index files. For demanding applications, up to 100% of the index size may be necessary.

I/O caching is incremental rather than immediate, and some minutes of searches under load may be required to warm the caches. Don’t expect high performance until the caches are thoroughly warmed up.

An increasingly popular alternative is to use solid state disks (SSDs) instead of traditional hard disks. These are hundreds of times faster, and mean that cold searches should be reasonably fast. They also reduce the amount of RAM required to perhaps as little as 10% of the index size (although as always, this will require testing for the application in question).

9. Use a dedicated machine or VM

Don’t share your Solr servers with any other demanding processes such as SQL databases. For dependable performance, Solr should not have to compete with other processes for resources. VMs are an effective way of ring-fencing resources.

10. Use MMapDirectory and 64-bit systems

By default, Solr on 64-bit systems will open indexes with Lucene’s MMapDirectory, which memory-maps files rather than opening them for read/write/seek. Don’t change this! MMapDirectory allows for the most effective use of resources, in particular RAM (which as already described is a crucial resource for search performance).

11. Tune the Solr caches

The OS disk cache improves performance at the low level. At the higher level, Solr has a number of built-in caches which are stored in the JVM heap, and which can improve performance still further. These include the filter cache, the field value cache, the query result cache and the document cache. The filter cache is probably the most important to tune if you are using filtered queries extensively or faceting with the enum method – each entry in the filter cache takes up ( number of docs on shard / 8 ) bytes of space, so if you’ve got a cache limit of 4,000 then you’ll require (numDocs * 500) bytes to hold all of them. However, tuning all of these caches has the potential to improve performance.
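As a quick back-of-the-envelope check of that calculation (a sketch, using the 4,000-entry cache size mentioned above and a made-up shard size):

def filter_cache_bytes(num_docs_on_shard, cache_size=4000):
    # Each filter cache entry is a bitset of one bit per document on the shard
    bytes_per_entry = num_docs_on_shard / 8
    return bytes_per_entry * cache_size

# e.g. a shard of 10 million documents with a full 4,000-entry filter cache
print(filter_cache_bytes(10_000_000) / (1024 ** 3), "GB")   # roughly 4.7 GB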

To tune the caches, you should allow Solr to run for a while with real or simulated search activity. Then go to the Plugin/Stats page in the admin web interface. The first important number in the cache statistics is ‘hitratio’. This should ideally be as close to 1.0 as possible, indicating that most lookups are being serviced by the cache. Then, ‘evictions’ indicates how many items have been removed from the cache due to limited space. This should ideally be as close to zero as possible, or at least much smaller than ‘lookups’.

If ‘evictions’ is high and ‘hitratio’ low, you should increase the maximum cache size in solrconfig.xml. It is impossible to say what a good starting point for a specific application is, but we often pick 4000.

If the cache is performing well, it may be worth reducing the maximum size and re-testing. The purpose of the maximum size is to prevent the cache growing without limit and filling the JVM heap, which links to point 12 below.

See here for more information on Solr caches.

12. Minimise JVM heap space

Once you have tuned your Solr caches, try to reduce the maximum JVM heap (set with -Xmx) to a reasonably small size – big enough to hold the caches and all the other data required for searching and indexing, but not much bigger. There is a graphical depiction of the JVM heap in the Solr admin dashboard which allows a quick overview for rough tuning. For a better picture, it may be worth using a tool like JConsole to monitor the heap as the application is used.

The reason to reduce the heap size is to free RAM for the OS disk cache, as described in point 8.

Garbage collection (GC) can be a problem if the heap size is large. See here for information on GC tuning in Solr and other performance issues.

13. Handle multiple languages with multiple fields

Some search applications need to be able to support documents of different languages within the same index. This may conflict with the use of stemming, stopwords and synonyms to improve search accuracy. Furthermore, languages like Japanese are not tokenised by Solr in the same way as European languages, due to different conventions on word boundaries. One effective method for supporting multiple languages in an index with per-language term processing is outlined as follows. Note that this depends on knowing in advance what language a section of text is in.

First, create a variant of each text field in the index schema for each language to be supported. The schema.xml supplied with Solr has example fieldtypes for a wide range of languages which may be adapted as necessary. For example:

<field name="content_en" type="text_en" indexed="true" stored="true"/>
<field name="content_fr" type="text_fr" indexed="true" stored="true"/>
<field name="content_jp" type="text_jp" indexed="true" stored="true"/>

Note the use of language codes to distinguish the names of the fields and fieldtypes. Then, when indexing each document, send each section of text to the appropriate field. E.g., if the document is entirely in English, send the whole thing to content_en. If it has sections in English, French and Japanese, send them to content_en, content_fr and content_jp respectively. This ensures that text is tokenised and normalised appropriately for its language.
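A sketch of that routing at indexing time, posting a JSON document to Solr's update handler; the core name is made up, and the language of each section is assumed to be known already, as the approach requires:

import json
import urllib.request

SOLR_UPDATE = 'http://localhost:8983/solr/mycore/update?commit=true'   # hypothetical core

# Sections whose language has already been identified
sections = [
    ('en', 'An English paragraph of the document.'),
    ('fr', 'Un paragraphe du document en francais.'),
]

doc = {'id': 'doc-1'}
for lang, text in sections:
    field = 'content_' + lang                      # content_en, content_fr, content_jp, ...
    doc[field] = (doc.get(field, '') + ' ' + text).strip()

request = urllib.request.Request(SOLR_UPDATE, data=json.dumps([doc]).encode(),
                                 headers={'Content-Type': 'application/json'})
print(urllib.request.urlopen(request).read().decode())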

Finally for searching, use the eDisMax query parser, and include all the language fields in the qf parameter (and pf, if using). E.g., in solrconfig.xml:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="qf">content_en content_fr content_jp</str>
    <str name="pf">content_en content_fr content_jp</str>
    ...

When a search is executed with this handler, subqueries will be generated for each language with the appropriate term processing, and searched against each language text field. This approach should give the best precision and recall in a multi-language application.

Principles of Solr application design – part 1 of 2
http://www.flax.co.uk/blog/2013/12/11/principles-of-solr-application-design-part-1-of-2/
Wed, 11 Dec 2013

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! So without further ado here’s the first part:

1. Use the latest release of Solr

Unless there are compelling reasons not to, such as reliance on a discontinued feature (which is rare), it is best to use the latest release of Solr, downloaded from http://lucene.apache.org/solr/. Every minor release in the 4.x series has brought both functional and performance enhancements, and revision releases have fixed known bugs. Since the API (as a rule) remains backwards compatible, the potential gains in performance and utility should outweigh the minor inconvenience of the upgrade.

2. Use SolrCloud for scaling and robustness

Before the Solr 4 release, support for sharding (distributing a single search over many Solr instances) and replication (for robustness and scaling search load) involved a significant amount of manual configuration and development. The introduction of SolrCloud means that sharding and replication are now built into the core product, and can be used with simple configuration and no extra coding.

For trivial applications, SolrCloud may not be required, but it is the simplest way to build in robustness and scalability. There’s more about SolrCloud here.

3. Don’t expose the Solr API

Although Solr is not inherently insecure, neither is it designed to be exposed to end-users (and emphatically not to the internet at large). Anyone with access to the root Solr endpoint would be able to delete indexes and modify or insert items at will. Restricting access to the search handlers alone (e.g. /solr/select) avoids this possibility, but is nonetheless a bad idea, since it may still allow users to construct arbitrary queries which could degrade performance or provide access to unauthorised data. Furthermore, there remains the slim possibility of security holes in the Solr API.

For these reasons, any external access to search should be through a proxy interface which is restricted to the functionality required by the application. Access to the Solr API should be restricted by network design and/or firewalls. This applies equally to AJAX UIs, which should talk to Solr via an intermediary web application rather than directly.

The intermediary code should perform at least some basic validation of parameters before sending to Solr, for example checking their type and ensuring that query strings are under a certain length (depending on the search interface). This allows attempts at compromising the system to be detected at an early stage and blocked.
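As an illustration, here is a minimal sketch of such an intermediary written with Flask; the endpoint name, limits and fixed parameter set are all arbitrary choices for the example:

from flask import Flask, request, jsonify, abort
import json
import urllib.parse
import urllib.request

SOLR_SELECT = 'http://localhost:8983/solr/mycore/select'   # not reachable by end users
MAX_QUERY_LENGTH = 200                                      # illustrative limit

app = Flask(__name__)

@app.route('/search')
def search():
    q = request.args.get('q', '')
    if not q or len(q) > MAX_QUERY_LENGTH:
        abort(400)                       # reject empty or over-long query strings early
    try:
        page = int(request.args.get('page', 0))
    except ValueError:
        abort(400)                       # reject non-numeric paging parameters

    # Only this fixed set of parameters ever reaches Solr
    params = {'q': q, 'start': page * 10, 'rows': 10, 'wt': 'json'}
    url = SOLR_SELECT + '?' + urllib.parse.urlencode(params)
    return jsonify(json.loads(urllib.request.urlopen(url).read().decode()))

if __name__ == '__main__':
    app.run(port=8002)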

4. Don’t use third-party Solr client libraries

The problem with third-party client libraries is that they create a tight coupling between the application and Solr. The Solr XML and JSON APIs are simple, and a wide range of client libraries for these formats are readily available for most programming languages. Third-party libraries are an unnecessary additional dependency and a potential source of bugs and unexpected behaviour. Another risk is that development may be discontinued for various reasons, meaning that future Solr features are not easily accessible.

The one exception to this rule is the SolrJ Java client library, since it is part of the general Solr release and is therefore fully compliant with and tested against the corresponding version of Solr.

5. Specify interfaces

All interfaces between components in the application must be agreed between sys ops and developers before development is started. Interfaces should be treated as contracts which software components adhere to. Early documentation of interfaces will reduce the risk of unexpected dependencies leading to problems in deployment.

As far as possible, interfaces should be RESTful web APIs and use standard formats such as JSON and XML. This creates loose coupling between components and also makes it easy to test functionality from the command line or a browser.

6. Put apps live early, on isolated systems

Development should be iterative, with short development cycles (no more than a few weeks). Code should be tested and deployed at the end of each cycle. By using isolated systems, fake data and/or limiting access to authorised testers, functionality and performance may be tested as soon as possible on a ‘live’ system, avoiding the risk of unexpected problems if deployment is postponed until the end of the development cycle.

7. Do realistic performance tests early and often

Except for very small indexes, search performance is often unpredictable, particularly under load. To ensure that performance meets requirements, testing a full index under load with realistic queries should be scheduled as early as possible in development. If you don’t have the data available to create a full index, simulate it (e.g. using freely available text such as Wikipedia).

As new functions, e.g. facets, are added performance characteristics may change significantly, so it is important that performance tests are part of every development cycle. JMeter is a popular tool for load testing; alternatively test scripts could be easily written in a language like Python.
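For example, a very small sketch of such a script: it fires a list of queries at a Solr search handler from several threads and reports response times (the URL and query list are placeholders; use your own realistic queries):

import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SOLR = 'http://localhost:8983/solr/mycore/select'               # placeholder URL
QUERIES = ['wikipedia', 'open source search', 'solr faceting']  # use realistic queries

def timed_search(q):
    url = SOLR + '?' + urllib.parse.urlencode({'q': q, 'wt': 'json'})
    start = time.time()
    urllib.request.urlopen(url).read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=8) as pool:
    times = sorted(pool.map(timed_search, QUERIES * 50))

print('median %.3fs, 95th percentile %.3fs' %
      (times[len(times) // 2], times[int(len(times) * 0.95)]))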

More to come next week!

An open approach to tuning search for gov.uk
http://www.flax.co.uk/blog/2013/06/12/an-open-approach-to-tuning-search-for-gov-uk/
Wed, 12 Jun 2013

Roo Reynolds from the GDS team has written a great blog post about the ongoing process of tuning the search for gov.uk which I can highly recommend.

We regularly see situations where a search project has been set up as ‘fire and forget’ – which is never a good idea: not only does content grow, but user needs change and search requirements evolve, whatever the application. Search should be a living project: monitoring user behaviour should reveal not just which searches ‘work’ (i.e. the user gets some results which they then click on) but, more importantly, which ones don’t. For example, common misspellings or acronyms might be a useful addition to a synonym list; if average search response times are lengthening then it might be time to consider performance tuning or even scaling out; the constant use of the ‘Next 10 Results’ button might indicate a problem with relevance ranking.
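As an illustration of the kind of monitoring meant here, a short sketch that scans a search log for frequent zero-result queries and slow responses; the log format (query, hit count, response time in milliseconds per line) is an assumption, so adapt it to whatever your search application records:

import csv

zero_results = {}
slow = []

# Assumed log format: query,hits,response_ms
with open('search.log') as f:
    for query, hits, response_ms in csv.reader(f):
        if int(hits) == 0:
            zero_results[query] = zero_results.get(query, 0) + 1
        if float(response_ms) > 500:
            slow.append((query, float(response_ms)))

# Frequent zero-result queries are candidates for synonyms or spelling suggestions
for query, count in sorted(zero_results.items(), key=lambda kv: -kv[1])[:10]:
    print(count, query)
print('%d searches took longer than 500ms' % len(slow))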

Luckily any improvements to gov.uk made by the GDS team should appear in their Github repository at some point – as I mentioned before the GDS team are (very sensibly) committed to an open source approach.

Clade – a freely available, open source taxonomy and autoclassification tool
http://www.flax.co.uk/blog/2012/06/12/clade-a-freely-available-open-source-taxonomy-and-autoclassification-tool/
Tue, 12 Jun 2012

One way to manage digital information is to classify it into a series of categories or a hierarchical taxonomy, and traditionally this was done manually by analysts, who would examine each new document and decide where it should fit. Building and maintaining taxonomies can also be labour intensive, as these will change over time (for a simple example, just consider how political parties change and divide, with factions appearing and disappearing). Search engine technology can be used to automate this classification process, with the taxonomy information used as metadata so that search results can be easily filtered by category, or automatically delivered to those interested in a particular area of the hierarchy.

We’ve been working on an internal project to create a simple taxonomy manager, which we’re releasing today in a pre-alpha state as open source software. Clade lets you import, create and edit taxonomies in a browser-based interface and can then automatically classify a set of documents into the hierarchy you have defined, based on their content. Each taxonomy node is defined by a set of keywords, and the system can also suggest further keywords from documents attached to each node.

[Screenshot: the main Clade user interface]

This screenshot shows the main Clade UI, with the controls:

A – dropdown to select a taxonomy
B – buttons to create, rename or delete a taxonomy
C – the main taxonomy tree display
D – button to add a category
E – button to rename a category
F – button to delete a category
G – information about the selected category
H – button to add a category keyword
I – button to edit a keyword
J – button to toggle the sense of a keyword
K – button to delete a keyword
L – suggested keywords
M – button to add a suggested keyword
N – list of matching document IDs
O – list of matching document titles
P – before and after document ranks

Clade is based on Apache Solr and the Stanford Natural Language Processing tools, and is written in Python and Java. You can run it on either Unix/Linux or Windows platforms – do try it and let us know what you think; we’re very interested in any feedback, especially from those who work with and manage taxonomies. The README file details how to download and install it.

Whitepaper – Why you should be considering open source search
http://www.flax.co.uk/blog/2011/06/22/whitepaper-why-you-should-be-considering-open-source-search/
Wed, 22 Jun 2011

I’ve uploaded a whitepaper I wrote a short while ago:

“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”

Background resources for Enterprise Search
http://www.flax.co.uk/blog/2011/01/19/background-resources-for-enterprise-search/
Wed, 19 Jan 2011

If you’re planning an enterprise search project and have no background in the technologies or principles involved, here are some tips to get you started. This isn’t going to be a definitive list so if you know more, please do comment.

There haven’t been a lot of books written on this area over the years, but more are appearing now (especially on open source options). Managing Gigabytes is a good, if slightly elderly, starting point on basic principles. For thoughts on search user interfaces try Peter Morville’s Search Patterns and for an application focus there’s the recent Search Based Applications. For those developing in the Lucene/Solr world there’s the classic (and recently updated) Lucene in Action and the related Solr 1.4 Enterprise Search Server and Building Search Applications: Lucene, LingPipe, and Gate.

Most people will (of course) start their research on the web, although sometimes it’s hard to find nuggets of real information amongst all the marketing. Wikipedia has a list of vendors, including open source solutions, and Avi Rappaport maintains the useful (although not completely up to date) Search Tools website. Some vendors and some open source projects provide FAQs and tutorials (for example the Lucene FAQ, Xapian and Sphinx documentation), which may also contain general information about search principles.

You might also consider joining discussion groups such as the popular LinkedIn Enterprise Search Engine Professionals or a local Meetup group. Training is another option – offered by some vendors and open source companies such as ourselves.

More about LucidWorks Enterprise
http://www.flax.co.uk/blog/2010/11/05/more-about-lucidworks-enterprise/
Fri, 05 Nov 2010

If you’re considering a Lucene/Solr powered search solution, you may be interested in LucidWorks Enterprise, produced by our partners Lucid Imagination. They’ve taken Lucene/Solr and added a powerful admin GUI, ReST API, web spiders, file crawlers, database connectors, alerts, a clickthrough framework and more. All this comes with a range of excellent support options backed by the experts at Lucid.

If you’d like to know more read this downloadable PDF or contact us for more information and a demo.
