As part of our BioSolr project, we’ve been discussing how best to create a federated search over several Apache Solr instances. In this case, various research institutions across the world are annotating data objects representing proteins, and it would be useful to search not just the original protein data but also what others have added to the body of knowledge. If an institution wants to use the annotations, the usual approach is to download the extra data regularly and add it to a local Solr index.
Luckily, Solr is widely used in the bioinformatics community, so we have commonality in the query API. The question is whether it would be possible to use some of the distributed querying capabilities of SolrCloud to search not just the shards of a single index, but a group of Solr/SolrCloud indices – a supercluster.
This is a bit like a standard federated search, where queries are farmed out to various disparate search engines and the results then combined and displayed in some fashion. However, since we are sharing a single technology, powerful features such as result grouping would be possible.
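To make the idea concrete, here is a minimal SolrJ sketch of what such a supercluster query might look like. The host names, core names and the protein_family grouping field are all hypothetical; the shards parameter is Solr’s standard mechanism for farming a query out to multiple indexes, which here happen to live at different institutions.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SuperclusterQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical aggregator node; any consortium member could play this role.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://solr.inst-a.example.org/solr/proteins").build();

        SolrQuery query = new SolrQuery("kinase");

        // Solr's distributed search: treat the partners' independent
        // indexes as remote "shards" of one logical index.
        query.set("shards",
                "solr.inst-a.example.org/solr/proteins,"
              + "solr.inst-b.example.org/solr/annotations");

        // Result grouping works in distributed mode too, provided the
        // grouping field is part of the shared schema (name is illustrative).
        query.set("group", true);
        query.set("group.field", "protein_family");

        QueryResponse response = client.query(query);
        System.out.println(response.getGroupResponse().getValues());
        client.close();
    }
}
```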
For this to work at all, there would need to be some agreed standard between the various Solr systems: a globally unique record identifier, for example (possibly implemented with a prefix unique to each institution). Any data that was required for result grouping would have to share a schema across the entire supercluster – let’s call this the primary schema – but basic searching and faceting could still be carried out over data with a differing, secondary schema. Solr dynamic fields might be useful for this secondary schema.
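As a sketch of how the two schemas might coexist, a shared schema.xml fragment could look something like the following. The field names are purely illustrative, not a proposed standard:

```xml
<!-- Primary schema: fields every member of the supercluster must provide. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="institution" type="string" indexed="true" stored="true"/>
<field name="protein_family" type="string" indexed="true" stored="true"/>

<!-- Secondary schema: dynamic fields let each institution index its own
     annotations without coordinated schema changes across the consortium. -->
<dynamicField name="*_s"   type="string"       indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
```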
Fortunately, research institutions are used to working as part of a consortium, and one of the conditions for joining would be agreeing to some common standards. A single Solr query API would then be available to all members of the consortium, to search not just their own data but everything available from their partners, without the slow and error-prone process of copying the data for local indexing.
We’re currently evaluating the feasibility of this idea and would welcome input from others – let us know what you think in the comments!
We have a hacked version of this running locally, emulating the Solr API (or something close to it, at least). It would be great to have a common implementation of this functionality – ours is not a viable starting point, as it is far too closely geared to our own needs.
Of course there needs to be some common core, at least conceptually, for such a distributed search to make sense, but a lot of it can be handled without changing the endpoint indexes. There is no real need for globally unique identifiers, as the addition and removal of site-specific prefixes can be handled in the aggregator. Likewise, variations in field names, or even in terms, can be handled by mapping. One big challenge is relevance ranking, but SOLR-1632 should help there, coupled with aggregator-dictated weighting of the different sites to compensate for local boosting. We do grouping across one Solr instance that has the grouping field and one that does not, compensating by presenting the document results from the Solr missing the field as single-entry groups.
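To illustrate, the prefix and field-name mapping might live in a small aggregator-side helper along these lines. This is a sketch only; the class and all names in it are hypothetical, not taken from an actual implementation:

```java
import java.util.Map;

/** Sketch of the aggregator-side mapping described above (hypothetical). */
public class SiteMapper {
    private final String sitePrefix;            // e.g. "instA-"
    private final Map<String, String> fieldMap; // shared field -> local field

    public SiteMapper(String sitePrefix, Map<String, String> fieldMap) {
        this.sitePrefix = sitePrefix;
        this.fieldMap = fieldMap;
    }

    /** Add the site prefix when results leave the local index... */
    public String globalId(String localId) {
        return sitePrefix + localId;
    }

    /** ...and strip it again when a request is routed back to that site. */
    public String localId(String globalId) {
        return globalId.startsWith(sitePrefix)
                ? globalId.substring(sitePrefix.length())
                : globalId;
    }

    /** Translate a shared field name into whatever the local schema calls it. */
    public String localField(String sharedField) {
        return fieldMap.getOrDefault(sharedField, sharedField);
    }
}
```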
The most important parts of the Solr API are fairly stable, so the major missing piece seems to be a Solr-bridge: exposed as a plain shard, but internally taking care of the mapping and the calls to a remote Solr.
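As a very rough sketch of what such a bridge could look like: an HTTP endpoint that behaves like a plain shard but forwards requests to a remote Solr. The remote host, port and path are placeholders, and the actual mapping step is left as a comment:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;

/** Rough sketch of a Solr-bridge: listens like a plain shard and proxies
 *  each request to a remote Solr. All names here are placeholders. */
public class SolrBridge {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8984), 0);
        server.createContext("/solr/bridge/select", exchange -> {
            // In a real bridge, rewrite the query string here: map shared
            // field names and strip/add the site-specific id prefix.
            String query = exchange.getRequestURI().getRawQuery();
            URL remote = new URL(
                "http://solr.remote-site.example.org/solr/proteins/select?" + query);
            try (InputStream in = remote.openStream()) {
                byte[] body = in.readAllBytes();
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
        });
        server.start();
        System.out.println("Bridge shard listening on :8984");
    }
}
```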
You should consider discussing this subject on the developer mailing list.