A customer of ours has a potential search application which requires (largely for performance reasons) the ability to update specific individual fields of Apache Lucene documents. This is not the first time that someone has asked for this functionality. Until now, however, it has been impossible to change field values in a Lucene document without re-indexing the entire document. This is a consequence of the write-once design of Lucene index segment files, which means the entire file would have to be re-written if a single value changed.
However, the introduction of pluggable codecs in Lucene 4.0 means that the concrete representation of index segments has been abstracted away from search functionality and can be specified by the codec designer. The motivation for this was to make it possible to experiment with new compression schemes and other innovations; however, it may also provide a way to overcome the current limitation of whole-document-only updates.
Andrzej Bialecki has proposed a “stacked update” design on top of the Lucene index format, in which changed fields are represented by “diff” documents which “overlay” the values of an existing document. If the “diff” document does not contain a certain field, then the value is taken from the original, overlaid document. This design is currently a work in progress.
Approaching the challenge independently, we have started to experiment with an alternative design, which makes a clear distinction between updatable and non-updatable fields. This is arguably a limitation, but one which may not matter in many practical applications (e.g. adding user tags to documents in a corpus). Non-updatable fields are stored using the standard Lucene codec, while updatable fields are stored externally by a codec that uses Redis, an open-source, flexible, fast key-value store. Updates to these fields can then be made directly in the Redis store using the JRedis library.
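To give a flavour of how the field routing might work, here is a minimal sketch in the style of Lucene 4.0, assuming a hypothetical RedisPostingsFormat for the updatable fields (the field-name convention is illustrative only):

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;

// Route updatable fields to a Redis-backed postings format, leaving all
// other fields on the default Lucene 4.0 format. RedisPostingsFormat is
// hypothetical here, and the field-name prefix is purely illustrative.
public class RedisRoutingCodec extends Lucene40Codec {

    private final PostingsFormat redisFormat = new RedisPostingsFormat();

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        // Only updatable fields (e.g. user tags) live in Redis; everything
        // else keeps the standard write-once segment files.
        if (field.startsWith("updatable_")) {
            return redisFormat;
        }
        return super.getPostingsFormatForField(field);
    }
}

The codec is then set on the IndexWriterConfig (via setCodec()), so that new segments – including segments produced by merges – are written through it.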
We have written a minimal, 2-day proof of concept, which can be checked out with:
svn checkout http://flaxcode.googlecode.com/svn/trunk/LuceneRedisCodec
There is still a significant amount of work to be done to make this approach robust and performant (e.g. when Lucene merges segments, the Redis document IDs will have to be remapped). At this stage we would welcome any comments and suggestions about our approach from anyone who is interested in this area of functionality.
Very cool!
I think on merge you shouldn’t have to remap document IDs? Once a segment is written, its docIDs are fixed, and merging just writes a new segment. So I think it should “just work”.
Thanks Mike!
The merge issue is down to the fact that Lucene segments do get replaced during a merge. For example, say I have two segments, each with three docs:
[1, 2, 3] + [1, 2, 3]
then after merging we will just have
[1, 2, 3, 4, 5, 6]
and the Redis codec will have to know about this. (I’m not the guy who implemented the POC so my understanding might be a bit off...)
Hi Tom,
The codec is in fact used to write the newly merged segment, so the Redis codec will see docs 1-6 being written. So I think you’ll be fine there.
Though, likely you’ll need to do something (maybe make a Directory wrapper?) to delete the postings from Redis when Lucene deletes segment files (after merge).
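Something like this (a rough, untested sketch) might be enough – it assumes the FilterDirectory base class from later 4.x releases (on 4.0 you’d delegate to the wrapped Directory by hand), and RedisSegmentStore is a hypothetical helper that drops a segment’s keys:

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;

public class RedisAwareDirectory extends FilterDirectory {

    private final RedisSegmentStore redisStore;

    public RedisAwareDirectory(Directory in, RedisSegmentStore redisStore) {
        super(in);
        this.redisStore = redisStore;
    }

    @Override
    public void deleteFile(String name) throws IOException {
        in.deleteFile(name);
        // Segment files are named like "_3.frq", "_3.tim", etc., so the part
        // before the first '.' is the segment name; dropping the segment's
        // Redis keys is idempotent, so doing it once per file is harmless.
        int dot = name.indexOf('.');
        String segmentName = dot == -1 ? name : name.substring(0, dot);
        redisStore.deleteSegment(segmentName);
    }
}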
Mike
http://blog.mikemccandless.com
Hello, author of the POC here.
Mike is right: merged segments are written through the codec, so the document ID remapping happens automatically. I hadn’t got as far as dealing with segment deletes yet, but it should be pretty simple – segment names are included in the Redis key names, and it’s trivial to get a list of keys that match a given pattern (in this case, *_segmentname) and delete them.
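For example, something along these lines would do the cleanup – this is a sketch using the Jedis client (the POC itself goes through JRedis, whose API differs), assuming the *_segmentname key scheme described above:

import java.util.Set;
import redis.clients.jedis.Jedis;

public class RedisSegmentCleanup {

    // Delete every Redis key belonging to a segment that Lucene has dropped.
    public static void deleteSegmentKeys(Jedis redis, String segmentName) {
        Set<String> keys = redis.keys("*_" + segmentName);
        if (!keys.isEmpty()) {
            redis.del(keys.toArray(new String[keys.size()]));
        }
    }
}

(A production version would probably want to keep an explicit per-segment set of keys rather than scanning the whole keyspace.)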
The other obvious improvement here is to generalise the updates to use the existing codec writing machinery. At the moment we’re using a very naive postings format (basically just a list of integers, no skip lists or compression, and no support for frequency or position information). It should be possible to write something that combines an existing DocsEnum/DocsAndPositionsEnum with a series of Diffs, so that we can store postings data in one of the existing compressed formats and then rewrite term entries by streaming the data in, applying the diffs, and writing it out again in a format-independent fashion.
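As a very rough sketch of that idea (the diff representation here – a removed-docs set plus a sorted list of added docIDs – is just an assumption for illustration, and a real implementation would stream into a postings writer rather than build a list):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.search.DocIdSetIterator;

public class PostingsDiffApplier {

    // Walk an existing DocsEnum (whatever compressed format it came from),
    // drop docIDs removed by the diff, and merge in added docIDs in order.
    public static List<Integer> apply(DocsEnum postings,
                                      Set<Integer> removedDocs,
                                      List<Integer> addedDocsSorted) throws IOException {
        List<Integer> result = new ArrayList<Integer>();
        Iterator<Integer> added = addedDocsSorted.iterator();
        Integer nextAdded = added.hasNext() ? added.next() : null;

        for (int doc = postings.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = postings.nextDoc()) {
            // Emit any added docIDs that sort before the current posting.
            while (nextAdded != null && nextAdded < doc) {
                result.add(nextAdded);
                nextAdded = added.hasNext() ? added.next() : null;
            }
            if (!removedDocs.contains(doc)) {
                result.add(doc);
            }
        }
        // Any added docIDs beyond the end of the original postings.
        while (nextAdded != null) {
            result.add(nextAdded);
            nextAdded = added.hasNext() ? added.next() : null;
        }
        return result;
    }
}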
Could you implement this? We also tried a similar (though not identical) approach, but we ran into some problems which we could not solve with the codec approach. Please share your thoughts and comments on the following:
Problem 1: Merge – the Lucene merge thread keeps the new merged segment as a checkpointed segment, and it is not yet committed.
There are two possible approaches here:
a) The custom PostingsConsumer / TermsConsumer does not write the merge information (docID renumbering info) to Redis, and instead stores it in memory. The partial producer can then search this in-memory structure for the new merged segment. Or,
b) At merge, write the new merged state to the Redis store.
The problem with approach b) is that a reader may not yet have been opened on the new merged segment, but the Redis store has already removed the old segments which were merged. Search will fail in this case.
The problem with approach a) is that the in-memory merge info can only be written to Redis at the next flush, because a custom PostingsFormat is invoked only at flush() or merge(). However, in a case like Solr’s Optimize command, there can be a commit without anything to flush. In this case, the in-memory merge info will not reach the Redis store, but the previously uncommitted merged segment is committed in the Lucene index. This is not a problem for search; however, if you do something like optimize-and-replicate, replication will pick up inconsistent information from stable storage.
Problem 2: Replication – how do you sync up the indexed data in Lucene’s index directory with the Redis data directory? Both are written to stable storage asynchronously.
Problem 3: As mentioned above, the custom PostingsFormat is only invoked at flush or merge. Suppose you add documents and then update this field, both within the same (not yet flushed) segment. This is not possible, because the field with the custom postings format has not yet been written anywhere.
Problem 4: LiveDocs issues. Lucene can mark a document dead, and the custom postings format will only get this information at merge time. It appears that this is not a problem, because a dead doc will be discovered by the DocsEnum during search. But it is a problem when reindexing docs, if the reindexed docs go to an uncommitted segment.
Problem 5: A segment with no live docs is dropped at the next commit. This drop information does not reach the custom postings consumer, and it becomes messy to check Redis at every flush for segments whose docs are all dead. Again, dead docs or dropped segments remaining in Redis might not be a critical issue to solve – but it depends on your reindexing requirements.
Hi Aditya,
Thanks for your comment. The Redis Codec was just a proof of concept and we haven’t taken the idea any further. I think the Lucene mailing list (which I note you have already posted to) will be your best source of further help. There have been some recent improvements to Lucene DocValues which might also be worth investigating.
Charlie
Thanks Charlie for your response.