Core search features are increasingly a commodity. You can knock up some indexing scripts in the scripting language of your choice in a short time, build a searchable inverted index with freely available open source software, and hook up your search UI quickly via HTTP. All of this used to be a lot harder than it is now (unfortunately, some vendors would have you believe it still is, which is reflected in their hefty price tags).
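To give a feel for how little code the core idea needs, here is a minimal sketch of an inverted index in plain Python. It is a toy for illustration only, not how any particular open source engine works internally: the function names (`tokenize`, `build_inverted_index`, `search`) and the sample documents are made up for this example, and a real engine adds ranking, stemming, persistence and much more.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # .get() avoids creating empty entries for unknown terms
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by document id
docs = {
    1: "Open source search engines are widely available",
    2: "Commercial search vendors charge hefty licence fees",
    3: "Open source tools for text analysis",
}
index = build_inverted_index(docs)
print(search(index, "open source"))  # documents 1 and 3
print(search(index, "search"))       # documents 1 and 2
```

The index is simply a mapping from each term to the documents containing it, which is why a lookup only has to intersect a few sets rather than scan every document.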
However, we’re increasingly asked to develop features outside the traditional search stack, either to make this standard search a lot more accurate and relevant or to apply ‘search’ to non-traditional areas. For example, Named Entity Recognition (NER) is a powerful technique for extracting entities such as proper names from text – these can then be fed back into the indexing process as metadata for each document. Part of Speech (POS) tagging tells you which words are nouns, verbs and so on. Sentiment Analysis promises to give you some idea of the ‘tone’ of a comment or news piece, for example positive, negative or neutral, which is very useful in e-commerce applications (did customers like your product?). Word Sense Disambiguation (WSD) attempts to tell you which sense of a word is intended, based on its context (did you mean a pen for writing or a pen for livestock?).
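The idea of feeding extracted entities back into indexing as metadata can be sketched as follows. Note the loud assumption: the regular expression here is a crude stand-in for a real NER model (it just treats runs of two or more capitalised words as candidate entities), and the `entities` field name, the helper functions and the sample document are all hypothetical. In practice you would swap `extract_entities` for a call to a proper NER tool.

```python
import re

# ASSUMPTION: a crude stand-in for a real NER model. It treats runs of
# two or more capitalised words as candidate entities, so it will miss
# single-word names and be fooled by capitalised sentence openers.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return candidate entities in order of first appearance, no duplicates."""
    seen = []
    for match in ENTITY_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def enrich(doc):
    """Return a copy of the document with an added 'entities' metadata field."""
    enriched = dict(doc)
    enriched["entities"] = extract_entities(doc["body"])
    return enriched


# Hypothetical document as it might look just before indexing
doc = {"id": 42, "body": "The report says Jane Smith visited New Zealand last year."}
enriched = enrich(doc)
print(enriched["entities"])  # ['Jane Smith', 'New Zealand']
```

Once the entities sit in their own field, the search engine can index them separately from the body text, which makes it possible to filter or facet on people and places rather than relying on keyword matches alone.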
There are commercial offerings from companies such as Nstein and Lexalytics that cover some of these features. An increasing number of companies provide their services as APIs, where you pay per use: for example Thomson Reuters’ OpenCalais service, Pingar from New Zealand and WSD specialists SpringSense. We’ve also worked with open source tools such as Stanford NLP which perform very well when compared to commercial offerings (and can certainly compete on cost grounds). Gensim is a powerful package that allows for semantic modelling of topics. The Apache Mahout machine learning library allows these techniques to be scaled to very large data sets.
These techniques can be used to build systems that provide not just powerful, enhanced search but also automatic categorisation and classification into taxonomies, document clustering, recommendation engines and automatic identification of similar documents. It’s great to be thinking outside the box – the search box, that is!