I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).
Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking. Although most other search technologies use the vector space model, there are a few other examples of this approach. One is Muscat, a company founded a few years before Autonomy and literally across the hall from it in a Cambridge incubator, which grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years, but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.
Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.
Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking, and we’ve used it successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember, though, that Bayesian ranking is only one way to approach a search problem, and in many cases it is simply unnecessary.
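To show how little is hidden inside a probabilistic ranking function, here is a minimal sketch of BM25 scoring in plain Python. It is an illustration only, not the code used by IDOL, Muscat, Xapian or Lucene: the function name `bm25_scores`, the toy documents and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example, and the IDF term uses one common smoothed variant. It expects a non-empty list of already-tokenised documents.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms` with BM25.

    Illustrative sketch only; `docs` must be a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed inverse document frequency; the "1 +" keeps it positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalised by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, tokenised by whitespace.
docs = [
    "bayesian ranking for search engines".split(),
    "vector space model for search".split(),
    "cooking with garlic and butter".split(),
]
scores = bm25_scores(["bayesian", "search"], docs)
```

The first document matches both query terms and so scores highest, the second matches only the more common term, and the third scores zero. Everything else is term counting and a logarithm.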
It certainly isn’t magic.
I think you are completely missing the point.
It’s nothing to do with search ranking algorithms but how you can extract results from the engine and get data into it.
If you can’t index all types of data, then your engine is next to useless in a corporate environment. If you can’t query the engine using an entire document to find similar documents (instead of manual search) then for most end users the system won’t be used. Bayesian stats ALLOW you to use entire documents to query. Muscat et al. could never index all data types or query FAST with documents.
I’m sorry, but you’re incorrect. Both Muscat and the open source Xapian use Bayesian stats and allow you to use whole documents as a query; it’s one thing probabilistic engines are particularly good at. I’m not sure what you mean by ‘query FAST with documents’ in this context, as FAST was a different technology.
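As a rough illustration of what ‘using a whole document as a query’ can mean, here is a minimal Python sketch: pick the most distinctive terms from the query document, weighted by how rare they are in the collection, and rank the other documents by the weight of the terms they share. This is not how Muscat, Xapian or IDOL actually implement it; the function names `top_query_terms` and `similar_documents`, the `max_terms` parameter and the toy documents are all assumptions made up for the example.

```python
import math
from collections import Counter


def _idf(term, df, n_docs):
    # Smoothed inverse document frequency: rarer terms get a higher weight.
    return math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))


def top_query_terms(query_doc, docs, max_terms=3):
    """Pick the most distinctive terms of `query_doc` that occur in `docs`."""
    df = Counter(term for d in docs for term in set(d))
    tf = Counter(query_doc)
    # Terms absent from the collection cannot match anything, so skip them.
    candidates = [t for t in tf if df[t] > 0]
    # Sort by descending weight, breaking ties alphabetically for stability.
    candidates.sort(key=lambda t: (-tf[t] * _idf(t, df, len(docs)), t))
    return candidates[:max_terms]


def similar_documents(query_doc, docs, max_terms=3):
    """Return indices of documents sharing the selected terms, best first."""
    df = Counter(term for d in docs for term in set(d))
    terms = top_query_terms(query_doc, docs, max_terms)
    scored = []
    for i, d in enumerate(docs):
        present = set(d)
        score = sum(_idf(t, df, len(docs)) for t in terms if t in present)
        if score > 0:
            scored.append((score, i))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored]


# Hypothetical toy collection and query document, tokenised by whitespace.
docs = [
    "hp writes down autonomy acquisition".split(),
    "autonomy idol search technology".split(),
    "xapian open source search library".split(),
    "recipe for garlic butter".split(),
]
query_doc = "open source search library with bayesian ranking".split()
matches = similar_documents(query_doc, docs, max_terms=4)
```

A real engine would do this against an inverted index rather than by scanning every document, but the principle is the same: the long ‘query’ is reduced to a handful of weighted terms.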
All ‘enterprise search’ engines will attempt to index ‘all’ kinds of data on a corporate network and there are multiple ways to do this, such as file filters or even string extraction. Autonomy did of course buy in KeyView to assist this process, but KeyView isn’t unique; there are even lots of open source file filters. There are always going to be binary formats which are difficult to index: video and images are particularly hard, although sometimes you can ‘cheat’ by using subtitles or script text, for example.