market – Flax – The Open Source Search Specialists – http://www.flax.co.uk

Elastic acquires Swiftype and broadens its offering to include enterprise search – Thu, 09 Nov 2017 – http://www.flax.co.uk/blog/2017/11/09/elastic-acquires-swiftype-broadens-offering/

The news today that Elastic (the company behind the open source Elasticsearch software) has acquired Swiftype will have surprised a few people, even though Elastic has already acquired a good number of other companies. Swiftype have a couple of products that deliver cloud-based site and enterprise search, and under the hood all of this is built on Elasticsearch. Swiftype are part of a new breed of enterprise search companies – like Lucidworks and Attivio, often based on open source cores, able to index cloud applications and data, and offering modern, clean, responsive user interfaces.

The same problems remain, however, with making enterprise search work in practice: data locked in hard-to-access legacy systems, low-quality content and metadata, unrealistic expectations driven by over-optimistic marketing and, most importantly, the various people factors that affect all cross-departmental, large-scale IT systems. No matter how clever the software, without the right people with the right training it’s very hard to deliver effective search.

It remains to be seen how the acquisition will change the enterprise search market – Elastic certainly have significant funding and admirable ambition – which in itself is probably enough to worry a few of Swiftype’s competitors.

A lack of cognition and some fresh FUD from Forrester – Wed, 14 Jun 2017 – http://www.flax.co.uk/blog/2017/06/14/lack-cognition-fresh-fud-forrester/

Last night the estimable Martin White, intranet and enterprise search expert and author of many books on the subject, flagged up two surprising articles from Forrester who have declared that Cognitive Search (we’ll define this using their own terms in a little while) is ‘overshadowing’ the ‘outmoded’ Enterprise Search, with a final dig at how much better commercial options are compared to open source.

Let’s start with the definition, helpfully provided in another post from Forrester. Apparently ‘Cognitive search solutions are different because they: Scale to handle a multitude of data sources and types’. Every enterprise search engine promises to index a multiplicity of content, both structured and unstructured, so I can’t see why this is anything new. Next we have ‘Employ artificial intelligence technologies….natural language processing (NLP) and machine learning’. Again, NLP has been a feature of closed and open source enterprise search systems for years, be it for entity extraction, sentiment analysis or sentence parsing. Machine learning is a rising star but not always easy to apply to search problems. However, I’m not convinced either of these is really ‘artificial intelligence’. Astonishingly, the last point is that Cognitive solutions ‘Enable developers to build search applications…provide SDKs, APIs, and/or visual design tools’. Every search engine needs user applications on top and has APIs of some kind, so this makes little sense to me.

Returning to the first article, we hear that indexing is ‘old fashioned’ (try building a search application without indexing – I’d love to know how you’d manage that!) but luckily a group of closed-source search vendors have managed to ‘out-innovate’ the open source folks. We have the usual hackneyed ‘XX% of knowledge workers can’t find what they need’ phrases plus a sprinkling of ‘wouldn’t it be nice if everything worked like Siri or Amazon or Google’ (yes, it would, but comparing systems built on multi-billion-page Web indexes by Internet giants to enterprise search over at most a few million non-curated, non-hyperlinked business documents is just silly – these are entirely different sets of problems). Again, we have mentions of basic NLP techniques as if they’re something new and amazing.

The article mentions a group of closed source vendors who appear in Forrester’s Wave report, which, like Gartner’s Magic Quadrant, attempts to boil down what is in reality a very complex field into some overly simplistic graphics. Finishing with a quick dig at two open source companies (Elastic, who don’t really sell an enterprise search engine anyway, and Lucidworks, whose Fusion 3 product really is a serious contender in this field, integrating Apache Spark for machine learning), it ignores the fact that open source search is developing at a furious rate – machine learning features that actually work in practice are being built and used by companies such as Bloomberg, and because they’re open source, these are available for anyone else to use.

To be honest, it’s very difficult, if not impossible, to out-innovate thousands of developers across the world working in a collaborative manner. What we see in articles like the above is not analysis but marketing – a promise that shiny magic AI robots will solve your search problems, even if you don’t have a clear specification, an effective search team, clean and up-to-date content and all the many other things that are necessary to make search work well (to research this further, read Martin’s books or the one I’m co-authoring at present – out later this year!). One should also bear in mind that marketing has to be paid for – and I’m pretty sure that the various closed-source vendors now providing downloads of Forrester’s report (because, of course, they’re mentioned positively in it) don’t get to do so for free.

UPDATE: Martin has written three blog posts in response to both Gartner and Forrester’s recent reports which I urge you (and them) to read if you really want to know how new (or not) Cognitive Search is.

Time to replace your Google Search Appliance with open source search – Tue, 09 Feb 2016 – http://www.flax.co.uk/blog/2016/02/09/time-replace-google-search-appliance-open-source-search/

As many others have noted, Google have recently announced their Google Search Appliance (GSA) will not be available for sale from 2017. Search gurus Miles Kehoe and Martin White have written an insightful analysis of the move with some recommendations as to what to do – because your GSA will simply stop working once the 2-year license expires. I don’t agree with Laurent Fanichet of Sinequa that this “seals the end of the era of commoditized search” – the rapid rise of open source search options has been the main driver of this commoditization, not the GSA. The appliance was also not at all cheap to run for large collections as Stephen Arnold has often noted.

So if you’re unfortunate enough to be one of the possibly thousands of GSA users worldwide, what next? There are a number of alternative search appliances and cloud solutions of course, but you should also consider insulating yourself from any future shocks by taking ownership of your search solution with a fully open source stack. Apache Lucene/Solr and Elasticsearch are great starting points for this, and you won’t be the first to make the change (here’s an article from 2011 on why a project switched from Google Search Appliance to Zend_Lucene, which also lists some further problems with the GSA). A great advantage of open source is that you won’t be subject to the whims or market success/failure of a vendor – you have a lot more control.

I know of at least one (very) large government agency in the UK which, having limped along for a long while with the FAST ESP search engine (yes, it’s still out there, over five years after Microsoft signalled it wouldn’t be supported), was considering the GSA as a replacement. It may be time for them to consider other options!

As ever, we’re happy to help anyone considering a migration – just get in touch.

London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg – Wed, 16 Dec 2015 – http://www.flax.co.uk/blog/2015/12/16/london-text-analytics-meetup-making-sense-text-lumi-signal-bloomberg/

This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but crammed in three great talks and attracted a wide range of people from both academia and business. We started with Gabriella Kazai of Lumi, talking about how they have built a crowd-curated content platform for around 80,000 users whose interests and recommendations are mined so as to recommend content to others. Using Elasticsearch as a base, the system ingests around 100 million tweets a day and follows links to any quoted content, which is then filtered and analyzed using a variety of techniques including NLP and NER to produce a content pool of around 60,000 articles. I’ve been aware of Lumi since our ex-colleague Richard Boulton worked there but it was good to understand more about their software stack.
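
The broad shape of such a pipeline – pull tweets, resolve quoted links, enrich with NLP/NER, index the result – is worth sketching. Below is a minimal, illustrative Python version of the final enrich-and-index step; it assumes a local Elasticsearch node and the modern `_doc` REST endpoint, and `extract_entities` is a deliberately naive stand-in for whatever NER toolkit Lumi actually use (their stack wasn’t detailed beyond Elasticsearch).

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node


def extract_entities(text):
    """Naive stand-in for a real NER step (spaCy, Stanford NER, etc.)."""
    return sorted({tok.strip(".,") for tok in text.split() if tok[:1].isupper()})


def index_article(doc_id, title, body):
    """Enrich an article with entities and index it over the REST API."""
    doc = {"title": title, "body": body, "entities": extract_entities(body)}
    req = urllib.request.Request(
        f"{ES_URL}/articles/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(index_article("1", "Example article",
                        "Lumi recommends content to readers in London every day."))
```

In a real system the filtering and NER would be far more sophisticated, and at 100 million tweets a day the indexing would of course go through the bulk API rather than one document at a time.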

Next was Miguel Martinez-Alvarez of Signal, who are also dealing with huge amounts of data on a daily basis – over a million documents a day from over 100,000 sources, plus millions of blogs. Their ambition is to analyse “all the world’s news” and allow their users to create complex queries over it – “all startups in London working on Machine Learning” being one example. Their challenges include dealing with around two-thirds of their ingested news articles being duplicates (due to syndicated content, for example) and they have built a highly scalable platform, again with Elasticsearch as a major part. Miguel talked in particular about how Signal work closely with academic researchers (including Professor Udo Kruschwitz of the University of Essex, with whom I will be collaborating next year) to develop cutting-edge analytics, with an Agile Data Science approach built around some key evaluation questions: will it scale? Will the accuracy gain be worth the extra computing power?
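
Syndicated copy is usually near-identical rather than byte-identical, so exact hashing isn’t enough for that two-thirds duplicate problem. Signal didn’t describe their exact method, but a common approach is word-shingling plus Jaccard similarity; here is a toy sketch (a real system would use MinHash/LSH to avoid comparing every pair of documents):

```python
def shingles(text, k=5):
    """Lower-cased word k-grams ('shingles') for a document."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc, kept, threshold=0.8):
    """True if doc overlaps heavily with any previously kept document."""
    s = shingles(doc)
    return any(jaccard(s, shingles(other)) >= threshold for other in kept)


articles = [
    "Acme Corp announced record profits today in London as shares rose sharply on the news.",
    "Acme Corp announced record profits today in London as shares rose sharply on the news, agencies reported.",
    "A completely different story about machine learning startups in Cambridge.",
]
kept = []
for article in articles:
    if not is_near_duplicate(article, kept):
        kept.append(article)
print(len(kept), "unique articles kept")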

Our last talk was from Miles Osborne of our hosts Bloomberg, who have recently signed a deal with Twitter to be able to ingest all past and forthcoming tweets – now that’s Big Data! The object of Miles’ research is to identify tweets that might affect a market and can thus be traded on, as early as possible after an event happens. His team have noticed that these tweets are often well-written (as opposed to the noise and abbreviations in most tweets) and seldom re-tweeted (no point letting your competitors know what you’ve spotted). Dealing with 500m tweets a day, they have developed systems to filter and route tweets into topic streams (which might represent a subject, location or bespoke category) using machine learning. One approach has been to build models using ‘found’ data (i.e. data that Bloomberg already has available) and to pursue a ‘simple is best’ methodology – although one model has 258 million features! Encouragingly, the systems they have built are now ‘good enough’ to react quickly enough to a crisis event that might significantly affect world markets.
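
The feature counts Miles mentioned make more sense once you know the trick: hashing token (and n-gram) features into a huge fixed-size space means no vocabulary has to be stored, which is how a model ends up with hundreds of millions of features. The sketch below is purely illustrative – a keyword-seeded linear scorer over hashed features, not Bloomberg’s actual system, and the topic names and weights are made up:

```python
import hashlib
from collections import defaultdict

N_FEATURES = 2 ** 20  # toy size; production models can use vastly larger spaces


def hashed_features(text):
    """Map tokens into a fixed-size feature space via the hashing trick."""
    feats = defaultdict(float)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % N_FEATURES
        feats[bucket] += 1.0
    return feats


def keyword_model(seed_terms):
    """Toy per-topic weight vector that fires on a handful of seed keywords."""
    return {bucket: 1.0 for bucket in hashed_features(seed_terms)}


# Hypothetical topic streams; real weights would be learned from labelled data.
topic_models = {
    "energy": keyword_model("opec crude barrel pipeline output"),
    "tech": keyword_model("iphone android chip semiconductor"),
}


def route(tweet, threshold=1.0):
    """Return the topic streams this tweet should be routed into."""
    feats = hashed_features(tweet)
    scores = {topic: sum(w.get(b, 0.0) * v for b, v in feats.items())
              for topic, w in topic_models.items()}
    return [topic for topic, s in scores.items() if s >= threshold]


print(route("OPEC agrees to cut crude output next quarter"))
```

The ‘simple is best’ point maps onto exactly this kind of model: a linear classifier over hashed features is cheap enough to evaluate on 500m tweets a day even when the feature space is enormous.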

We finished with networking, drinks and snacks (amply provided by our generous hosts) and I had a chance to catch up with a few old contacts and friends. Thanks to the organisers for a very interesting evening and the last event of this year for me – see you in 2016!

Enterprise Search Europe 2015 review – day 2 – Thu, 29 Oct 2015 – http://www.flax.co.uk/blog/2015/10/29/enterprise-search-europe-2015-review-day-2/

Not such an early start for me for Day 2 (I’d been up pretty late running the Meetup the night before) but I did manage to catch the very end of Findwise‘s presentation on their annual Enterprise Search and Findability Survey. This is a unique and valuable benchmark of the state of enterprise search – I urge you to read it, if for no other reason than to be optimistic about the fact that in 2015 nearly 50% of the organisations surveyed have a strategy for search and findability – compared to only 20% in 2012.

Sadly I missed COWI‘s talk on migrating from Autonomy to Sharepoint 2013 (as you might expect I would have asked why move from one closed source solution to another when open source options exist). I did however catch Kurt Kragh Sørenson of Intrateam talking about lessons learned from their Enterprise Search Community of Practice in Denmark and Sweden – one particular phrase that stood out for me was “If your colleagues have given up on your search function it will take a long time to re-establish trust in your search function again”. Next was Anita Wilcox of University College Cork with a talk on their implementation of an open source system, reSearcher, which I hadn’t heard of before. She also talked about a federated search built using exploreIT from Deep Web Technologies and added that one should focus on developing a minimum viable product rather than lots of ‘nice to have’ features. Note that the library is named after George Boole, father of the Boolean logic used in most search engines to construct complex queries!

Next was the presentation of the Tony Kent Strix Award to Professor Peter Ingwersen, which started with an amusing tale of how it may be difficult to take a statue of an owl through airport security. After lunch, I had to step out for a meeting so missed a talk about the Port of Antwerp, but was glad to return to hear from Paul Cleverley of Robert Gordon University who has done some fascinating work on the ‘why’ of enterprise search and how to measure the impact of search. I’ll be using his research to inform my forthcoming presentation at Search Solutions next month.

The day finished with a ‘search clinic’ panel chaired by Valentin Richter of Raytion. I was very glad to hear Steve Woodward of AstraZeneca confirm that he can see a role for real-time analytics driven by search – confirming what I had said in my keynote the day before.

This year’s event was in my mind the best in terms of the content of the presentations – including some inspirational case studies from very large companies on how enterprise search can deliver better ways of working. I was also particularly pleased to see so many mentions of open source search software – back in 2011, at the first Enterprise Search Europe conference, this was still a relatively unknown option. Thanks as ever to the conference chair, Martin White, and Information Today for running the event.

Rebrands and changing times for Elasticsearch – Wed, 11 Mar 2015 – http://www.flax.co.uk/blog/2015/03/11/rebrands-and-changing-times-for-elasticsearch/

I’ve always been careful to distinguish between Elasticsearch (the open source search server based on Lucene) and Elasticsearch (the company formed by the authors of the former) and it seems someone was listening, as the latter has now rebranded as simply Elastic. This was one of the big announcements during their first conference, the other being that after acquiring Norwegian company Found they are now offering a fully hosted Elasticsearch-as-a-service (congratulations to Alex and others at Found!). As Ben Kepes of Forbes writes, this may be something to do with ‘managing tensions within the ecosystem’ (I’ve written previously on how this ecosystem is expanding to include closed-source commercial products, which may make open source enthusiasts nervous) but it’s also an attempt to move away from ‘search’ into a wider area encompassing the buzzwords du jour of Big Data Analytics.

In any case, it’s clear that Elastic (the company, and that’s hopefully the last time I’ll have to write this!) have a clear strategy for the future – to provide many different commercial options for Elasticsearch and its related projects for as many different use cases as possible. Of course, you can still take the open source route, which we’re helping several clients with at present – I hope to be able to present a case study on this very soon.

Meanwhile, Martin White has identified how a recent book on Elasticsearch describes literally hundreds of features and that ‘The skill lies in knowing which to implement given the nature of the content and the type of query that will be used’ – effective search, as ever, remains a difficult thing to get right, no matter what technology option you choose.

UPDATE: It seems that www.elasticsearch.org, the website for the open source project, is now redirecting to the commercial company website… There is now a new GitHub page for open source code at https://github.com/elastic

A review of Stephen Arnold’s CyberOSINT & Next Generation Information Access – Tue, 17 Feb 2015 – http://www.flax.co.uk/blog/2015/02/17/a-review-of-stephen-arnolds-cyberosint-next-generation-information-access/

Stephen Arnold, whose blog I enjoy due to its unabashed cynicism about overenthusiastic marketing of search technology, was kind enough to send me a copy of his recent report on CyberOSINT & Next Generation Information Access (NGIA), the latter being a term he has recently coined. OSINT itself refers to intelligence gathered from open, publicly available sources – nothing to do with software licenses – so yes, this is all about the NSA, CIA and others, who as you might expect are keen on anything that can filter out the interesting from the noise. Let’s leave the definition (and the moral questionability) of ‘publicly available’ aside for now – even if you disagree with its motives, this is a use case which can inform anyone with search requirements of the state of the art and what the future holds.

The report starts off with a foreword by Robert David Steele, who has had a varied and interesting career and lately has become a cheerleader for the other kind of open source – software – as a foundation for intelligence gathering. His view is that the tools used by the intelligence agencies ‘are also not good enough’ and ‘We have a very long way to go’. Although he writes that ‘the systems described in this volume have something to offer’ he later concludes that ‘This monograph is a starting point for those who might wish to demand a “full spectrum” solution, one that is 100% open source, and thus affordable, interoperable, and scalable.’ So for those of us in the open source sector, we could consider Arnold’s report as a good indicator of what to shoot for, a snapshot of the state of the art in search.

Arnold then starts the report with some explanation of the NGIA concept. This is largely a list of the common failings of traditional search platforms (basic keyword search, oft-confusing syntax, separate silos of information, lack of multimedia features and personalization) and how they might be addressed (natural language search, automatic querying, federated search, analytics). I am unconvinced this is as big a step as Arnold suggests, though: it seems rather to imply that all past search systems were badly set up and configured, and that somehow an NGIA system will magically pull everything together for you and tell you the answer to questions you hadn’t even asked yet.

Disappointingly the exemplar chosen in the next chapter is Autonomy IDOL: regular readers will not be surprised by my feelings about this technology. Arnold suggests the creation of the Autonomy software was influenced by cracking World War II codes, rock music and artificial intelligence, which is in my mind adding egg to an already very eggy pudding, and not in step with what I know about the background of Cambridge Neurodynamics (Autonomy’s progenitor, created very soon after – and across the corridor from – Muscat, another Cambridge Bayesian search technology firm where Flax’s founders cut their teeth on search). In particular, Autonomy’s Kenjin tool – which automatically suggested related documents – is identified as a NGIA feature, although at the time I remember it being reminiscent of features we had built a year earlier at Muscat – we even applied for a patent. Arnold does note that ‘[Autonomy founder, Mike] Lynch and his colleagues clamped down on information about the inner workings of its smart software.’ and ‘The Autonomy approach locks down the IDOL components.’ – this was a magic black box of course, with a magically increasing price tag as well. The price tag rose to ridiculous dimensions (even after an equally ridiculous writedown) when Hewlett Packard bought the company.

The report continues with analysis of various other potential NGIA contenders, including Google-funded timeline analysis specialists Recorded Future and BAE Detica – interestingly one of the search specialists from this British company has now gone on to work at Elasticsearch.

The report concludes with a look at the future, correctly identifying advanced analytics as one key trend. However this conclusion also echoes the foreword: ‘The cost of proprietary licensing, maintenance, and training is now killing the marketplace. Open source alternatives will emerge, and among these may be a 900 pound gorilla that is free, interoperable and scalable.’ Although I have my issues with some of the examples chosen, the report will be very useful, I’m sure, to those in the intelligence sector, who like many are still looking for search that works.

Elasticsearch London user group – The Guardian & Orchestrate test the limits – Tue, 16 Dec 2014 – http://www.flax.co.uk/blog/2014/12/16/elasticsearch-london-user-group-the-guardian-orchestrate-test-the-limits/

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third-party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but that system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch, with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process, and they were keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in the majority of cases. Their eventual solution for version upgrades (and even simpler configuration changes) was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.
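
That blue/green switchover (build a new cluster, verify it, repoint DNS, retire the old one) is straightforward to automate because cluster state is exposed over Elasticsearch’s REST API. A minimal readiness check might look like the sketch below – `_cluster/health` is the standard endpoint, while the host name and the `switch_dns` hook are hypothetical placeholders, not the Guardian’s actual tooling:

```python
import json
import time
import urllib.request

NEW_CLUSTER = "http://new-cluster.internal:9200"  # hypothetical new cluster address


def cluster_status(base_url):
    """Read the cluster status colour from the _cluster/health endpoint."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health") as resp:
        return json.load(resp)["status"]  # "green", "yellow" or "red"


def wait_until_green(base_url, timeout=600, interval=10):
    """Poll the new cluster until it reports green, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if cluster_status(base_url) == "green":
            return True
        time.sleep(interval)
    return False


def switch_dns():
    """Placeholder: repoint the service's DNS record at the new cluster."""
    print("DNS switched to", NEW_CLUSTER)


if __name__ == "__main__":
    if wait_until_green(NEW_CLUSTER):
        switch_dns()
    else:
        raise SystemExit("New cluster never went green; keeping the old one in place")
```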

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you to submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it by creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations, I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.
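
The core of the tuplewise idea is easy to show: rather than letting Elasticsearch infer a new mapping for every novel field name, each field is flattened into a (name, typed value) pair stored under a small, fixed set of keys, so the mapping never grows. The sketch below is my own reconstruction of what was described – the key names are illustrative, not Orchestrate’s actual schema:

```python
def tuplewise(doc):
    """Flatten a free-form JSON document into (field name, typed value) pairs."""
    pairs = []
    for name, value in doc.items():
        entry = {"name": name}
        if isinstance(value, bool):  # check bool before int: bool is a subclass of int
            entry["bool_value"] = value
        elif isinstance(value, int):
            entry["long_value"] = value
        elif isinstance(value, float):
            entry["double_value"] = value
        else:
            entry["string_value"] = str(value)
        pairs.append(entry)
    return {"fields": pairs}


print(tuplewise({"title": "Schemalessness Gone Wrong", "views": 42, "rating": 4.5}))
```

This is also where the relevance caveat comes from: every string value in the corpus now lives in the same underlying field, so term statistics are pooled across what were previously separate fields.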

We finished up with the usual Q&A, which this time featured some hard questions for the Elasticsearch team to answer – for example, why they have rolled their own distributed configuration system rather than using the proven ZooKeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now that Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product; rather, it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr, which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

More than an API – the real third wave of search technology – Tue, 18 Nov 2014 – http://www.flax.co.uk/blog/2014/11/18/more-than-an-api-the-real-third-wave-of-search-technology/

I recently read a blog post by Karl Hampson of Realise Okana (who offer HP Autonomy and SRCH2 as closed source search options) on his view of the ‘third wave’ of search. The second wave he identifies (correctly) as open source, admitting somewhat grudgingly that “We’d heard about Lucene for years but no customers seemed to take it seriously until all of a sudden they did”. However, he also suggests that there is a third wave on its way – and this is led by HP with its IDOL OnDemand offering.

I’m afraid to say I think that IDOL OnDemand is in fact neither innovative nor market-leading – it’s simply an API to a cloud-hosted search engine and some associated services. Amazon CloudSearch (originally backed by Amazon’s own A9 search engine, but more recently based on Apache Solr) offers a very similar thing, as do many other companies including Found.no and Qbox with an Elasticsearch backend. For those with relatively simple search requirements and no issues with hosting their data with a third party, these services can be great value. It is however interesting to see the transition of Autonomy’s offering from a hugely expensive license fee (plus support) model to an on-demand cloud service: the HP acquisition and the subsequent legal troubles have certainly shaken things up! At a recent conference I heard an HP representative even suggest that IDOL OnDemand is ‘free software’, which sounds like a slightly desperate attempt to jump on the open source bandwagon and attract some hacker interest without actually giving anything away.

So if a third wave of search technology does exist, what might it actually be? One might suggest that companies such as Attivio or our partners Lucidworks, with their integrated solutions built on proven and scalable open source cores, folding in Hadoop and other Big Data stacks, are surfing pretty high at present. Others such as Elasticsearch (the company) are offering advanced analytical capabilities and easy scalability. We hear about indexes of billions of items and thousands of separate indexes: the scale of some of these systems is incredible, and only economically possible where license fees aren’t a factor. Among our own clients we’re seeing searches across huge collections of complex biological data and monitoring systems handling a million new stories a day. Perhaps the third wave of search hasn’t yet arrived – we’re just seeing the second wave continue to flood in.

One interesting potential third wave is the use of search technology to handle even higher volumes of data (which we’re going to receive from the Internet of Things apparently) – classifying, categorising and tagging streams of machine-generated data. Companies such as Twitter and LinkedIn are already moving towards these new models – Unified Log Processing is a commonly used term. Take a look at a recent experiment in connecting our own Luwak stored query library to Apache Samza, developed at LinkedIn for stream processing applications.
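
Luwak itself is a Java library built on Lucene, but the ‘reverse search’ idea behind it – register the queries up front, then stream each incoming document past them – can be sketched in a few lines. The toy version below uses simple required-term predicates rather than Luwak’s real parsed queries and presearcher optimisation:

```python
class StoredQueryMatcher:
    """Toy 'reverse search': register queries once, then match each incoming document."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, required_terms):
        self.queries[query_id] = {t.lower() for t in required_terms}

    def match(self, document):
        """Return the ids of all stored queries satisfied by this document."""
        tokens = set(document.lower().split())
        return [qid for qid, terms in self.queries.items() if terms <= tokens]


monitor = StoredQueryMatcher()
monitor.register("energy-alert", ["opec", "output"])
monitor.register("chip-alert", ["semiconductor"])

stream = ["OPEC cuts output again", "New semiconductor fab announced in Dresden"]
for message in stream:
    print(message, "->", monitor.match(message))
```

In a stream-processing deployment each message from the log would be passed through such a monitor, with matches emitted onto downstream streams for alerting or further processing.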

Analysts getting a bad press – how can they do better? – Wed, 30 Jul 2014 – http://www.flax.co.uk/blog/2014/07/30/analysts-getting-a-bad-press-how-can-they-do-better/

It seems to be a bad summer for analyst companies in several sectors: here’s Forrester getting a kicking from Digital Clarity Group about their Wave report on Digital Experience Delivery Platforms (my first challenge was understanding what on earth those are, but I think it’s a new shiny name for web content management), Nuix putting the boot into Gartner about their eDiscovery Magic Quadrant, and Stephen Few jumping up and down in hobnail boots on both analyst firms about Business Intelligence (insert your own joke here), complete with a not particularly enlightening reply from Forrester themselves.

Miles Kehoe has already taken a look at Gartner’s Magic Quadrant report on our own Enterprise Search sector. I’ve written before on how I don’t think open source solutions are particularly well treated by the large analyst firms, as they often focus on vendors only. The world has somewhat changed though and five of the seventeen vendors mentioned are using a base of open source technology, so at least some of this major part of the market is covered.

However the problem remains that the MQ ignores a great deal of the enterprise search sector: it doesn’t cover SharePoint with its FAST-derived search facility, Oracle’s Endeca (which apparently is now no longer available as a standalone product, a surprise to me), Funnelback (which is again incorrectly labelled as open source – it’s the Squiz CMS software that’s open source, not the search engine they bought) or the rising star of Elasticsearch. If you were new to the sector you might conclude that none of these options are available to you. Gartner itself says “This Magic Quadrant introduces search managers and information architects in end-user organizations to the range of enterprise search vendors they can choose from” – but this range is severely and artificially restricted.

Let’s hope that the analyst firms take note of some of this bad press – perhaps it’s time to change approach, be more open about biases and methodologies, and stop producing hugely oversimplified diagrams to characterise complex and deep business sectors.
