google – Flax

Time to replace your Google Search Appliance with open source search

Charlie Hull — Tue, 09 Feb 2016 14:25:02 +0000

As many others have noted, Google have recently announced their Google Search Appliance (GSA) will not be available for sale from 2017. Search gurus Miles Kehoe and Martin White have written an insightful analysis of the move with some recommendations as to what to do – because your GSA will simply stop working once the 2-year license expires. I don’t agree with Laurent Fanichet of Sinequa that this “seals the end of the era of commoditized search” – the rapid rise of open source search options has been the main driver of this commoditization, not the GSA. The appliance was also not at all cheap to run for large collections as Stephen Arnold has often noted.

So if you’re unfortunate enough to be one of the possibly thousands of GSA users worldwide, what next? There are a number of alternative search appliances and cloud solutions of course, but you should also consider insulating yourself from any future shocks by taking ownership of your search solution with a fully open source stack. Apache Lucene/Solr and Elasticsearch are great starting points for this, and you won’t be the first one to make the change (here’s an article from 2011 on Why a project switched from Google Search Appliance to Zend_Lucene which also lists some further problems with the GSA). A great advantage of open source is that you won’t be subject to the whims or market success/failure of a vendor – you have a lot more control.

I know of at least one (very) large government agency in the UK, who having limped along for a long while with the FAST ESP search engine (yes, it’s still out there, over five years after Microsoft signalled it wouldn’t be supported) were considering the GSA as a replacement. It may be time for them to consider other options!

As ever, we’re happy to help anyone considering a migration – just get in touch.

The post Time to replace your Google Search Appliance with open source search appeared first on Flax.

Search Solutions 2014 – Is semantic search finally here?

Charlie Hull — Thu, 04 Dec 2014 14:07:20 +0000

Last week I attended one of my favourite annual search events, Search Solutions, held at the British Computer Society’s base in Covent Garden. As usual this is a great chance to see what’s new in the linked worlds of web, intranet and enterprise search and this year there was a focus on semantic search by several of the presenters.

Peter Mika of Yahoo! started us off with a brief history of semantic search, including how misplaced expectations have led to a general lack of adoption. However, the large web search companies have made significant progress over the years, leading to shared standards for semantic markup of web content and some large collections of knowledge, which allow them to display content for certain queries, e.g. actors’ biographies shown on the right of the usual search results. He suggested the next step is to better understand queries, as most of the work to date has been on understanding documents. Christopher Semturs of Google followed with a description of their efforts in this space: Google’s Knowledge Graph contains 40 billion facts about 530 million entities, built in part by converting web pages directly (he noted that some badly structured websites can contain the most interesting and rare knowledge). He reminded us of the importance of context and showed some great examples of queries that are still hard to answer correctly. Katja Hofmann of Microsoft then described some ways in which search engines might learn directly from user interactions, including some wonderfully named methodologies such as Counterfactual Reasoning and the Contextual Bandit. She also mentioned their continuing work on Learning to Rank with the open source Lerot software.

Next up was our own Tom Mortimer presenting our study comparing the performance of Apache Solr and Elasticsearch – you can see his slides here. While there are few differences, Tom has found that Solr can support three times the query rate. Iadh Ounis of the University of Glasgow followed, describing another open source engine, Terrier, which although mainly focused on academic research does now contain some cutting edge features including the aforementioned Learning to Rank and near real-time search.

The next session featured Dan Jackson of UCL describing the challenges of building website search across a complex set of websites and data, a similar talk to one he gave at an earlier event this year. Next was our ex-colleague Richard Boulton describing how the Gov.uk team use metrics to tune their search capability (based on Elasticsearch). Interestingly most of their metric data is drawn from Google Analytics, as a heavy use of caching means they have few useful query logs.

Jussi Karlgren of Gavagai then described how they have built a ‘living lexicon’ of text in several languages, allowing for the representation of the huge volume of new terms that appear on social media every week. They have also worked on multi-dimensional sentiment analysis and visualisations: I’ll be following these developments with interest as they echo some of the work we have done in media monitoring. Richard Ranft of the British Library then showed us some of the ways search is used to access the BL’s collection of 6 million audio tracks including very early wax cylinder recordings – they have so much content it would take you 115 years to listen to it all! The last presentation of the day was by Jochen Leidner of Thomson Reuters who showed some of the R&D projects he has worked on for data including legal content and mining Twitter for trading signals.

After a quick fishbowl discussion and a glass of wine the event ended for me, but I’d like to thank the BCS IRSG for a fascinating day and for inviting us to speak – see you next year!


Search Solutions 2013, a review

Charlie Hull — Thu, 28 Nov 2013 14:25:18 +0000

Yesterday was the always interesting Search Solutions one day conference held by the BCS IRSG in London, a mix of talks on different aspects of search. The first presentation was by Behshad Behzadi of Google on Conversational Search, where he showed a speech-capable search interface that allowed a ‘conversation’ with the search engine – context being preserved – so the query “where are Italian restaurants in Chelsea” followed by “no I prefer Chinese” would correctly return results about Chinese restaurants. The demo was impressive and we can expect to see more of this kind of technology as smartphone adoption rises. Wim Nijmeijer of Coveo followed with details of how their own custom connectors to a multitude of repositories could enable Complex enterprise search delivered in a day. This of course assumes that no complex mapping of fields or schemas from the source to the search engine index is necessary, which I suspect it often is – I’m not alone in being slightly suspicious of the supposed timescale. Nikolaos Nanas from Thessaly in Greece then presented on Adaptive Information Filtering: from theory to practise which I found particularly interesting as it described filtering documents against a user’s interest with the latter modelled by an adaptive, weighted network – he showed the Noowit personalised magazine application as an example. With over 1000 features per user and no language specific requirements this is a powerful idea.

After a short break we continued with a talk by Henning Rode on CV Search at TextKernel. He described a simple yet powerful UI for searching CVs (resumes) with autosuggest and automatic field recognition (type in “Jav” and the system suggests “Java” and knows this is a programming language or skill). He is also working on systems to autogenerate queries from job vacancies using heuristics. We’ve worked in the recruitment space ourselves so it was interesting to hear about their approach, although the technical detail was light. Following Henning was Dermot Frost talking about Information Preservation and Access at the Digital Repository of Ireland and their use of open source technology including Solr and Blacklight to build a search engine with a huge variety of content types, file formats and metadata standards across the items they are trying to digitally preserve. Currently this is a relatively small collection of data but they are planning to scale up over the next few years: this talk reminded me a little of last year’s by Emma Bayne of the UK’s National Archives.

After lunch we began a session named Understanding the User, beginning with Filip Radlinski of Microsoft Research. He discussed Sensitive Online Search Evaluation (with arXiv.org as a test collection) and how interleaving results from two rankers is a powerful technique for avoiding bias. Next was Mounia Lalmas of Yahoo! Labs on what makes An Engaging Click (although unfortunately I had to pop out for a short while so I missed most of what I am sure was a fascinating talk!). Mags Hanley was next on Understanding users search intent with examples drawn from her work at TimeOut – the three main lessons being to know the content in context, the time of year and the users’ mental model in context. Interestingly she showed how the most popular facets used differed across TimeOut’s various international sites – in Paris the top facet was perhaps unsurprisingly ‘cuisine’, in London it was ‘date’.
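To make the interleaving idea concrete, here is a minimal sketch of one well-known variant, team-draft interleaving. This is an illustration under assumptions, not necessarily the method shown in the talk: the function names, the document ids and the "run until both rankings are exhausted" behaviour are all choices made for this example.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings into one list, remembering which ranker
    ('A' or 'B') contributed each document."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    total = len(set(ranking_a) | set(ranking_b))

    while len(interleaved) < total:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] < picks["B"]:
            team = "A"
        elif picks["B"] < picks["A"]:
            team = "B"
        else:
            team = rng.choice(["A", "B"])

        # Take that team's highest-ranked document not yet shown.
        candidate = next((d for d in rankings[team] if d not in team_of), None)
        if candidate is None:
            # This team has nothing new left, so the other team picks.
            team = "B" if team == "A" else "A"
            candidate = next(d for d in rankings[team] if d not in team_of)

        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1

    return interleaved, team_of


def credit_clicks(team_of, clicked_docs):
    """Count clicks per team; the ranker with more clicks is preferred."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins
```

Because each user sees a single merged list and never knows which ranker supplied which result, position bias affects both rankers equally, and the click counts from credit_clicks can be compared directly.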

After another short break we continued with Helen Lippell’s talk on Enterprise Search – how to triage problems quickly and prescribe the right medicine – her five main points being analyze user needs, fix broken content, focus on quick wins in the search UI, make sure you are able to tweak the search engine itself in a documentable fashion and remember the importance of people and process. Her last point ‘if search is a political football, get an outsider perspective’ is of course something we would agree with! Next was Peter Wallqvist of Ravn Systems on Universal Search and Social Networking where he focussed on how to allow users to interact directly with enterprise content items by tagging, sharing and commenting – so as to derive a ‘knowledge graph’ showing how people are connected by their relationships to content. We’ve built systems in the past that have allowed users to tag items in the search result screen itself so we can agree on the value of this approach. Our last presenter was Kristian Norling of Findwise on Reflections on the 2013 Enterprise Search Survey – some more positive news this year, with budgets for search increasing and 79% of respondents indicating that finding information is of high importance for their organisation. Although most respondents still have less than one full time staff member working on search, Kristian made the very good point that recruiting just one extra person would thus give them a competitive advantage. Perhaps, as he says, we’ve now reached a tipping point for the adoption of properly funded enterprise search, regarded as an ongoing journey rather than a ‘fire and forget’ project.

The day finished with a ‘fishbowl’ session, during which there was a lot of discussion of how to foster links between the academic IR community and industry, then the BCS IRSG AGM and finally a drinks reception – thanks to all the organisers for a very interesting and enlightening day and we look forward to next year!


The trouble with tabbing: editing rich text on the Web

Charlie Hull — Thu, 08 Aug 2013 08:36:16 +0000

Matt Pearce, who joined the Flax team earlier this year, writes:

A recent client wished to convert documents to and from Microsoft Office formats, using a web form as an intermediate step for editing the content. The documents were read in, imported to a Solr search engine, and could then be searched over, cloned, edited and transformed in batches, before being exported to Office once more.

The content itself was broken down into fields, some of which were simple text or date entry boxes, while others were more complex rich text fields. We opted to use TinyMCE as our rich text editor of choice – it’s small, open source, and easy to extend (we already knew we wanted to write at least one plugin).

The problem arose when the client explained to us that they wanted to use the tab key in rich text fields to create consistent spacing in the text. These needed to display as closely as possible to the original document format, and convert to actual tabs in the Office documents. This presented a number of problems:
- By default, the tab key moves the user to the next field on a web page, and needs special handling to prevent this behaviour, especially when it only needs to be applied to certain fields on the page.
- The spacing had to be consistent, like a word processor’s tab stop. This is tricky when working with proportional fonts, especially in a web form.
- The client didn’t want to use an indent feature. The tab only came at the start of the paragraph – beyond that point the text could wrap around to the start of the line.
- The tab needed to be recognisable in our processing code, so it could be converted to a real tab when it was exported to MS Office.

The preferred solution would have been a document editor like that used for Google Docs. Unfortunately, we didn’t have the time to write the whole input and presentation layer in Javascript as Google have! We also wanted to keep the editing function inside the web application if possible, rather than forcing the user to edit the documents in Microsoft Office and then re-import them every time they needed to make changes.

I started with TinyMCE’s “nonbreaking” plugin, which captures the tab key and converts it to a number of non-breaking spaces. This wasn’t directly suitable for our needs – I discovered that the number of spaces is not always consistent, and they are sometimes converted to regular (rather than non-breaking) spaces. In addition, it doesn’t act like a tab stop – it inserts four spaces wherever you are on the line, which didn’t match the client’s requirement.

I adapted the plugin to insert a <span> into the text, using variable padding to ensure it was the right width. This worked reasonably well, after a not insignificant amount of head scratching trying to work around issues with spacing and space handling. Unfortunately, we struck usability problems when trying to backspace over the tab. The ideal situation would be that a single backspace would remove the entire tab, leaving the user at the start of the line (or the point before they hit the tab key). In fact, a single backspace would leave the user inside the span – two backspaces were required to visibly remove the tab from the editor, and the user could not tell that they were inside the span either. You couldn’t reliably select the “tab” with the mouse either. In addition, Firefox started to behave oddly at this point, putting the cursor in unexpected positions.

My final solution was ugly but workable. We switched to using a monospace font in the rich text editor and, after discussion with the client, started using a variable number of arrow characters to represent the tabs (we actually used ›, or a closing single quote, if you are reading and writing in German). This made life immediately simpler – dropping the proportional font meant that we didn’t have to worry about getting the width right, just the number of characters to insert. It does mean that in order to remove the tab, the user has to backspace over up to four characters, but the characters are clearly visible: you don’t find yourself inside a span that can’t be seen without viewing the underlying HTML.
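The arithmetic behind the monospace approach can be sketched as follows. This is an illustration only, not the code we shipped: the tab width of four is inferred from the "up to four characters" above, and the helper names (tab_fill, insert_tab, export_line) are hypothetical. It also assumes each line is handled separately and that every other character occupies exactly one column.

```python
import re

TAB_WIDTH = 4           # assumed tab-stop interval
TAB_MARKER = "\u203a"   # the '›' placeholder character


def tab_fill(column, tab_width=TAB_WIDTH):
    """Return the run of markers that advances from `column`
    to the next tab stop (between 1 and tab_width characters)."""
    return TAB_MARKER * (tab_width - column % tab_width)


def insert_tab(line, tab_width=TAB_WIDTH):
    """Append a visible tab placeholder to the end of `line`."""
    return line + tab_fill(len(line), tab_width)


def export_line(line, tab_width=TAB_WIDTH):
    """Convert each run of markers in a single line back to real tabs,
    one tab per tab stop the run crosses."""
    def to_tabs(match):
        first_stop = match.start() // tab_width
        last_stop = -(-match.end() // tab_width)  # ceiling division
        return "\t" * (last_stop - first_stop)

    return re.sub(TAB_MARKER + "+", to_tabs, line)
```

With a proportional font none of this works, because the column count no longer tells you how wide the text is; that is exactly the problem the switch to monospace removed.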

While I’m sure this isn’t a unique problem, I couldn’t find anyone else that had been trying to do something similar. I am also not sure whether our choice of rich text editor affected how tricky this problem turned out to be. If anybody reading has suggestions of better approaches to this, we’d be interested to hear from them.


The death of enterprise search is reported, again

Charlie Hull — Thu, 25 Oct 2012 08:39:42 +0000

There’s no doubt that the search market has been in turmoil for many months now: traditional, closed source vendors are either frantically repositioning to avoid the ‘juggernaut that is Apache’s Solr/Lucene project’ or attempting to bore customers to death with Powerpoint. Our sources tell us that in the UK at least, sales of most closed source search engines have flatlined – not at all surprising when freely available alternatives exist. Luckily there are some parts of the sector with some energy: Attivio (with $34m of new funding to spend) and Lucidworks are still working hard on their search products, but even these rely heavily on an open source core.

Enter a company without any history or experience in the search market, Huddle, with a tired message about the death of Enterprise Search. I’m not entirely sure what the point of this article is, but apparently the lack of contextual information is the problem – “You have to do research in 50 places — email, Web, C-drives, the cloud, even inside people’s heads.”. I look forward to a brain-compatible indexing tool! There’s also the mistaken assumption that what works for the wider consumer-focused Web will work for the enterprise – Amazon.com, Google and the iPad/iPhone are all namechecked. Enterprise data simply isn’t like web or consumer data – it’s characterised by rarity and unconnectedness rather than popularity and context.

Unfortunately in most enterprises simply sprinkling on social or collaborative features will not fix the most common search problems: a mishmash of unconnected legacy systems, unreliable and inconsistent metadata, a complex and untested security model (at least within the context of being able to search for everything, for example your boss’s salary) and usually the lack of a dedicated team responsible for search. Enterprise Search is hard and few projects get beyond basic indexing of filestores and databases, let alone adding in more people-focused features.

I couldn’t find much about search on Huddle’s website, but what I did find implied that information must first be extracted from existing legacy systems and stored centrally. If you can manage this, preserving a consistent metadata model, coping with legacy formats, preserving full security and coping with updates then search should be relatively simple to implement on the resulting central store; however the devil is as ever in the detail.


Google Search Appliance version 7 – too little too late?

Charlie Hull — Wed, 10 Oct 2012 12:26:35 +0000

Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one though).

Although there’s a raft of new features, most of them have been introduced by the GSA’s competitors over the last few years or are available as open source (entity recognition or document preview for example). The GSA is also not a particularly cheap option as commentators including Stephen Arnold have noticed: we’ve had clients tell us of six-figure license fees for reasonably sized collections of a few million documents – and that’s for two years, after which time you have to buy it again. Not surprisingly some people have migrated to other solutions.

However there’s another question that seems to have been missed by Google’s strategists: how a physical appliance can compete with cloud-based search. I can’t think of a single prospective client over the last year or so who hasn’t considered this latter option on both cost and scalability grounds (and we’ll shortly be able to talk about a very large client who have chosen this route). Although there may well be a hard core of GSA customers who want a real box in reassuring Google yellow, one wonders why Google haven’t considered a ‘virtual’ GSA to compete with Amazon’s CloudSearch amongst others.

It will be interesting to see if this version of the GSA is the last…


Enterprise Search Europe 2012 – Big Data, search surveys and some FUD from Google

Charlie Hull — Wed, 06 Jun 2012 14:50:14 +0000

I visited Enterprise Search Europe for the first day only last week, and caught a number of the presentations as well as giving one of my own (which I won’t discuss here but you’ll hear more about over the next few weeks). First up was Paul Doscher of Lucid Imagination with a lively presentation discussing whether search is either dead or now a commodity, or whether search on Hadoop is the new killer app for the emerging world of Big Data. We then had Kristian Norling from Findwise with some initial results from their survey on enterprise search – some interesting numbers here such as ‘18.5% of users are mostly/very satisfied with search’ and only ‘6% have a search strategy although 46% are planning one’ – we hear that Kristian is hoping to make the survey an annual one, which will be a great resource for anyone in the industry.

Matt Mullen, fuelled by diet cola, gave an introduction to search with a key point – that enterprise search usually performs a role within a workflow or task – a fact often ignored. Runar Buvik of Searchdaimon talked about a great resource he has developed comparing search engines, which can give some often amusing contrasts between different technologies, with some insisting there are no results for a particular query while others find thousands. I also enjoyed Emma Bayne and Donald Phillips’ polished presentation on the search facilities at the National Archives – interestingly although Autonomy is currently powering their search they are considering open source alternatives.

The day concluded with a presentation from Matt Eichner of Google, who turned up with their own film crew. You can read much of what he said at Computer World. I’m afraid I didn’t enjoy this presentation very much – it talked down to the audience and contained a lot of FUD around open source (surprising when Google uses and supports so much of it) – complete with sympathy-garnering pictures of babies in incubators and silly analogies about how one should prefer to fly in the airplane that cost the most. I hadn’t realised until his talk that the Google Search Appliance appears to be made of cheese!

It was great to network and catch up, and I hope next year to be able to attend the whole event. Thanks to all the organisers especially Martin White of Intranet Focus.


Another powerful API based on Solr launches, searching more patents than Google

Charlie Hull — Fri, 07 Oct 2011 11:57:20 +0000

Our customer Cambridge Intellectual Property announced yesterday their new API for a collection of 55 million patents – 48 million more than Google Patents. It’s great to see a Cambridge company innovating in this space, especially as the service is powered by Apache Solr (we’ve given them some small assistance with configuring and tuning this software over the last few months).

The API, available on the Boliven website, offers a REST-based service and returns patent data in JSON or XML – so users can easily integrate patent data with their own applications. It can also return PDFs or summaries of the selected patents. In addition, the API will allow users to search and query Boliven’s database of 45+ million science literature documents including journal publications and medical device trials. That’s around 100 million items in total.

Like the Guardian’s Open Platform which I wrote about previously, this is a great example of open source search technology as a platform for new delivery methods – showing how effective (and economical) it can be at this large scale.

It didn’t take me long to find my own small contribution to the patent landscape.


Open source search evening – ElasticSearch, Xapian and GSoC

Charlie Hull — Wed, 04 May 2011 12:42:35 +0000

Last night there was a small gathering in Cambridge of open source search engine developers and enthusiasts. Richard Boulton hosted the event and began with an introduction to elasticsearch, which is an “Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene”. Richard told us about how this system attempts to make prototyping and building search systems easier by automatically guessing data schemas, offering a powerful, hierarchical ‘query language’ and automatically distributing the search load. Richard’s conclusions were that although elasticsearch is not as mature as Apache Solr it is certainly a project to consider: however development is rapid and documentation is not easy to find. We’ll watch this project with interest.

Olly Betts next told us about various Xapian projects running as part of this year’s Google Summer of Code; this led into a discussion of Learning to Rank and how this might be implemented in practical terms. It’s great to see these cutting-edge features being added to an open source project.

Thanks to Richard for organising the evening and to all who came.


ECIR 2011 Industry Day – part 1 of 2

Charlie Hull — Wed, 27 Apr 2011 09:55:51 +0000

As promised here’s a writeup of the day itself. I’ve split this into two parts.

The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here although it may be a while before we see practical applications in enterprise search.
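For readers who haven’t met the idea, learning to rank with gradient descent can be sketched in a few lines. This is not the ‘recipe’ from the talk: it is a generic pairwise approach with a linear scoring function and a logistic loss, and the feature vectors, learning rate and function names are assumptions made purely for illustration.

```python
import math


def score(weights, features):
    """Linear relevance score for one result."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(pairs, n_features, learning_rate=0.1, epochs=200):
    """Learn weights from (better, worse) pairs of feature vectors so
    that the better result of each pair ends up with the higher score."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            margin = score(weights, diff)
            # Loss for one pair is log(1 + exp(-margin)); its gradient
            # with respect to the weights is -diff / (1 + exp(margin)).
            step = learning_rate / (1.0 + math.exp(margin))
            weights = [w + step * d for w, d in zip(weights, diff)]
    return weights


# Hypothetical training data: each vector is (title match, body match).
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([0.8, 0.5], [0.3, 0.4]),
    ([0.9, 0.1], [0.2, 0.8]),
]
weights = train_pairwise(pairs, n_features=2)
```

Once trained, sorting any set of results by score(weights, features) gives the learned ranking, and the same scores could in principle be used to merge result lists from several related queries.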

Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.

Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.

After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t repeat his slides again here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.

We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.
