The post Why GCloud search is badly broken & how to fix it appeared first on Flax.
]]>Unfortunately the Cloudstore itself has a search facility that is badly broken. There are several obvious issues: many of the entries created by the larger suppliers have been keyword stuffed – here’s a particularly egregious example from Atos which seems to include most of the terms used in software in the last few years. I found this using the search terms ‘enterprise search’ which produces very few relevant looking results. The online guidance for CloudStore search suggests putting double quotes around my terms (sadly I think few users will think of this) which improves things a little but there are still a lot of irrelevant results – an online conferencing system is fifth for example.
Fortunately all is not lost and in the next iteration of GCloud we are promised major improvements to the search engine. I’m hoping this will include phrase boosting. However, if the big SIs and others are allowed to create the sort of bad-quality content I have shown above, no search engine in the world will be able to sort the wheat from the chaff. It is essential that CloudStore entries are subject to some kind of curation and that keyword stuffing is banned and/or heavily penalised, otherwise SMEs like ourselves will still find it very hard to compete with the big SIs.
Update: it seems there is a new system under construction, and the search works a lot better. Let’s hope it comes out of alpha soon and can be used by purchasers!
The post Why GCloud search is badly broken & how to fix it appeared first on Flax.
]]>The post Better search for e-petitions – handling misspelled content with a Solr phonetic filter appeared first on Flax.
]]>The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, and maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.
The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former was using the standard Sunspot field mappings, which is fine for many applications but in this case was not allowing flexibility of weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.
Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.
What we came up with was two levels of fields: the first being normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding “MJ” etc.) The exact fields should provide precision, the phonetic fields, retrieval. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface and removed the custom query string construction from the client code.
In practice, this worked very well – the new server can handle search loads 5 times or greater compared with the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” return relevant petitions!
Phonetic algorithms are never going to catch all misspellings, and had Solr 4.0 been released at this time (with its very fast fuzzy engine) then it would have been the obvious approach to try. However, for now the search is much better, in less than 2 days of effort.
The post Better search for e-petitions – handling misspelled content with a Solr phonetic filter appeared first on Flax.
]]>The post Cambridge Search Meetup review – Two different kinds of university search appeared first on Flax.
]]>Udo Kruschwitz was next with a brief treatment of his ongoing work on automatically extracting domain knowledge and using this to improve search results (for example see the ‘Suggestions’ on the University of Essex website) – he showed us some of the various methods his team have tried to analyze query logs, including Ant Colony Optimisation (modelling ‘trails’ of queries that can be reinforced by repeat visits, or ‘fade’ over time as they are less used). I liked the concept of developing a ‘community’ search profile where individual search profiles are hard to obtain – and how this could be simply subdivided (so for example searchers from inside a university might have a different profile to those outside). The key idea here is that all these techniques are automatic, so the system is continually evolving to give better search suggestions and hints. Udo and his team are soon to release an open source adaptive search framework to be called “Sunny Aberdeen” which we look forward to hearing about.
The evening ended with networking and a pint or two in traditional fashion – thanks to both our speakers and to all who came, from as far afield as Milton Keynes, Essex and Luton. The group now has 70 members and we’re building an active and friendly local community of search enthusiasts.
The post Cambridge Search Meetup review – Two different kinds of university search appeared first on Flax.
]]>The post Website Redesign appeared first on Flax.
]]>Of course, there are sure to be teething problems – if you find anything that doesn’t work do let us know!
The post Website Redesign appeared first on Flax.
]]>The post Migrating from lemurconsulting.com appeared first on Flax.
]]>The post Migrating from lemurconsulting.com appeared first on Flax.
]]>The post More technical details now available appeared first on Flax.
]]>The post More technical details now available appeared first on Flax.
]]>