The post FIBEP WMIC 2015 – Open source search for media monitoring with Solr appeared first on Flax.
On the 25th I'll be presenting at the London Elasticsearch Usergroup with our client Westcoast, whom we have been helping with an Elasticsearch implementation. Westcoast are a B2B supplier of electronics and white goods with yearly revenues of over £1 billion, and we've helped them implement a powerful new search engine for their website. E-commerce is one sector where good search is an essential part of driving revenue.
Next, on the 26th I'll be talking at one of my favourite events of the year, the British Computer Society Information Retrieval Specialist Group's Search Solutions, on how we might improve the way search engine relevance is tested. I'll suggest a more formal process of test-based relevance tuning and show some useful tools. Our client NLA media access are also talking about the new Clipshare platform we built on Apache Lucene/Solr.
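To give a flavour of what test-based relevance tuning means in practice: keep a fixed set of judged queries and re-run them after every configuration change, failing the check if documents you know should rank highly drop out of the top results. A minimal, hypothetical Java sketch – the query strings, document ids and the searchTopTen function are invented for illustration and are not taken from the talk:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class RelevanceRegressionCheck {

    // Hypothetical judgement list: for each query, documents that must appear in the top ten.
    static final Map<String, Set<String>> JUDGEMENTS = Map.of(
            "replacement laptop battery", Set.of("doc-1042", "doc-2217"),
            "15 inch laptop",             Set.of("doc-0031"));

    // searchTopTen stands in for whichever engine is under test (Solr, Elasticsearch, ...),
    // returning the ids of its top ten results for a query string.
    static boolean allJudgementsSatisfied(Function<String, List<String>> searchTopTen) {
        return JUDGEMENTS.entrySet().stream()
                .allMatch(e -> searchTopTen.apply(e.getKey()).containsAll(e.getValue()));
    }
}
```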
Do let me know if you’re attending and would like to chat – I’ll also be publishing slides and more information about the projects above soon.
The post Talks: Replacing Autonomy IDOL with Solr, Elasticsearch for e-commerce & relevancy tuning appeared first on Flax.
When we visited clients, they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There's also a rise in clients considering applications and techniques beyond traditional site or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.
So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.
The post Eleven years of open source search appeared first on Flax.
As our client had a large investment in stored searches (each representing a client's interests) defined in the query language of their previous search engine, we first had to build a modified version of Apache Lucene that replicated this syntax exactly. I've previously blogged about how we did this. However, this wasn't the only challenge: search engines are designed to be good at applying a few queries to a very large document collection, not necessarily at applying tens of thousands of stored queries to every single new document. For media monitoring applications this kind of performance is essential, as there may be hundreds of thousands of news articles to monitor every day. The system we've built is capable of applying tens of thousands of stored queries every second.
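The post doesn't include code, but the basic pattern it describes – turning search on its head so that each new article is matched against every stored query – can be sketched with stock Lucene: index the single incoming article in an in-memory index and run the stored queries against it. This is only an illustrative sketch under assumptions (a single "body" field, the standard QueryParser); the production system described here used a modified Lucene supporting the previous engine's query syntax, plus further optimisations, rather than this naive loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class StoredQueryMatcher {

    private final Analyzer analyzer = new StandardAnalyzer();
    private final Map<String, Query> storedQueries = new LinkedHashMap<>();

    // Parse and cache each client's stored search once, up front.
    public void register(String queryId, String queryString) throws Exception {
        storedQueries.put(queryId, new QueryParser("body", analyzer).parse(queryString));
    }

    // Index a single incoming article entirely in memory, then run every
    // stored query against it; any query scoring above zero is a hit.
    public List<String> matchingQueries(String articleText) {
        MemoryIndex article = new MemoryIndex();
        article.addField("body", articleText, analyzer);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Query> entry : storedQueries.entrySet()) {
            if (article.search(entry.getValue()) > 0.0f) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Reaching tens of thousands of stored queries per second generally also means skipping queries that obviously cannot match a given article – for example by first checking which query terms actually occur in it – rather than running every query every time.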
With the rapid increase in the volume of content that media monitoring companies have to check for their clients – today’s news isn’t just in print, but online, in social media and indeed multimedia – it may be that open source software is the only way to build monitoring systems that are economically scalable, while remaining accurate and flexible enough to deliver the right results to clients.
The post Media monitoring with open source search – 20 times faster than before! appeared first on Flax.
There are several ways around this problem: in most cases you don't need to monitor every article for every client, as they will have told you they're only interested in certain sources (for example, a car manufacturer might want to keep an eye on car magazines and the reviews in the back page of the Guardian Saturday magazine, but doesn't care about the rest of the paper or about fashion magazines). However, pre-filtering queries in this way can be complex, especially when there are so many potential sources of data.
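To make the pre-filtering idea concrete: if each stored search carries the list of sources its client cares about, queries can be bucketed by source so that only the relevant bucket is run against each incoming article. A hedged sketch – the class and method names are invented for illustration, as the post describes the idea rather than an implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.search.Query;

public class SourceFilteredQueries {

    // Stored queries keyed by a publication the client has asked us to monitor.
    private final Map<String, List<Query>> queriesBySource = new HashMap<>();
    // Queries from clients who want to monitor every source.
    private final List<Query> allSources = new ArrayList<>();

    public void register(Query query, List<String> sources) {
        if (sources == null || sources.isEmpty()) {
            allSources.add(query);
            return;
        }
        for (String source : sources) {
            queriesBySource.computeIfAbsent(source, s -> new ArrayList<>()).add(query);
        }
    }

    // Only the queries relevant to this article's source need to be run against it.
    public List<Query> candidatesFor(String articleSource) {
        List<Query> candidates = new ArrayList<>(allSources);
        candidates.addAll(queriesBySource.getOrDefault(articleSource, List.of()));
        return candidates;
    }
}
```

As the post notes, keeping this mapping correct becomes complex when there are many potential sources, which is why a fast brute-force pass over all the queries is attractive.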
We've recently developed a method for searching incoming articles using a brute-force approach based on Apache Lucene, which in early tests is performing very well – around 70,000 queries applied to a single article in about a second on a standard MacBook. On suitable server hardware this would be even faster – and of course you have all the other features of Lucene potentially available, such as phrase queries, wildcards and highlighting. We're looking forward to developing some powerful – and economically scalable – media monitoring solutions based on this core.
The post Search backwards – media monitoring with open source search appeared first on Flax.
The day started with a talk by John Tait on the challenges of patent search where different units are concerned – where, for example, a search for a plastic with a melting point of 200°C wouldn't find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown, including a postcode-based browser. The system is based on Apache Solr, and they are also using 'big data' projects such as Apache Hadoop – which, by the sound of it, they're going to need, as they're expecting to index many more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.
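The max.recall plugin itself isn't described in detail here, but the underlying fix for the unit problem is to normalise every quantity to one canonical unit at index time and again at query time, so that a 200°C query also matches documents that specify the melting point in °F or Kelvin. A minimal, hypothetical sketch of that conversion step (not the plugin's actual code):

```java
public final class TemperatureNormalizer {

    // Convert a value in the stated unit to Kelvin, the canonical unit used
    // for both indexing and querying, so 200°C and 392°F index identically.
    public static double toKelvin(double value, String unit) {
        switch (unit) {
            case "K": return value;
            case "C": return value + 273.15;
            case "F": return (value - 32.0) * 5.0 / 9.0 + 273.15;
            default:  throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }
}
```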
Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.
After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.
The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently.) Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV programme recommendation systems, and we finished with Kristian Norling's talk on a healthcare information system he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked, amongst other things, what the main themes of the day had been – my own contribution being "everyone's using Solr!".
It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.
The post Search Solutions 2011 review appeared first on Flax.
We started with my presentation on "Searching news media with open source software", where I talked about our work for the NLA, Financial Times and others. We followed with John Snyder of Grapeshot on "Using Search to Connect Multiple Variants of An Object to One Central Object". John showed a Grapeshot project for Virgin where different media assets can be automatically grouped together even if they have different metadata – for example an episode of the TV show "Heroes" is basically the same object whether it is broadcast, video-on-demand or a repeat, but differs from the Bowie album of the same name.
We then broke up for discussion (and beer) – great to catch up with some ex-colleagues and meet others for the first time. Downstairs there was live music and one of our colleagues even joined the band for a spell on drums! From the feedback we received there's definitely interest in repeating the event – if you'd like to attend next time, please join the Meetup group.
The post Bicycles, beer and bands – the first Cambridge Enterprise Search Meetup appeared first on Flax.
We've been working with Durrants Ltd. of London for a while now on replacing their existing (closed source) search engine with a system built on open source. This project, which you can read more about in a detailed case study (PDF), has reduced the hardware requirements significantly and led to huge accuracy improvements (in some cases where 95% of the results passed through to human operators were irrelevant 'false positives', the new system is now 95% correct).
The new system is built on Xapian and Python and supports all the features of the previous engine, to ease migration – it even copes with errors introduced during automated scanning of printed news. It also scales easily and cost-effectively.
As far as we know this is one of the first large-scale media monitoring systems built on open source, and a great example of search as a platform, which we’ve discussed before.
The post Next-generation media monitoring with open source search appeared first on Flax.
Next is the British Computer Society's Search Solutions 2010 in London on October 21st, where I'm giving a presentation titled "What's the story with open source? – Searching and monitoring news media with open-source technology".
Both events feature a wide range of other speakers from organisations such as Cisco, LinkedIn, Twitter, Google and Microsoft.
The post Autumn events appeared first on Flax.
The NLA's ClipShare and ClipSearch services, which are powered by Flax, are good models for monetising newspaper content, and are already in use at some of the UK's largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage, you need scalable, highly accurate search technology. Users have been conditioned to expect search to 'just work', and they simply won't pay for anything that doesn't come up to scratch.
The post The Times they are a-changing…. appeared first on Flax.