The post London Lucene/Solr Meetup – Relevance tuning for Elsevier’s Datasearch & harvesting data from PDFs appeared first on Flax.
Relevance is a major concern for this kind of system and Elsevier have developed many strategies for relevance tuning. These include highlighting and auto-suggest, lemmatisation rather than stemming (with scientific data, stemming can cause issues such as turning ‘Age’ into ‘Ag’ – the chemical symbol for silver) and a custom rescoring algorithm that can promote up to three data results to the top of the list if they are deemed particularly relevant. Elsevier use both search logs and test queries generated by subject matter experts to feed a custom-built judgement tool, which they hope to open source at some point (this would be a great complement to Quepid for test-based relevance tuning).
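Elsevier’s rescorer is their own, but Solr’s stock ReRank query parser gives a rough feel for how a second-pass score can promote a handful of results. The SolrJ sketch below is illustrative only – the collection name, field and rerank query are hypothetical.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RescoreSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical collection name - adjust for your own Solr setup
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/datasearch").build();

    SolrQuery query = new SolrQuery("sea surface temperature");
    // Re-score the top 100 hits of the main query, adding weight to documents
    // that also match the rerank query (here, a hypothetical 'curated' flag)
    query.set("rq", "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=5}");
    query.set("rqq", "source_quality:curated");
    query.setRows(10);

    QueryResponse response = solr.query(query);
    response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
    solr.close();
  }
}
```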
Peter also described a strategy for automatic optimization of the many query parameters available in Solr, using machine learning and based on ideas first proposed by Simon Hughes of dice.com. Elsevier have also developed a Phrase Service API, which improves phrase-based search over the standard unordered ‘bag of words’ model by recognising acronyms, chemical formulae, species, geolocations and more, expanding the original phrase based on these terms and then boosting them using Solr’s query parameters. He also mentioned a ‘push API’ available for data providers to push data directly into DataSearch. This was a necessarily brief dive into what is obviously a highly complex and powerful search engine built by Elsevier using many cutting-edge ideas.
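The Phrase Service itself is internal to Elsevier, but the kind of boosting it feeds can be expressed with standard eDisMax query parameters. The sketch below is an assumption-laden illustration: the field names, weights and the expansion of ‘CO2’ to ‘carbon dioxide’ are all invented.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PhraseBoostSketch {
  public static void main(String[] args) throws Exception {
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/datasearch").build();

    // Hypothetical user query; assume an upstream service has recognised "CO2"
    // as a chemical formula and expanded it to "carbon dioxide"
    SolrQuery query = new SolrQuery("CO2 emissions");
    query.set("defType", "edismax");
    query.set("qf", "title^3 description");      // fields to search (illustrative weights)
    query.set("pf", "title^10 description^4");   // boost documents matching the terms as a phrase
    query.set("ps", "2");                        // allow a little slop in the phrase match
    query.set("bq", "\"carbon dioxide\"^5");     // boost the expansion produced by the phrase service

    System.out.println(solr.query(query).getResults().getNumFound() + " results");
    solr.close();
  }
}
```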
Our next speaker, Michael Hardwick of Elite Software, talked about how textual data is stored in PDF files and the implications for extracting this data for search applications. In an engaging (and at times slightly horrifying) talk he showed how PDFs effectively contain instructions for ‘painting’ characters onto the page and how certain essential text items such as spaces may not be stored at all. He demonstrated how fonts are stored within the PDF itself, how character encodings may be deliberately incorrect to prevent copy-and-paste operations, and in general how very little if any semantic information is available. Using newspaper content as an example he showed how reading order is often difficult to extract, as the PDF layout combines the text from the original author with the way it has been laid out on the page by an editor – so the headline may have been added after the article text, which itself may have been split up into sections.
Tables in PDFs were described as a particular issue when attempting to extract numerical data for re-use – the data may not be stored in the same order as it appears on the page, for example if only part of a table is updated for each issue of a weekly publication. With PDF files sometimes compressed and encrypted, the task of data extraction can become even more difficult. Michael laid out the choices available to those wanting to extract data: optical character recognition, a potentially very expensive Adobe API (which only gives the same quality of output as copy-and-paste), custom code such as that developed by his company, and finally manual retyping – the latter being surprisingly common.
Thanks to both our speakers and our hosts Elsevier – we’re planning another Meetup soon, hopefully in mid to late June.
The post London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco appeared first on Flax.
Our first talk was from Uwe Schindler, Lucene committer, who started with some history of how early Java 7 releases had broken Apache Lucene in somewhat spectacular fashion. After this incident the Oracle JDK team and the Lucene PMC worked closely together to improve both communications and testing, with regular builds of Java 8 (using Jenkins) being released to test with Lucene. The Oracle team later publicly thanked the Lucene committers for their help in finding Java issues. Uwe told us how Java 9 introduced a module system named ‘Jigsaw’ which tidied up various inconsistencies in how Java keeps certain APIs private (but not actually private) – this caused some problems with Solr. Uwe also mentioned how Lucene’s MMapDirectory should be used on 64-bit platforms (there’s a lot more detail on his blog) and various intrinsic bounds-checking features which can be used to simplify Lucene code. The three main advantages of Java 9 that he mentioned were lower garbage collection times (with the new G1GC collector), more security features and in some cases better query performance. Going forward, Uwe is already looking at Java 10 and future versions and how they impact Lucene – but for now he’s been kind enough to share his slides from the Meetup.
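As a minimal sketch of that point (with an invented index path): on a 64-bit JVM Lucene’s generic FSDirectory.open() already picks MMapDirectory, or it can be chosen explicitly, as below.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MMapSketch {
  public static void main(String[] args) throws Exception {
    // Explicitly memory-map the index; on 64-bit platforms the generic
    // FSDirectory.open(path) would select MMapDirectory automatically anyway
    Directory dir = new MMapDirectory(Paths.get("/path/to/index"));

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    System.out.println(searcher.getIndexReader().numDocs() + " documents in the index");
  }
}
```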
Our second speaker was Andy Hind, head of search at Alfresco. His presentation included the obvious Austin Powers references, of course! He described the architecture Alfresco use for search (a recent blog post also shows this – interestingly, although Solr is used, Zookeeper is not; Alfresco uses its own method to handle many Solr servers in a cluster). The test system described ran on the Amazon EC2 cloud with 10 Alfresco nodes and 20 Solr nodes and indexed around 1.168 billion items. The source data was synthetically generated to simulate real-world conditions with a certain amount of structure – this allowed queries to be built to hit particular areas of the data. 5000 users were set up, with around 500 concurrent users assumed. The test system managed to index the content in around 5 days at a speed of around 1000 documents a second, which is impressive.
Thanks to both our speakers and we’ll return soon – if you have a talk for our group (or can host a Meetup) do please get in touch.
The post London Lucene/Solr Meetup – Introducing Marple & Solr Classification appeared first on Flax.
Alan told us how Marple was conceived at the Lucene4IR event in Glasgow last year and how coding started at our Lucene Hackday in London. Although the well-known tool Luke allows one to dive deep into Lucene indexes, it hasn’t kept up with recent additions to Lucene index structures, and we also wanted to build a tool with a RESTful API and a separate GUI so it could be run easily on our clients’ indexes in read-only mode. Alan demonstrated Marple’s features, including how it reveals the ‘hidden’ Lucene index fields that Elasticsearch creates. The first release of Marple is out and we’d welcome any feedback and contributions.
Next up was Alessandro Benedetti with an engaging talk about Solr’s built-in document classification features, useful for everything from spam filtering to automatic product categorisation. Unlike many classification methods, this uses the Lucene index itself as the training set – the index must contain some documents with manually assigned classification fields. Either K-Nearest-Neighbour or Naive Bayes algorithms can be used to perform the classification via Solr’s UpdateRequestProcessor chain, available from Solr 6.1 onwards. You can read more detail on Alessandro’s excellent blog.
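In Solr the classifier is wired into the update chain via configuration; underneath sit the classifiers in Lucene’s classification module, which the sketch below uses directly against an existing index. Treat it as a rough illustration only: the index path and field names are assumptions, and the constructor follows the Lucene 6.x API, which may differ slightly in other versions.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class ClassifySketch {
  public static void main(String[] args) throws Exception {
    // The existing index acts as the training set: documents already carrying a value
    // in the "category" field teach the classifier (field names are illustrative)
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));

    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(
        reader, new BM25Similarity(), new StandardAnalyzer(),
        null,              // optional query restricting the training documents
        10, 1, 1,          // k, minimum doc frequency, minimum term frequency
        "category",        // field holding the manually assigned class
        "title", "body");  // text fields used as features

    ClassificationResult<BytesRef> result =
        classifier.assignClass("text of a new, unclassified document");
    System.out.println(result.getAssignedClass().utf8ToString() + " " + result.getScore());
    reader.close();
  }
}
```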
We concluded with a brief Q&A session and then popped downstairs to a pub for some snacks and drinks. Thanks to both our speakers, our hosts and all who came – we’ll return in a couple of months with talks that will include René Kriegler on his neat Querqy query processor.
The post Just the facts with Solr & Luwak appeared first on Flax.
At our recent London Lucene/Solr Meetup, UK charity Full Fact spoke eloquently on the need for automated factchecking tools to help identify and correct stories that are demonstrably false. They’ve also published a great report on The State of Automated Factchecking, which mentions both Apache Solr and our powerful stored query library Luwak as components of their platform. We’ve been helping Full Fact with their prototype factchecking tools for a while now, but during the Meetup I suggested we might run a hackday to develop these further.
Thus I’m very pleased to announce that Facebook have offered us a venue in London for the hackday on January 20th (register here). Many Solr developers, including several committers and PMC members, are signed up to attend already. We’ll use Full Fact’s report and their experiences of factchecking newspapers, TV’s Question Time and Hansard to design and build practical, useful tools and identify a future roadmap. We’ll aim to publish what we build as open source software which should also benefit factchecking organisations across the world.
If you’re concerned about the impact of fake news on the political process and want to help, join the Meetup and/or donate to Full Fact.
The post A tale of two cities (and two Lucene Hackdays) appeared first on Flax.
Several days later we ran a similar Hackday in Boston, as many Lucene people were in town for Lucene Revolution. Many more Lucene/Solr committers attended this time and enjoyed a chance to work on their own projects or to continue some of the work we’d started in London. Doug Turnbull came up with a way to do BM25F ranking with existing Lucene features, while Alexandre Rafalovitch and I had a long conversation about minimal Solr examples and improving the way beginners can start with Solr. Other projects included new field types for Lucene, improved highlighters and DocValues. BA Insight were kind enough to provide the venue and Lucidworks sponsored drinks and snacks later in the pub downstairs.
We’ve gathered notes on what we worked on with links to some of the software we developed here – please do get involved if you can! In particular the Marple project is attracting further contributions (and interest from those who developed and maintain the existing Luke Lucene index inspector).
I’d like to thank everyone who came to the Hackdays, our generous sponsors for providing venues, food and drink and to those who helped organise the events. The feedback has been excellent (and do let us know if you have any further comments) and people seem keen for this to be a regular event before the annual Lucene Revolution conference – a chance to work on Lucene-based projects outside of regular work, to meet, network and spend time with other contributors and to enjoy being part of a great open source community. We’ll be back!
The post Search and other events for Autumn 2012 appeared first on Flax.
We’ll be briefly visiting the trade stands at FIBEP 2012 on October 4th in the historic town of Krakow, Poland – this is part of a major media monitoring event, the 45th FIBEP Congress. We’re looking forward to meeting companies in the media monitoring sector and talking about some of our projects in that area.
On November 29th we’re planning to attend Search Solutions 2012 at the BCS in Covent Garden, London – this is an excellent one-day event on all the technical aspects of search. You can read my review of last year’s event to find out more about what to expect.
There’s sure to be more to come!
The post Enterprise Search Europe & a SuperSized Search Meetup appeared first on Flax.
It’s going to be a busy time as I’m also chairing a panel on the first day and helping run the evening reception, which is co-hosted by the London and Cambridge Search Meetups – this is likely to be one of the largest Search Meetups ever and is sure to be a fascinating evening, featuring speakers from the conference in an informal setting (i.e., a pub!).
Hope to see some of you there.