London Lucene/Solr Meetup – Introducing Marple & Solr Classification

Charlie Hull — Mon, 27 Mar 2017 13:16:36 +0000

A small crowd for this month’s London Lucene/Solr Meetup, kindly hosted by Barclays in their sumptuous Canary Wharf offices. I introduced the Meetup and spoke briefly on how Flax is currently looking for team members (want to work on a variety of cutting-edge open source search projects in the UK and abroad? Get in touch!) before introducing Flax’s Alan Woodward who introduced our new Lucene index inspection tool, Marple.

Alan told us how Marple was conceived at the Lucene4IR event in Glasgow last year and how coding started at our Lucene Hackday in London. Although the well-known tool Luke allows one to dive deep into Lucene indexes, it hasn’t kept up with recent additions to Lucene index structures and we also wanted to build a tool with a RESTful API and separate GUI to allow it to be run easily on our client’s indexes in a read-only mode. Alan demonstrated Marple’s features including how it allows one to see the ‘hidden’ Lucene index fields that Elasticsearch creates. The first release of Marple is out and we’d welcome any feedback and contributions.

Next up was Alessandro Benedetti with an engaging talk about Solr’s built-in document classification features, useful for everything from spam filtering to automatic product categorisation. Unlike many classification methods, this uses the Lucene index itself as the training set – this index must contain some documents with manually assigned classification fields. Either K-Nearest-Neighbour and Naive Bayes algorithms can be used to perform the classification via Solr’s UpdateRequestProcessor chain, in Solr versions after 6.1. You can read more detail on Alessandro’s excellent blog.

We concluded with a brief Q&A session and then popped downstairs to a pub for some snacks and drinks. Thanks to both our speakers, our hosts and all who came – we’ll return in a couple of months with talks that will include René Kriegler on his neat Querqy query processor.

The post London Lucene/Solr Meetup – Introducing Marple & Solr Classification appeared first on Flax.

Clade – a freely available, open source taxonomy and autoclassification tool

Charlie Hull — Tue, 12 Jun 2012 08:44:33 +0000

One way to manage digital information is to classify it into a series of categories or a heirarchical taxonomy, and traditionally this was done manually by analysts, who would examine each new document and decide where it should fit. Building and maintaining taxonomies can also be labour intensive, as these will change over time (for a simple example, just consider how political parties change and divide, with factions appearing and disappearing). Search engine technology can be used to automate this classification process and the taxonomy information used as metadata, so that search results can be easily filtered by category, or automatically delivered to those interested in a particular area of the heirarchy.

We’ve been working on an internal project to create a simple taxonomy manager, which we’re releasing today in a pre-alpha state as open source software. Clade lets you import, create and edit taxonomies in a browser-based interface and can then automatically classify a set of documents into the heirarchy you have defined, based on their content. Each taxonomy node is defined by a set of keywords, and the system can also suggest further keywords from documents attached to each node.

This screenshot shows the main Clade UI, with the controls:

A – dropdown to select a taxonomy
B – buttons to create, rename or delete a taxonomy
C – the main taxonomy tree display
D – button to add a category
E – button to rename a category
F – button to delete a category
G – information about the selected category
H – button to add a category keyword
I – button to edit a keyword
J – button to toggle the sense of a keyword
K – button to delete a keyword
L – suggested keywords
M – button to add a suggested keyword
N – list of matching document IDs
O – list of matching document titles
P – before and after document ranks

Clade is based on Apache Solr and the Stanford Natural Language Processing tools, and is written in Python and Java. You can run it on on either Unix/Linux or Windows platforms – do try it and let us know what you think, we’re very interested in any feedback especially from those who work with and manage taxonomies. The README file details how to install and download it.

The post Clade – a freely available, open source taxonomy and autoclassification tool appeared first on Flax.

classification – Flax

London Lucene/Solr Meetup – Introducing Marple & Solr Classification

Clade – a freely available, open source taxonomy and autoclassification tool