Meetup at Big Data London – One-click Solr & Factchecking with Solr

Charlie Hull — Thu, 10 Nov 2016 11:22:26 +0000

Last week I spoke at the Big Data London conference, a very busy event with several thousand people attending. My session was on using open source search to make sense of Big Data – you can get slides here.

In the evening we ran another Lucene/Solr London Usergroup event with speakers Upayavira and Full Fact. After a brief but friendly fight with the DataStax team over pizza, we settled down to watch Upayavira show us his method for creating a fully functional SolrCloud stack and search application from a single command line, using tools such as Docker, Rancher and Exhibitor. His system only needs to be given details of an Amazon Web Services account. From there it creates host instances, installs and starts ZooKeeper, waits for a quorum to be established, installs and starts Solr, creates a SolrCloud cluster and finally installs and starts a search application. The whole thing is managed by his own script, Uberstack, and is undeniably impressive.
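The "wait for a quorum" step is the part that is easiest to get wrong when scripting a stack like this. As a rough illustration only (this is not Uberstack’s code, and the function names are hypothetical), the sketch below polls each ZooKeeper node with the standard `srvr` four-letter command and treats the ensemble as ready once one node reports itself as leader and a majority are serving. It assumes the `srvr` command is enabled on the servers.

```python
import socket
import time


def parse_mode(srvr_output):
    """Pull the 'Mode:' value (leader, follower, standalone) out of 'srvr' output."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None  # node is up but not serving, or the output was unexpected


def zk_mode(host, port=2181, timeout=2.0):
    """Send the 'srvr' four-letter command to one node; return its mode or None."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        return None  # not reachable yet
    return parse_mode(b"".join(chunks).decode("utf-8", "replace"))


def has_quorum(modes):
    """True when a leader exists and more than half of the nodes are serving."""
    serving = [m for m in modes if m in ("leader", "follower")]
    return "leader" in serving and len(serving) > len(modes) // 2


def wait_for_quorum(hosts, poll_seconds=5, max_wait=300, probe=zk_mode):
    """Poll every host until has_quorum() holds or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if has_quorum([probe(h) for h in hosts]):
            return True
        time.sleep(poll_seconds)
    return False
```

Only once `wait_for_quorum` returns `True` would a script like this go on to start Solr in cloud mode; the `probe` parameter exists purely so the polling logic can be exercised without a live ensemble.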

Our second talk (and I think my favourite talk from all our Solr Meetups) was from Will Moy and Mevan Babakar of Full Fact, a charity who monitor the news for accuracy (something we increasingly require in these ‘post-truth’ days). Will told us how false and misleading claims can be amplified by the media and may end up directly influencing government policy, even though the underlying facts are wrong. Full Fact are attempting to build open source, freely available systems for automating the factchecking process using Apache Lucene/Solr and our own stored query library Luwak, and Flax have been donating some time to help them with this. Their Hawk system currently indexes over 70 million sentences. This project is a wonderful example of how free, open source software can be used to create tools that benefit us all, and at the end of this inspiring talk many of the audience offered ideas and even direct assistance with the project. I urge you to read Full Fact’s recent report on automated factchecking and get involved if you can. One idea was to run a Hackday for Full Fact – more details when we have them.
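For anyone unfamiliar with the stored query idea: instead of indexing documents and running queries against them, you register the queries up front and run each new document past them. The toy sketch below shows only that inversion, using simple all-terms-must-appear matching. It is not Luwak’s API (Luwak is a Java library built on Lucene), and the class name, query ID and example sentence are all made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class StoredQueryMatcher:
    """Toy reverse search: register queries once, then match each new document."""

    def __init__(self):
        self._queries = {}  # query id -> set of terms that must all be present

    def register(self, query_id, query_text):
        self._queries[query_id] = set(tokenize(query_text))

    def match(self, document):
        """Return the ids of every stored query whose terms all occur in the document."""
        terms = set(tokenize(document))
        return sorted(qid for qid, required in self._queries.items()
                      if required <= terms)


matcher = StoredQueryMatcher()
matcher.register("nhs-spending", "NHS spending")  # hypothetical stored claim
matches = matcher.match("Spending on the NHS has risen every year")
```

Each incoming sentence is checked against the registered queries, so a monitoring system can flag new text that resembles a claim it already knows about.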

Thanks to Big Data London for inviting me to speak and hosting the Meetup and to Elsevier for sponsoring pizza and drinks. We’ll be back with another Meetup soon!

The post Meetup at Big Data London – One-click Solr & Factchecking with Solr appeared first on Flax.

Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara

Charlie Hull — Thu, 18 Feb 2016 11:42:07 +0000

Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations, and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing, as you can see (hopefully) in this photo of their architecture:

[Photo: Skyscanner’s streaming architecture]

Elasticsearch provides querying capabilities, with visualisations using Kibana. Joseph’s team have focused on making the system (and the tasks that run on it) easy to deploy and use. Deployment is currently managed using Ansible and TeamCity, although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.
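To make "windowed aggregations" a little more concrete, here is a minimal sketch of a tumbling-window count in plain Python. It is not Skyscanner’s code and does not use the Samza API; the event shape (a timestamp in seconds plus a key) and the function name are assumptions for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, key) for an iterable of (timestamp, key) pairs.

    Windows are fixed-size and non-overlapping: an event at time t falls in the
    window starting at the largest multiple of window_seconds that is <= t.
    """
    counts = Counter()
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "search"), (59, "search"), (60, "search"), (61, "booking")]
per_minute = tumbling_window_counts(events)
```

A stream processor does the same bookkeeping incrementally as messages arrive, keeping the running counts as state rather than looping over a finished list.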

Next was Bruno Bonacci showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With such a short timescale, Bruno took a pragmatic approach, combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can be done either at ingestion time or at query time. Aggregating at ingestion time means you must store all the data you might need in separate chunks, which can end up taking huge amounts of storage. Aggregating at query time is far more flexible, especially when you don’t yet know what questions you’ll need to answer. His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather an in-memory key-value store. For resiliency, this store creates a log which is emitted as a Kafka stream. His approach seems to have huge performance benefits: he has demonstrated Samsara running on a single core to be 72 times faster than a 4-core Spark Streaming system. Bruno and his team have released Samsara as open source and are working on new processing modules including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see if our own work combining our stored query library Luwak with Samza could also be made to work with Samsara.
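The state-handling pattern described here (keep state in memory, write every change to a log, rebuild from the log after a failure) can be sketched in a few lines. This is a generic illustration, not Samsara Core’s implementation: the class and method names are hypothetical, and the `emit` callable merely stands in for a producer writing to a Kafka topic.

```python
class ChangelogStore:
    """In-memory key-value store that emits every write to a changelog."""

    def __init__(self, emit):
        self._data = {}
        self._emit = emit  # callable taking a (key, value) change record

    def put(self, key, value):
        self._data[key] = value
        self._emit((key, value))  # the log is what makes the state recoverable

    def get(self, key, default=None):
        return self._data.get(key, default)

    @classmethod
    def restore(cls, changelog, emit):
        """Rebuild a store by replaying change records in order, without re-emitting."""
        store = cls(emit)
        for key, value in changelog:
            store._data[key] = value
        return store


log = []  # stand-in for the Kafka stream
store = ChangelogStore(log.append)
store.put("user-1", 1)
store.put("user-1", 2)
store.put("user-2", 5)

# After a crash, a fresh process replays the log to recover the same state.
recovered = ChangelogStore.restore(log, emit=lambda change: None)
```

Reads and writes never leave memory, which is where the speed comes from; the cost is that recovery time grows with the length of the log unless it is compacted.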

Thanks to Alex Dean of Snowplow for organising a very interesting evening and of course, to both the speakers.

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.
