This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but it crammed in three great talks and attracted a wide range of people from both academia and business. We started with Gabriella Kazai of Lumi, talking about how they have built a crowd-curated content platform for around 80,000 users, whose interests and recommendations are mined to recommend content to others. Using Elasticsearch as a base, the system ingests around 100 million tweets a day and follows links to any quoted content, which is then filtered and analysed using a variety of techniques including NLP and NER to produce a content pool of around 60,000 articles. I’ve been aware of Lumi since our ex-colleague Richard Boulton worked there, but it was good to understand more about their software stack.
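To make that pipeline a little more concrete, here’s a minimal sketch of the “follow the link, extract entities, index the article” stage. It’s purely illustrative (Gabriella didn’t show code): it assumes spaCy for NER and the official Elasticsearch Python client, and the index name and field layout are my own invention, not Lumi’s actual stack.

```python
# Illustrative sketch only -- not Lumi's actual stack.
# Assumes a spaCy English model and the official Elasticsearch Python client (8.x).
import spacy
from elasticsearch import Elasticsearch

nlp = spacy.load("en_core_web_sm")            # small English model with NER
es = Elasticsearch("http://localhost:9200")   # hypothetical local cluster

def index_article(url: str, text: str) -> None:
    """Run NER over the linked article text and index it for later recommendation."""
    doc = nlp(text)
    entities = sorted({ent.text for ent in doc.ents
                       if ent.label_ in {"ORG", "PERSON", "GPE"}})
    es.index(index="content-pool", document={
        "url": url,
        "text": text,
        "entities": entities,
    })
```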
Next was Miguel Martinez-Alvarez of Signal, who are also dealing with huge amounts of data on a daily basis – over a million documents a day from over 100,000 sources, plus millions of blogs. Their ambition is to analyse “all the world’s news” and allow their users to create complex queries over this – “all startups in London working on Machine Learning” being one example. Their challenges include around two-thirds of their ingested news articles being duplicates (due to syndicated content, for example), and they have built a highly scalable platform, again with Elasticsearch as a major component. Miguel talked in particular about how Signal work closely with academic researchers (including Professor Udo Kruschwitz of the University of Essex, with whom I will be collaborating next year) to develop cutting-edge analytics, with an Agile Data Science approach built around some key evaluation questions: will it scale? Will the accuracy gain be worth the extra computing power?
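Miguel didn’t go into detail on how Signal detect those duplicates, but a common baseline for this sort of syndicated-content problem is near-duplicate detection over word shingles, sketched below purely as an illustration. At a million documents a day you would of course replace the pairwise comparison with a sketching scheme such as MinHash/LSH.

```python
# Illustrative sketch only -- Signal's actual deduplication approach wasn't described.
# Near-duplicate detection via Jaccard similarity over k-word shingles.

def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two articles as duplicates if their shingle sets overlap heavily."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    jaccard = len(sa & sb) / len(union) if union else 0.0
    return jaccard >= threshold
```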
Our last talk was from Miles Osborne of our hosts Bloomberg, who have recently signed a deal with Twitter to ingest all past and forthcoming tweets – now that’s Big Data! The object of Miles’ research is to identify tweets that might affect a market and can thus be traded on, as early as possible after an event happens. His team have noticed that these tweets are often well written (as opposed to the noise and abbreviations in most tweets) and seldom retweeted (no point letting your competitors know what you’ve spotted). Dealing with 500m tweets a day, they have developed systems to filter and route tweets into topic streams (which might represent a subject, location or bespoke category) using machine learning. One approach has been to build models using ‘found’ data (i.e. data that Bloomberg already has available) and to pursue a ‘simple is best’ methodology – although one model has 258 million features! Encouragingly, the systems they have built are now ‘good enough’ to react swiftly to a crisis event that might significantly affect world markets.
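Purely as an illustration of the ‘simple is best’ idea (and of how a model can plausibly end up with hundreds of millions of mostly sparse features), here is a toy sketch of routing tweets into topic streams with hashed bag-of-words features and a plain linear classifier using scikit-learn. This is my own example, not Bloomberg’s system, and the training data is whatever labelled ‘found’ data you happen to have.

```python
# Illustrative sketch only -- not Bloomberg's actual system.
# Hashed bag-of-words features plus a simple linear classifier for topic routing.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**22, alternate_sign=False)  # sparse hashed features
classifier = SGDClassifier()  # simple linear model, trainable incrementally

def train(tweets: list, topics: list) -> None:
    """Fit the router on labelled 'found' data: tweets already tagged with topics."""
    classifier.partial_fit(vectorizer.transform(tweets), topics,
                           classes=sorted(set(topics)))

def route(tweet: str) -> str:
    """Assign an incoming tweet to a topic stream."""
    return classifier.predict(vectorizer.transform([tweet]))[0]
```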
We finished with networking, drinks and snacks (amply provided by our generous hosts) and I had a chance to catch up with a few old contacts and friends. Thanks to the organisers for a very interesting evening and the last event of this year for me – see you in 2016!