events – Flax

Little Mermaids, Haystacks and moving on

Charlie Hull — Fri, 15 Feb 2019 09:47:25 +0000

As I announced recently Flax is joining OpenSource Connections, and I recently spent a very pleasant week in Virginia with my new colleagues discussing our plans for the year to come. Without giving too much away I can say that this is a very exciting time to be joining OSC: one thing I will be doing soon is starting to write more about OSC’s proven process for supporting our clients as they move up the search relevance curve.

However before then I’ll be speaking at a few events. At the end of this month I’ll be in Copenhagen to speak on Keeping Search Relevant in a Digital Workplace at the Intrateam conference. This is a fantastic conference on intranets and I’m looking forward to speaking for the second time and joining a very august gathering of speakers. I’m also glad to be returning to both City University and the University of Essex during February and March to talk to students about working in search and information retrieval.

In April I’ll be returning to the US for OSC’s Haystack search relevance conference, which was my favourite event of last year – I liked it so much I brought it to London that October. This year we have a fantastic lineup of talks from speakers representing organisations including LexisNexis, Wikimedia Foundation, Eventbrite and Yelp, a new and more capacious venue in downtown Charlottesville, three training options before the main conference (Think Like A Relevance Engineer for Elasticsearch and Solr, and Learning to Rank) and of course the chance to meet, chat with and get to know some of the best search people in the business. Earlybird tickets are available until the end of February and are already selling well, so make your plans to join us soon!

It’s already shaping up to be a busy year – so do keep an eye on this blog and my new home at www.opensourceconnections.com/blog for further news, and if you’d like to know how OSC can help you empower your search team, get in touch.

The post Little Mermaids, Haystacks and moving on appeared first on Flax.

Haystack Europe 2018, a brief retrospective

Charlie Hull — Mon, 15 Oct 2018 15:15:49 +0000

It’s been a couple of weeks now since the first Haystack search relevance conference in Europe, which we ran with our partners Open Source Connections (OSC). Just under a hundred people came to the Friends’ House in Euston for a day of talks covering both the business and technical aspects of relevance engineering. Doug Turnbull of OSC started the day by introducing what would be a major theme of the conference, Learning to Rank, and how Bloomberg had used and benefited from open sourcing their LTR plugin for Solr. Karen Renshaw of Zoro (a division of Grainger Global Online) talked about how to tune relevance from a business perspective. Sebastian Russ of Tudock showed how even something as simple as an Excel spreadsheet can be a useful visualisation tool for relevance, while Alessandro Benedetti and Andrea Gazzarini of Sease demonstrated Rated Ranking Evaluator, a complete platform for relevance measurement. After lunch, Torsten Köster & Fabian Klenk of Shopping 24 and consultant René Kriegler described their journey with LTR for an ecommerce site and Agnes Van Belle of Textkernel showed how similar techniques can be applied to recruitment search. Tony Russell-Rose was our last speaker on strategies and tools for managing complex Boolean queries.

My only regret was how little time I had personally to catch up with the attendees, many of whom were from Flax clients past and present – I must have had 20 or 30 very brief chats during the day! Luckily a few of us went on for a drink afterwards and eventually a curry nearby. It was a very long day but from the feedback we’ve received so far a very successful one. We hope to make this a regular event on the calendar.

Thanks to all who made the event possible, our speakers and everyone who came – the slides are now available on the event website.

The post Haystack Europe 2018, a brief retrospective appeared first on Flax.

Three weeks of search events this October from Flax

Charlie Hull — Tue, 04 Sep 2018 10:11:56 +0000

Flax has always been very active at conferences and events – we enjoy meeting people to talk about search! With much of our consultancy work being carried out remotely these days, attending events is a great way to catch up in person with our clients, colleagues and peers and to learn from others about what works (and what doesn’t) when building cutting-edge search solutions. I’m thus very glad to announce that we’re running three search events this coming October.

Earlier in the year I attended Haystack in Charlottesville, one of my favourite search conferences ever – and almost immediately began to think about whether we could run a similar event here in Europe. Although we’ve only had a few months I’m very happy to say we’ve managed to pull together a high-quality programme of talks for our first Haystack Europe event, to be held in London on October 2nd. The event is focused on search relevance from both a business and a technical perspective and we have speakers from global retailers as well as specialist consultants and authors. Tickets are already selling well and we have limited space, so I would encourage you to register as soon as you can (Haystack USA sold out even after the capacity was increased). We’re running the event in partnership with Open Source Connections.

The next week we’re running a Lucene Hackday on October 9th as part of our London Lucene/Solr Meetup programme. Building on previous successful events, this is a day of hacking on the Apache Lucene search engine and associated software such as Apache Solr and Elasticsearch. You can read up on what we achieved at our last event a couple of years ago – again, space is limited, so sign up soon to this free event (huge thanks to Mimecast for providing the venue and to Elastic for sponsoring drinks and food for an evening get-together afterwards). Bring a laptop and your ideas (and do comment on the event page if you have any suggestions for what we should work on).

We’ll be flying to Montreal soon afterwards to attend the Activate conference (run by our partners Lucidworks) and while we’re there we’ll host another free Lucene Hackday on October 15th. Again, this would not be possible without sponsorship and so thanks must go to Netgovern, SearchStax and One More Cloud. Remember to tell us your ideas in the comments.

So that’s three weeks of excellent search events – see you there!

The post Three weeks of search events this October from Flax appeared first on Flax.

Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018

Charlie Hull — Mon, 18 Jun 2018 13:53:27 +0000

I spent last week in a sunny Berlin for the Berlin Buzzwords event (and subsequently MICES 2018, of which more later). This was my first visit to Buzzwords, which was held in an arts & culture complex in an old brewery north of the city centre. The event was larger than I was expecting at around 550 people with three main tracks of talks. Although due to some external meetings I didn’t attend as many talks as I would have liked, here are a few highlights. Many of the talks have slides provided and some are now also available on the Buzzwords YouTube channel.

Giovanni Fernandez-Kincade talked about query understanding to improve both recall and precision for searches. He made the point that users and documents often speak very different languages which can lead to a lack of confidence in the search engine. Various techniques are available to attempt to translate the user’s intention into a suitable query and these can be placed on a spectrum from human-powered (e.g. creating an exception list to prevent stemming of proper nouns) to some degree of automation (e.g. harvesting data to build lists of synonyms) to full automation (machine learning of how queries map to documents). Obviously these also fit on other scales from labour-intensive to hands-off and easy to hard in terms of the technology skills required. This talk gave a solid base understanding of the techniques available.

I dropped in on Suneel Marthi’s talk on detecting tulip fields from satellite images, which was fascinating although outside my usual area of search engine technology. I then heard Nick Burch describe the many ways that text extraction powered by Apache Tika can crash your JVM or even your entire cluster (potentially expensive in an elastically-scaling situation as more resources are automatically allocated!). As he recommended one should expect failure and plan accordingly, ship logs somewhere central for analysis and never run Tika inside your Solr instance itself in a production system (a recommendation that has finally made it to the Solr Wiki). Doug Turnbull and Tommaso Teofili then spoke on The Neural Search Frontier, a wide-ranging and in some places somewhat speculative discussion of techniques to improve ranking using word embeddings described by multidimensional vectors. This approach combined traditional IR techniques with neural models to learn whether a document is relevant to a query. One fascinating idea was the use of recurrent neural networks, much used in translation applications, to ‘translate’ a document to a predicted query. As with most of Doug’s talks this gave us a lot to think about but he finished with a plea for better native vector support in Lucene-based search engines.

The next talk I heard was from Varun Thacker on Solr autoscaling which I know is a particular concern of some of our clients as their data volumes grow. These new features in Solr version 7 allow policies and preferences to be set up to govern autoscaling behaviour, where shards may be moved and new cores created automatically based on metrics such as disk space or queries-per-second. One interesting line of questioning from the audience was how to prevent replicas from ‘ping ponging’ between hosts – e.g. moving from a node with low disk space to one with more disk space, but then causing a reduction in disk space on the target node, leading to another move. Usefully the autoscaling system can be set to compute a list of operations but leave execution to a human operator, which may help prevent this problem.

The next day I attended Tomás Fernández Löbbe’s talk on new replica types in Solr 7, which talked about the advantages of the ‘Master/Slave’ model for search cluster design as opposed to the standard SolrCloud ‘every node does everything’ model. The new replica types PULL and TLOG allow one to build a master/slave setup in SolrCloud, separating responsibility for indexing and searching and even choosing which type of replica to use in queries. I also heard Houston Putman talk about data analytics with Solr, describing how built-in Solr functions can carry out the type of analytics previously only possible with Apache Spark or Hadoop and avoiding the extra cost of shipping data out of Solr. Unfortunately that was the end of my conference due to some other commitments but it was great to catch up with various search people from Europe and further abroad and to enjoy what was a well-organised and interesting event.
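
As a rough illustration of the replica types mentioned above, here is a minimal sketch that only builds the request URLs one might send to SolrCloud. It assumes the `tlogReplicas` and `pullReplicas` parameters of the Collections API CREATE action and the `shards.preference` query parameter (available in later Solr 7 releases); the host, collection name and helper function names are hypothetical and were not part of the talk.

```python
from urllib.parse import urlencode

# Hypothetical Solr base URL, for illustration only.
SOLR = "http://localhost:8983/solr"


def create_collection_url(name, shards, tlog, pull):
    """Build a Collections API URL asking for TLOG and PULL replicas.

    TLOG replicas take part in indexing; PULL replicas only copy the
    index from the shard leader and serve searches.
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": shards,
        "tlogReplicas": tlog,
        "pullReplicas": pull,
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def search_url(collection, query):
    """Build a query URL that asks Solr to prefer PULL replicas."""
    params = {"q": query, "shards.preference": "replica.type:PULL"}
    return f"{SOLR}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(create_collection_url("news", shards=1, tlog=1, pull=2))
    print(search_url("news", "title:haystack"))
```

Keeping indexing on the TLOG replicas and pointing queries at the PULL replicas is what gives the separation of responsibilities described in the talk.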

The post Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018 appeared first on Flax.

Search Solutions 2017 review

Charlie Hull — Thu, 14 Dec 2017 15:33:19 +0000

Search Solutions is one of my favourite search events of the year – small, focused and varied, with presentations from both the largest and smallest players in the world of search, drawn from both industry and academia.

This year’s event started with Edgar Meij of Bloomberg, who Flax have helped in the past with their large-scale search and alerting systems. I’d seen most of the details in this talk before so I won’t dwell on them but will thank Bloomberg again for their commitment and contributions to the open source community, particularly to Solr and our Luwak stored search library. Mark Fea of LexisNexis was up next with a talk about taxonomies and how they have built a semi-automated classification system combining supervised machine learning and Boolean rules-based systems: a pragmatic approach to combine the strengths of both approaches as machine learning isn’t always as clever as one might want, and Boolean rules can be hard to build and maintain. Like Bloomberg they are working at large scale: Mark mentioned taxonomies of 21,000 terms and 9 levels, applied to over 1 billion documents.

Mark Harwood of Elastic was up next with one of his always fascinating talks on discovering unknown patterns in data with Elasticsearch. He showed how he had explored ‘toxic’ content (far-right music and those who like it) and fake reviews on Amazon with some great visual demonstrations. An interesting conclusion was how ‘bad actors’ make strange, recognisable shapes in visualised data. [Mark later won the Best Presentation award, richly deserved!]. Anna Kolliakou of King’s College London spoke next on ‘veracity intelligence’ tools to help monitor terms connected to mental health across news media and social networks: an interesting example was ‘mephedrone’ around the time of reclassification of this particular recreational drug. Next up was independent consultant Phil Bradley with a detailed, well-researched and passionate talk on fake news and how one cannot trust any web search engine to present the full picture. Phil is obviously extremely concerned about this issue and his talk spurred discussion amongst the audience about how user education is essential to counter the usual viewpoint of ‘it’s on Google, it must be true’.

Coincidentally, Filip Radlinski of Google started the next session, describing a model for conversational information retrieval. He spoke about how the user and IR system reveal information about themselves as the conversation progresses, how the system may need a memory of past interactions and how it may present a set of potential answers. This is a useful model for the future, although most current ‘conversational’ systems are simplistic. Fabrizio Silvestri then spoke on the various types of search Facebook provides, mostly related to finding people but also images, video and news. He explained how every search operation needs to consider privacy and how Facebook use query rewriting to expand and enhance the terms provided by the user. Nicola Cancedda of Microsoft was next with a talk on automated query extraction from emails, to help the user find and attach relevant documents in response (for example, after a colleague asks ‘can you send me the cost projections for 2017’). This work involves training machine learning models after extracting candidate terms with high TF/IDF values from the email. [Interestingly this reminded me of work I carried out nearly 20 years ago on an email signature that when clicked would search for content relevant to the email – although this relied on JavaScript working in an email client which is rather a security problem!].
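
To make the candidate-term step concrete, here is a minimal sketch of picking the highest-scoring TF/IDF terms from an email. It is not the system described in the talk: the tokeniser, the smoothed IDF formula, the function names and the tiny background corpus are all assumptions chosen for illustration, and the machine learning stage that follows is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_tfidf_terms(email, background_docs, k=3):
    """Return the k terms of `email` with the highest TF/IDF score.

    Document frequencies come from `background_docs`, a stand-in for
    whatever corpus a real system would use.
    """
    docs = [set(tokenize(d)) for d in background_docs]
    n = len(docs)
    tf = Counter(tokenize(email))
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        idf = math.log((n + 1) / (df + 1)) + 1.0  # smoothed IDF (an assumption)
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    background = [
        "can you send me the report",
        "thanks for the update, can you call me",
        "send the invoice for the order",
    ]
    email = "Can you send me the cost projections for 2017"
    print(top_tfidf_terms(email, background))
```

Terms that are rare in the background corpus (here 'cost', 'projections' and '2017') float to the top, which is what makes them plausible search terms for finding the requested document.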

Last of our scheduled talks was from Mark Stanger of Search Technologies (recently acquired by Accenture) about their work on Elsevier’s DataSearch platform. He described how they developed a Phrase Service that identifies phrases in the user’s query using various methods including acronym detection, dictionary lookup and natural language processing, then expands these phrases as necessary to provide enhanced search. After identifying these key terms they can be boosted appropriately for search (DataSearch itself is based on Solr).

The DataSearch project is impressive, and later on it won the Best Search Project award (I am proud to say I served as part of the judging panel for these awards this year). The other winner, for Most Promising Search Startup, was Search|hub by CXP Commerce Experts GmbH.

We finished with some lightning talks and a brief Fishbowl session, dominated this time by discussions on Fake News and how it affects the world of search technology. Thanks to the BCS IRSG again for a fascinating and enlightening day.

The post Search Solutions 2017 review appeared first on Flax.

London Lucene/Solr Meetup – Learning to Rank and Hibernate Search

Charlie Hull — Wed, 24 Feb 2016 10:49:38 +0000

Back to the very impressive Bloomberg lecture theatre for this month’s Lucene/Solr Meetup, with a good turnout (I’m guessing 60-70 people). Our first talk came from Diego Ceccarelli of Bloomberg on how his team have created a Solr implementation of Learning to Rank, an improved way to rank search results using machine learning. Diego first took us through the basics of Lucene’s ranking methods, based on the venerable TF/IDF algorithm (although note that BM25 will be the default very soon). Bloomberg’s implementation first retrieves 1000 search results using standard TF/IDF (which is fast) and then extracts ‘features’ (a simple example might be ‘does the title match the search query?’) which are then fed to a machine learning model. This model is then used to re-rank the 1000 initial results and the top 10 supplied to the user. Interestingly, they have chosen to implement the features as Lucene queries, allowing for easy re-use. Initial tests have shown some metrics such as ‘clicks on the first result’ up by 10%, which is encouraging. There is now a Solr patch (SOLR-8542) which they hope to commit to Solr soon, and you can find slides and a video of a previous presentation on this topic online. I first heard about Learning to Rank from Microsoft Research some years ago and it’s great to see an open source implementation.
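
The retrieve-then-re-rank flow described above can be sketched in a few lines. This is only an illustration of the general pattern, not Bloomberg’s plugin: the first-pass scorer is a plain term count standing in for TF/IDF, the two features and the hand-set linear weights are hypothetical, and a real model would be learned from judgement or click data.

```python
def term_overlap(query, doc):
    """Cheap first-pass score: how often query terms occur in the body."""
    body = doc["body"].lower().split()
    return float(sum(body.count(t) for t in query.lower().split()))


def title_match(query, doc):
    """Feature: does the title contain the whole query string?"""
    return 1.0 if query.lower() in doc["title"].lower() else 0.0


# Hypothetical feature set and weights, for illustration only.
FEATURES = [title_match, term_overlap]
WEIGHTS = [5.0, 1.0]


def linear_model(vector):
    """Stand-in for a trained ranking model: a weighted sum of features."""
    return sum(w * x for w, x in zip(WEIGHTS, vector))


def rerank(query, docs, window=1000, top_n=10):
    """Take the best `window` docs by the cheap score, then re-rank them."""
    initial = sorted(docs, key=lambda d: term_overlap(query, d), reverse=True)
    initial = initial[:window]
    rescored = [(linear_model([f(query, d) for f in FEATURES]), d) for d in initial]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in rescored[:top_n]]


if __name__ == "__main__":
    docs = [
        {"id": "a", "title": "Market news", "body": "solr solr solr ranking"},
        {"id": "b", "title": "Solr ranking explained", "body": "solr ranking"},
        {"id": "c", "title": "Cooking", "body": "pasta"},
    ]
    print([d["id"] for d in rerank("solr ranking", docs, top_n=2)])
```

The expensive model only ever sees the documents that survive the cheap first pass, which is why the two-stage design stays fast.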

Next Sanne Grinovero of Red Hat talked about Hibernate Search, an implementation of full-text search for users of this Java ORM. He gave us some great examples of how relational databases can be bad at full text search and thus the need for a full-text engine like Lucene. His implementation hides some of the finer details of Lucene but allows use of advanced Lucene API calls where necessary, and automatically keeps the Lucene index in sync with a relational database. A simple query DSL is available which he demonstrated in use for indexing and querying Twitter data. He then told us about Infinispan, a highly scalable key-value store which can also be used for storing Lucene indexes and mentioned ongoing work to add Elasticsearch and Solr integration.

We finished with a brief informal Q&A session outside; thanks to both presenters and to my co-hosts at Bloomberg for helping to organise the event. We hope to run another Meetup in a couple of months – as ever, offers of talks, a venue and sponsorship of snacks & drinks are very welcome!

The post London Lucene/Solr Meetup – Learning to Rank and Hibernate Search appeared first on Flax.

London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg

Charlie Hull — Wed, 16 Dec 2015 16:21:32 +0000

This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but crammed in three great talks and attracted a wide range of people from both academia and business. We started with Gabriella Kazai of Lumi, talking about how they have built a crowd-curated content platform for around 80,000 users whose interests and recommendations are mined so as to recommend content to others. Using Elasticsearch as a base, the system ingests around 100 million tweets a day and follows links to any quoted content, which is then filtered and analyzed using a variety of techniques including NLP and NER to produce a content pool of around 60,000 articles. I’ve been aware of Lumi since our ex-colleague Richard Boulton worked there but it was good to understand more about their software stack.

Next was Miguel Martinez-Alvarez of Signal, who are also dealing with huge amounts of data on a daily basis – over a million documents a day from over 100,000 sources plus millions of blogs. Their ambition is to analyse “all the world’s news” and allow their users to create complex queries over this – “all startups in London working on Machine Learning” being one example. Their challenges include dealing with around two-thirds of their ingested news articles being duplicates (due to syndicated content for example) and they have built a highly scalable platform, again with Elasticsearch a major part. Miguel talked in particular about how Signal work closely with academic researchers (including Professor Udo Kruschwitz of the University of Essex, with whom I will be collaborating next year) to develop cutting-edge analytics, with an Agile Data Science approach that includes some key evaluation questions e.g. Will it scale? Will the accuracy gain be worth the extra computing power?
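
The talk did not go into how Signal spot duplicates, so the following is only a sketch of one common approach to syndicated near-duplicates: comparing sets of word shingles with Jaccard similarity. The shingle size, the threshold and the function names are assumptions picked for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ('shingles') in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.6):
    """Treat two articles as duplicates if enough shingles are shared."""
    return jaccard(a, b) >= threshold


if __name__ == "__main__":
    original = "the central bank raised interest rates by a quarter point on thursday"
    syndicated = "the central bank raised interest rates by a quarter point on friday"
    unrelated = "local team wins the cup final after extra time"
    print(is_near_duplicate(original, syndicated))
    print(is_near_duplicate(original, unrelated))
```

At a million documents a day a pairwise comparison like this would not scale on its own; it is usually combined with hashing schemes so that only likely candidates are compared.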

Our last talk was from Miles Osborne of our hosts Bloomberg, who have recently signed a deal with Twitter to be able to ingest all past and forthcoming tweets – now that’s Big Data! The object of Miles’ research is to identify tweets that might affect a market and can thus be traded on, as early as possible after an event happens. His team have noticed that these tweets are often well-written (as opposed to the noise and abbreviations in most tweets) and seldom re-tweeted (no point letting your competitors know what you’ve spotted). Dealing with 500m tweets a day, they have developed systems to filter and route tweets into topic streams (which might represent a subject, location or bespoke category) using machine learning. One approach has been to build models using ‘found’ data (i.e. data that Bloomberg already has available) and to pursue a ‘simple is best’ methodology – although one model has 258 million features! Encouragingly, the systems they have built are now ‘good enough’ to react quickly enough to a crisis event that might significantly affect world markets.

We finished with networking, drinks and snacks (amply provided by our generous hosts) and I had a chance to catch up with a few old contacts and friends. Thanks to the organisers for a very interesting evening and the last event of this year for me – see you in 2016!

The post London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg appeared first on Flax.

Out and about in search & monitoring – Autumn 2015

Charlie Hull — Wed, 16 Dec 2015 10:24:42 +0000

It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress with our client Infomedia about how we’ve helped them to migrate their media monitoring platform from the elderly, unsupported and hard to scale Verity software to an open source system based on our own Luwak library. We also replaced Autonomy IDOL with Apache Solr and helped Infomedia develop their own in-house query language, to prevent them becoming locked-in to any particular search technology. Indexing over 75 million news stories and running over 8000 complex stored queries over every new story as it appears, the new system is now in production and Infomedia were kind enough to say that ‘Flax’s expert knowledge has been invaluable’ (see the slides here). We celebrated after our talk at a spectacular Bollywood-themed gala dinner organised by Ninestars Global.
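
For readers unfamiliar with stored search, the idea of running thousands of saved queries over each incoming story can be sketched as follows. This is a toy illustration and not Luwak’s API (Luwak is a Java library): queries are reduced to sets of required terms, and each is indexed under a single term so that only plausible candidates are fully checked against a new document. All names here are hypothetical.

```python
from collections import defaultdict


class StoredQueryMatcher:
    """Toy stored-search matcher: each query is a set of required terms."""

    def __init__(self):
        self.queries = {}                 # query id -> frozenset of required terms
        self.by_term = defaultdict(set)   # one indexed term -> query ids

    def register(self, query_id, terms):
        required = frozenset(t.lower() for t in terms)
        if not required:
            raise ValueError("a stored query needs at least one term")
        self.queries[query_id] = required
        # Index the query under one of its terms only. A document lacking
        # that term cannot match, so the query is never even considered.
        self.by_term[min(required)].add(query_id)

    def match(self, document):
        """Return the ids of all stored queries matched by the document."""
        doc_terms = set(document.lower().split())
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Full check only for the pre-selected candidates.
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)


if __name__ == "__main__":
    matcher = StoredQueryMatcher()
    matcher.register("q1", ["solr", "lucene"])
    matcher.register("q2", ["elasticsearch"])
    matcher.register("q3", ["solr", "verity"])
    print(matcher.match("Apache Solr is built on Lucene"))
```

The pre-selection step is what keeps matching cheap as the number of stored queries grows: most queries are ruled out without being evaluated at all.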

The week after I spoke at the Elasticsearch London Meetup with our client Westcoast on how we helped them build a better product search. Westcoast are the UK’s largest privately owned IT supplier and needed a fast and scalable search engine they could easily tune and adjust – we helped them build administration systems allowing boosts and editable synonym lists and helped them integrate Elasticsearch with their existing frontend systems. However, integrating with legacy systems is never a straightforward task and in particular we had to develop our own custom faceting engine for price and stock information. You can find out more in the slides here.

Search Solutions, my favourite search event of the year, was the next day and I particularly enjoyed hearing about Google’s powerful voice-driven search capabilities, our partner UXLab‘s research into complex search strategies and Digirati and Synaptica‘s complementary presentations on image search and the International Image Interoperability Framework (a standard way to retrieve images by URL). Tessa Radwan of our client NLA media access spoke about some of the challenges in measuring similar news articles (for example, slightly rewritten for each edition of a daily newspaper) as part of the development of the new version of their Clipshare system, a project we’ve carried out over the last year or so. I also spoke on Test Driven Relevance, a theme I’ll be expanding on soon: how we could improve how search engines are tested and measured (slides here).
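
One way to picture test-driven relevance is as a set of assertions of the form ‘for this query, this document should appear in the top N’. The sketch below is not taken from the slides: the in-memory catalogue, the toy search function and the test-case format are all hypothetical, and in practice `search` would call a real engine.

```python
# Hypothetical product catalogue standing in for a real index.
CATALOGUE = {
    "p1": "laptop charger 65w",
    "p2": "laptop sleeve 13 inch",
    "p3": "usb cable",
}


def toy_search(query):
    """Rank catalogue entries by how many query terms they contain."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in CATALOGUE.items():
        words = text.split()
        score = sum(words.count(t) for t in terms)
        if score:
            scored.append((-score, doc_id))
    return [doc_id for _, doc_id in sorted(scored)]


def run_relevance_tests(search, cases):
    """Check each (query, expected id, max rank) case; return the failures."""
    failures = []
    for query, expected_id, max_rank in cases:
        results = search(query)[:max_rank]
        if expected_id not in results:
            failures.append((query, expected_id, results))
    return failures


if __name__ == "__main__":
    cases = [
        ("laptop charger", "p1", 1),
        ("usb cable", "p3", 1),
        ("laptop", "p3", 2),  # deliberately failing case
    ]
    for query, expected, got in run_relevance_tests(toy_search, cases):
        print(f"FAIL: {query!r} expected {expected} in {got}")
```

Re-running such a suite after every configuration change shows at once whether a tweak that fixes one query has quietly broken another.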

Thanks to the organisers of all these events for all their efforts and for inviting us to talk: it’s great to be able to share our experiences building search engines and to learn from others.

The post Out and about in search & monitoring – Autumn 2015 appeared first on Flax.

Elasticon London 2015 – more products, more scale, more users!

Charlie Hull — Mon, 09 Nov 2015 11:49:58 +0000

Last week Elastic, the company behind Elasticsearch, landed in London for one of their current series of one-day events. The £50 entrance fee has been put to good use, raising £16,750 for AbilityNet who work on accessible IT – a very generous offer by Elastic.

Shay Banon, creator of Elasticsearch, kicked off with a brief history of the project which started when he built the Compass search engine, pretty much as a hobby project while his wife was training as a chef in London. Things have moved on somewhat: today there is a 35,000 strong community with over 35 million downloads of the Elasticsearch software and a number of high-profile users including NASA, Wikimedia and Verizon (who apparently have an impressive 500 billion items indexed).

Clinton Gormley led the next session, talking about new features in the recent 2.0 release. Resiliency, performance and analytics were major themes, with the latter leveraging Lucene’s DocValues as an off-heap column store to build various prediction and detection capabilities. Also mentioned was a new scriptable Ingest Node incorporating parts of the Logstash project. Steve Mayzak then told us about the new version 4 of the Kibana visualisation package, which has now grown into a general UI framework incorporating D3.js for charting and providing an extension API. Shay returned to tell us more about Logstash, which provides over 200 plugins for ingesting data into Elasticsearch. Next up was Uri Boness telling us about the various closed-source parts of the Elasticsearch ecosystem (including the Marvel performance monitor and Shield security module) and we then heard from Morten Ingebrigtsen of Found (a hosted Elasticsearch solution, who Elastic acquired a while ago). For me the most interesting item here was news of an on-premise version of Found Premium – yes, like Lucidworks Fusion, you can now buy a packaged open source search engine from Elastic as a product. This isn’t something we generally recommend as it does remove one of the key advantages of open source, which is the lack of vendor lock-in, but it’s interesting to see Elastic plough such a familiar furrow.

The afternoon consisted of case studies including The Guardian (which I’ve written about previously), a good talk from Jay Chin on using Elasticsearch for Grid Computing for the financial services sector and a couple of use cases from Goldman Sachs. We also heard about the elasticsearch-hadoop connector – note that for high-performance indexing this may not be the best option. I missed a couple of the other talks due to a phone call but returned to hear Shay again, with a controversial statement that ‘the top 8 Lucene committers now work for Elastic’ – how exactly are you measuring that and have you told the other committers? He did however conclude reassuringly with ‘we’re not trying to force anyone to use commercial versions [of Elasticsearch]’ – good to hear!

By the way, if you want to hear how we helped a billion-pound UK IT supplier use Elasticsearch for their e-commerce website, we’ll be presenting with them at the Elasticsearch London Meetup later this month.

The post Elasticon London 2015 – more products, more scale, more users! appeared first on Flax.

Talks: Replacing Autonomy IDOL with Solr, Elasticsearch for e-commerce & relevancy tuning

Charlie Hull — Wed, 04 Nov 2015 11:48:33 +0000

I’ll be speaking at several events over the next few weeks, in the UK and abroad. On the 19th of November I’ll be at the FIBEP World Media Intelligence Congress in Vienna, to talk about how we helped our client Infomedia migrate from a closed-source search engine (Autonomy IDOL and Verity) to a new platform based on Apache Lucene/Solr and our own Luwak stored search library. Infomedia are Denmark’s leading provider of media monitoring and analysis and wanted to future-proof their search platform: we’ll talk about how open source makes this possible, how we implemented stored search and handled highly complex queries, and how the new platform is scalable and flexible.

On the 25th I’ll be presenting at the London Elasticsearch Usergroup with our client Westcoast, who we have been helping with an Elasticsearch implementation. Westcoast are a B2B supplier of electronics and white goods with yearly revenues of over £1 billion, and we’ve helped them implement a powerful new search engine for their website. E-commerce is one sector where good search is an essential part of driving revenue.

Next, on the 26th I’ll be talking at one of my favourite events of the year, the British Computer Society Information Retrieval Specialist Group’s Search Solutions, on how we might improve how search engine relevance is tested. I’ll suggest a more formal process of test-based relevance tuning and show some useful tools. Our client NLA media access are also talking about the new Clipshare platform we built on Apache Lucene/Solr.

Do let me know if you’re attending and would like to chat – I’ll also be publishing slides and more information about the projects above soon.

The post Talks: Replacing Autonomy IDOL with Solr, Elasticsearch for e-commerce & relevancy tuning appeared first on Flax.