hadoop – Flax (http://www.flax.co.uk): The Open Source Search Specialists

Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem
http://www.flax.co.uk/blog/2016/03/03/working-hadoop-kafka-samza-wider-big-data-ecosystem/
Thu, 03 Mar 2016 10:01:00 +0000

The post Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem appeared first on Flax.

We’ve been working on a number of projects recently involving open source software often described as ‘Big Data’ solutions – here’s a quick overview of them.

The grandfather of them all, of course, is Apache Hadoop, now not so much a single project as an ecosystem including storage and processing for potentially huge amounts of data, spread across clusters of machines. Interestingly, Hadoop was originally created by Doug Cutting, who also wrote Lucene (the search library used by Apache Solr and Elasticsearch) and the Nutch web crawler. We’ve been helping clients to distribute processing tasks using Hadoop’s MapReduce framework and to speed up their indexing from Hadoop into Elasticsearch. Other projects we’ve used in the Hadoop ecosystem include Apache ZooKeeper (used to coordinate lots of Solr servers into a distributed SolrCloud) and Apache Spark (for distributed processing).

We’re increasingly using Apache Kafka (a message broker) for handling large volumes of streaming data, for example log files. Kafka provides persistent storage of these streams, which might be ingested and pre-processed using Logstash, then indexed with Elasticsearch and visualised with Kibana to build high-performance monitoring systems. Throughput of thousands of items a second is not uncommon, and these open source systems can easily match the performance of proprietary monitoring engines such as Splunk at a far lower cost. Apache Samza, a stream processing framework, is built on Kafka and we’ve used it to build a powerful system for full-text search over streams. Note that Elasticsearch has a similar ‘stored search’ feature called Percolator, but this is quite a lot slower (as others have confirmed).
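To make the ‘stored search’ idea concrete, here is a minimal sketch in Python. It is not our Samza-based system and it does not use the Percolator API: the class `StoredSearches`, the helper `tokenize` and the rule that a query matches when all of its terms appear in a message are all assumptions made for illustration. The point is only that the usual direction is reversed: the queries are stored, and each incoming item in the stream is run against them.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class StoredSearches:
    """Hypothetical in-memory registry of stored queries (illustration only)."""

    def __init__(self):
        self._queries = {}  # query id -> set of terms that must all be present

    def register(self, query_id, query_text):
        """Store a query so that later messages can be matched against it."""
        self._queries[query_id] = set(tokenize(query_text))

    def match(self, message):
        """Return the ids of every stored query whose terms all occur in the message."""
        terms = set(tokenize(message))
        return sorted(
            query_id
            for query_id, required in self._queries.items()
            if required <= terms
        )


if __name__ == "__main__":
    searches = StoredSearches()
    searches.register("disk-alerts", "disk full")
    searches.register("auth-alerts", "login failed")
    # Each log line arriving on the stream is checked against all stored queries.
    for line in ["ERROR: Disk /dev/sda1 is full", "user login failed for bob"]:
        print(line, "->", searches.match(line))
```

A real system has to do this for very large numbers of stored queries at high throughput, which is where the choice of engine starts to matter.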

Most of the above systems are written in Java, and those that are not still run on the Java Virtual Machine (JVM), so our experience building large, performant and resilient systems on this platform has been invaluable. We’ll be writing in more detail about these projects soon. I’ve always said that search experts have been dealing with Big Data since well before it gained popularity as a concept – so if you’re serious about Big Data, ask us how we could help!

Flax Newsletter November 2015
http://www.flax.co.uk/blog/2015/11/10/flax-newsletter-november-2015/
Tue, 10 Nov 2015 11:03:43 +0000

The post Flax Newsletter November 2015 appeared first on Flax.

In this month’s Flax Newsletter:

  • Building an open source search team is hard – let us help with training & mentoring on Solr and Elasticsearch
  • RS Components: Flax & Quepid help us to make “crucial” data driven decisions for tuning search
  • 40x faster indexing with Elasticsearch for Hadoop – over a gigabyte per second!

As Hadoop gains, does Lucene benefit?
http://www.flax.co.uk/blog/2014/03/27/as-hadoop-gains-does-lucene-benefit/
Thu, 27 Mar 2014 17:21:11 +0000

The post As Hadoop gains, does Lucene benefit? appeared first on Flax.

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in an HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There have also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!

Cambridge Search Meetup – six degrees of ontology and Elasticsearching products
http://www.flax.co.uk/blog/2014/03/07/cambridge-search-meetup-six-degrees-of-ontology-and-elasticsearching-products/
Fri, 07 Mar 2014 10:40:39 +0000

The post Cambridge Search Meetup – six degrees of ontology and Elasticsearching products appeared first on Flax.

Last Wednesday evening the Cambridge Search Meetup was held with two very different talks – we started with Zoë Rose, an information architect who has lent her expertise to Proquest, the BBC and now the UK Government. She gave an engaging talk on ontologies, showing how they can be useful for describing things that don’t easily fit into traditional taxonomies and how they can even be used for connecting Emperor Hirohito of Japan to Kevin Bacon in less than six steps. Along the way we learnt about sea creatures that lose their spines, Zoë’s very Australian dislike of jellyfish and other stinging sea dwellers and her own self-cleaning fish tank at home.

As search developers, we’re often asked to work with both taxonomies and ontologies, and the challenge is how to represent them in a flat, document-focused index – perhaps ontologies are better represented by linked data stores such as the one provided by Apache Marmotta.

Next was Jurgen Van Gael of Rangespan, a company that provides an easy way for retailers to expand their online inventory beyond what is available in brick-and-mortar stores (customers include Tesco, Argos and Staples). Jurgen described how product data is gathered into MongoDB and MySQL databases, processed and cleaned up on an Apache Hadoop cluster and finally indexed using Elasticsearch to provide a search application for Rangespan’s customers. Although indexing of 50 million items takes only 75 minutes, most of the source data is only updated daily. Jurgen described how hierarchical facets are available and also how users may create ‘shortlists’ of products they may be interested in – which are stored directly in Elasticsearch itself, acting as a simple NoSQL database. For me one of the interesting points from his talk was why Elasticsearch was chosen as a technology – it was tried during a hack day, found to be very easy to scale and to get started with, and then quickly became a significant part of their technology stack. Some years ago implementing this kind of product search over tens of millions of items would have been a significant challenge (not to mention very expensive) – with Elasticsearch and other open source software this is simply no longer the case.
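For readers unfamiliar with hierarchical facets, the sketch below shows one common way to think about them. It is not Rangespan’s implementation and it uses no Elasticsearch API; the function name, the ` > ` separator and the sample categories are assumptions for illustration. Each product carries a category path, and every prefix of that path is counted, so a user can drill down from a top-level category to its children.

```python
from collections import Counter


def hierarchical_facet_counts(category_paths, separator=" > "):
    """Count products under every level of their category path.

    A product in "Home > Kitchen > Kettles" is counted under "Home",
    "Home > Kitchen" and "Home > Kitchen > Kettles".
    """
    counts = Counter()
    for path in category_paths:
        parts = path.split(separator)
        for depth in range(1, len(parts) + 1):
            counts[separator.join(parts[:depth])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical product categories, one path per product.
    products = [
        "Home > Kitchen > Kettles",
        "Home > Kitchen > Toasters",
        "Home > Garden",
    ]
    for facet, count in sorted(hierarchical_facet_counts(products).items()):
        print(facet, count)
```

In a search engine the same effect is usually achieved by indexing each path prefix as a separate value of a facet field, so that ordinary facet counting produces the tree.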

Networking and socialising continued into the evening, with live music in the pub downstairs. Thanks to everyone who came and in particular our two speakers. We’ll be back with another Meetup soon!

Finding the elephant in the room: open source search & Hadoop grow closer together
http://www.flax.co.uk/blog/2013/09/18/finding-the-elephant-in-the-room-open-source-search-hadoop-grow-closer-together/
Wed, 18 Sep 2013 10:56:55 +0000

The post Finding the elephant in the room: open source search & Hadoop grow closer together appeared first on Flax.

I’ve been lucky enough to attend two talks on Hadoop in the last few weeks, which have made me take a closer look at this technology. In case you didn’t know, Hadoop is an Apache top level open source project comprising a framework for distributed computing and storage, originally created by Doug Cutting (also the creator of Apache Lucene), which grew out of the Nutch project around 2005 and was developed further at Yahoo!. Distributed computing is carried out using MapReduce (roughly speaking, the ‘map’ bit involves splitting a processing task up into chunks and distributing these among various processing nodes, the ‘reduce’ bit brings all the results together again) and the storage uses the Hadoop Distributed File System (HDFS). There are other parts of Hadoop including a database (HBase), data warehouse with SQL-like language (Hive), scripting language (Pig) and more.
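The ‘map’ and ‘reduce’ steps are easier to picture with a toy example. The sketch below counts words in plain Python in a single process; it does not use any Hadoop API, and the function names and chunk size are made up for illustration. On a real cluster each chunk would be handled by a different node and the framework would take care of moving the intermediate results around.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(lines, chunk_size):
    """Split the input into chunks, one per (simulated) processing node."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """'Map' step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_pairs(pairs):
    """'Reduce' step: bring the results together by summing the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # On a cluster these map calls would run in parallel on different nodes.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_pairs(chain.from_iterable(mapped))


if __name__ == "__main__":
    lines = ["search meets Hadoop", "Hadoop stores data", "search the data"]
    print(word_count(lines))
```

Because each chunk is mapped independently, the result does not depend on how the input is split, which is what makes the work easy to spread across machines.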

Those I’ve spoken to who have attempted to build applications on Hadoop have said that it’s very much a kit of parts rather than an integrated platform, so not that easy to get started with – which has led to the emergence of various vendors providing ‘curated’ distributions and support, much as LucidWorks does for Apache Lucene/Solr. Cloudera, Hortonworks and MapR are some of the best-known of these vendors. With everyone jumping on the Big Data bandwagon these days, some of these vendors have attracted significant interest and funding.

As you might expect, full-text search is often required for these distributed systems and there have been various attempts to bring Hadoop and search closer together. Hortonworks support integration with Elasticsearch, although this currently appears to mean that you can use Hive or Pig to move data between Hadoop and a separate Elasticsearch cluster, rather than the search engine running on the cluster itself. Cloudera’s integration of Hadoop with Solr appears to be tighter, with Solr storing its indexes on HDFS directly (perhaps not surprising considering Lucene/Solr committer Mark Miller, who is responsible for most recent SolrCloud development, works for Cloudera). Cloudera even has its own data conditioning framework Flume (yes, it seems we need yet another data conditioning/pipelining solution!) and allows for distributed indexing. MapR have partnered with LucidWorks and integrated LucidWorks Search into their distribution. All these vendors are heavy contributors to Hadoop of course and most also contribute to Lucene/Solr or Elasticsearch.

Since Hadoop has been linked with search from the beginning one can hope that these integration efforts will continue – applications that require distributed search are becoming increasingly common and Hadoop, despite its nature as a kit of parts requiring assembly, is a good foundation to build on.

Cambridge Search Meetup – Search for publication success and low-cost apps
http://www.flax.co.uk/blog/2012/10/18/cambridge-search-meetup-search-for-publication-success-and-low-cost-apps/
Thu, 18 Oct 2012 09:45:45 +0000

The post Cambridge Search Meetup – Search for publication success and low-cost apps appeared first on Flax.

After a short break the Cambridge Search Meetup returned last night with our usual mix of presentations, questions, networking, beer and snacks. We had a few issues with the projector and cables (one of these is on the shopping list for next time) so thanks to both presenters and audience for their patience!

First up was Liang Shen with a description of Journal Selector, a system for helping those publishing academic papers to find the correct journals to approach. The system allows one to copy and paste a chunk of a paper to a website and find which journals best match the subject matter, based on what they have published in the past. Running on the Amazon EC2 cloud the service indexes journals from feeds, HTML webpages and other sources, processes and stores this data in Amazon’s Hadoop-compatible database, indexes it with Apache Solr and then presents the results via the Drupal CMS. The results are impressive, allowing users to see exactly on what basis the system has recommended a journal to approach. You can see the presentation slides here.

Next was Rich Marr, who bravely offered to live-code a demonstration of his low-cost prototyping methodology for startups needing both NoSQL data storage and search across this data. In only 20 lines or so of code he showed us how to use Node.js to build a simple server that could accept messages (over Telnet, although HTTP or even IMAP would be as easy), store them in a CouchDB database and index them for searching (using a different message) with Elasticsearch. Rich’s demo prompted a lively discussion of how commoditized and componentized search technology is becoming, with open source components that allow one to build a prototype search engine in minutes.
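The shape of such a prototype can be sketched in a few lines. What follows is not Rich’s code: it is a Python stand-in in which a plain dictionary plays the part of the CouchDB document store and a small inverted index plays the part of Elasticsearch, and the class name `MessageStore` and its methods are hypothetical. It shows the two operations the demo combined: store each incoming message, and index it so that a later query can find it.

```python
import re
from collections import defaultdict


class MessageStore:
    """Toy stand-in for a document store plus a search index (illustration only)."""

    def __init__(self):
        self._documents = {}            # plays the role of the NoSQL database
        self._index = defaultdict(set)  # plays the role of the search engine: term -> ids

    def add(self, text):
        """Store a message and index each of its terms; return the new message id."""
        doc_id = len(self._documents) + 1
        self._documents[doc_id] = text
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self._index[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return the stored messages that contain every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        ids = set.intersection(*(self._index.get(term, set()) for term in terms))
        return [self._documents[doc_id] for doc_id in sorted(ids)]


if __name__ == "__main__":
    store = MessageStore()
    store.add("Meetup at the pub tonight")
    store.add("Projector cable needed for the meetup")
    print(store.search("projector meetup"))
```

Swapping the dictionary and the index for real services is then a matter of replacing the bodies of `add` and `search` with client calls.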

Thanks to both our speakers – and the Meetups continue, with Rich Marr’s own London Open Source Search Social meeting on Tuesday 23rd October, and in Cambridge the Data Insights Meetup where I’ll be talking on November 1st.
