migration – Flax http://www.flax.co.uk The Open Source Search Specialists Thu, 10 Oct 2019 09:03:26 +0000

Out with the old – and in with the new Lucene query parser?
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
Fri, 13 May 2016 12:41:52 +0000

The post Out with the old – and in with the new Lucene query parser? appeared first on Flax.

Over the years we’ve dealt with quite a few migration projects where the query syntax of the client’s existing search engine must be preserved. This might be because other systems (or users) depend on it, or because a large number of stored expressions exist and it is difficult or uneconomic to translate them all by hand. Our usual approach is to write a query parser that understands the current syntax but produces a query suitable for a modern open source search engine based on Apache Lucene. We’ve done this for legacy engines including dtSearch and Verity and also for in-house query languages developed by clients themselves. This allows you to keep the existing syntax but improve the performance, scalability and accuracy of your search engine.

There are a few points to note during this process:

  • What appears to be a simple query in your current language may not translate to a simple Lucene query, which may lead to performance issues if you are not careful. Wildcards for example can be very expensive to process.
  • You cannot guarantee that the new search system will return exactly the same results, in the same order, as the old one, no matter how carefully the query parser is designed. After all, the underlying search engine algorithms are different.
  • Some element of manual translation may be necessary for particularly large, complex or unusual queries, especially if the original intention of the person who wrote the query is unclear.
  • You may want to create a vendor-neutral query language as an intermediate step – so you can migrate more easily next time. We’ve done this for Danish media monitors Infomedia.
  • If you have particularly large and/or complex queries that may have been added to incrementally over time, they may contain errors or logical inconsistencies – which your current engine may not be telling you about! If you find these you have two choices: fix the query expression (which may then give you slightly different results) or make the new system give the same (incorrect) results as before.
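
To make the idea of a translating parser concrete, here is a minimal sketch in Python. It assumes a dtSearch-style proximity operator (`apple w/5 pear`) and rewrites it into the sloppy-phrase form of the Lucene classic query syntax (`"apple pear"~5`). The function name `translate` and the tiny subset of operators handled are our own assumptions for illustration; a real parser would use a proper grammar and build Lucene query objects directly, and the two proximity operators are not guaranteed to match exactly the same documents.

```python
import re

# Assumed legacy proximity form: <term> w/<distance> <term>
_PROXIMITY = re.compile(r"(\w+)\s+w/(\d+)\s+(\w+)", re.IGNORECASE)
# Boolean keywords in any case; Lucene's classic syntax expects them in upper case.
_BOOLEAN = re.compile(r"\b(and|or|not)\b", re.IGNORECASE)


def translate(legacy_query: str) -> str:
    """Translate a tiny, hypothetical subset of a legacy syntax to Lucene classic syntax."""
    # 'apple w/5 pear' becomes the sloppy phrase "apple pear"~5
    query = _PROXIMITY.sub(
        lambda m: f'"{m.group(1)} {m.group(3)}"~{m.group(2)}', legacy_query
    )
    # 'and', 'or', 'not' become AND, OR, NOT
    return _BOOLEAN.sub(lambda m: m.group(1).upper(), query)


if __name__ == "__main__":
    print(translate("apple w/5 pear and not banana"))
```

Even this toy version shows why simple-looking legacy queries need care: each operator has to be mapped to something with comparable meaning and acceptable cost in the new engine.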

To mitigate these issues it is important to decide on a test set of queries and expected results, and what level of ‘correctness’ is required – bearing in mind 100% is going to be difficult if not impossible. If you are dealing with languages outside the experience of the team you should also make sure you have access to a native speaker – so you can be sure that results really are relevant!
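
One way to put a number on ‘correctness’ is to compare the top results from the old and new engines for each query in the test set. The sketch below is only an illustration under our own assumptions: results are lists of document IDs in rank order, the measure is simple overlap of the top `k` results, and the names `overlap_at_k`, `failing_queries` and the 0.8 threshold are hypothetical rather than taken from any particular project.

```python
from typing import Dict, List, Tuple


def overlap_at_k(old: List[str], new: List[str], k: int = 10) -> float:
    """Fraction of the old engine's top-k document IDs that the new engine also returns in its top-k."""
    old_top, new_top = set(old[:k]), set(new[:k])
    if not old_top:
        # Nothing expected: only an empty new result counts as agreement.
        return 1.0 if not new_top else 0.0
    return len(old_top & new_top) / len(old_top)


def failing_queries(
    expected: Dict[str, List[str]],
    actual: Dict[str, List[str]],
    k: int = 10,
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (query, score) pairs that fall below the agreed threshold, worst first."""
    scores = {q: overlap_at_k(ids, actual.get(q, []), k) for q, ids in expected.items()}
    return sorted(((q, s) for q, s in scores.items() if s < threshold), key=lambda p: p[1])


if __name__ == "__main__":
    expected = {"q1": ["a", "b", "c", "d"], "q2": ["a", "b"]}
    actual = {"q1": ["a", "b", "x", "y"], "q2": ["b", "a"]}
    print(failing_queries(expected, actual))
```

Overlap ignores ordering within the top `k`, which suits the point above that identical ranking cannot be guaranteed; a stricter project could swap in a rank-aware measure.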

Do let us know if you’re planning this kind of migration and how we can help – building Lucene query parsers is not a simple task and some past experience can be invaluable.

Out and about in January and February
http://www.flax.co.uk/blog/2015/01/27/out-and-about-in-january-and-february/
Tue, 27 Jan 2015 11:08:39 +0000

The post Out and about in January and February appeared first on Flax.

We’re speaking at a couple of events soon: if you’re in London and interested in Apache Lucene/Solr we’re also planning another London User Group Meetup soon.

Firstly my colleague Alan Woodward is speaking with Martin Kleppmann at FOSDEM in Brussels (31st January-1st February) on Searching over streams with Luwak and Apache Samza – about some fascinating work they’ve been doing to combine the powerful ‘reverse search’ facilities of our Luwak library with Apache Samza’s distributed, stream-based processing. We’re hoping this means we can scale Luwak beyond its current limits (although those limits are pretty accommodating, as we know of systems where a million or so stored searches are applied to a million incoming messages every day). If you’re interested in open source search the Devroom they’re speaking in has lots of other great talks planned.
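
For readers new to the term, ‘reverse search’ means storing the queries and running each incoming document against them, rather than the other way round. The following is a minimal Python sketch of that general idea only; it is not Luwak’s API, and the class `StoredSearches`, its methods and the all-terms-must-match query model are assumptions made for illustration. It keeps a small term index over the stored queries so that each message is checked only against queries sharing at least one term with it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class StoredSearches:
    """Toy reverse search: each stored query is a set of terms that must all appear."""

    def __init__(self) -> None:
        self._queries: Dict[str, Set[str]] = {}
        self._by_term: Dict[str, Set[str]] = defaultdict(set)

    def register(self, query_id: str, terms: Iterable[str]) -> None:
        required = {t.lower() for t in terms}
        self._queries[query_id] = required
        for term in required:
            self._by_term[term].add(query_id)

    def match(self, message: str) -> List[str]:
        tokens = set(message.lower().split())
        # Pre-filter: only queries sharing at least one term can possibly match.
        candidates: Set[str] = set()
        for token in tokens:
            candidates |= self._by_term.get(token, set())
        # Full check: every required term must be present in the message.
        return sorted(q for q in candidates if self._queries[q] <= tokens)


if __name__ == "__main__":
    searches = StoredSearches()
    searches.register("q1", ["luwak", "samza"])
    searches.register("q2", ["solr"])
    print(searches.match("Luwak meets Samza today"))
```

With a million stored searches, skipping most of them per message is what keeps this kind of system practical, and it is also the part that distributed stream processing would have to partition.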

Next I’m talking about the wider applications of this kind of reverse search in the area of media monitoring, and how open source software in general can help you turn your organisation’s infrastructure upside down, at the Intrateam conference event in Copenhagen from February 24th-26th. Scroll down to find my talk at 11.35 am on Thursday 26th.

If you’d like to meet us at either of these events do get in touch.

Enterprise Search Europe 2014 day 1 – Decisions, research and a Meetup quiz
http://www.flax.co.uk/blog/2014/05/01/enterprise-search-europe-2014-day-1-decisions-research-and-a-meetup-quiz/
Thu, 01 May 2014 15:59:38 +0000

The post Enterprise Search Europe 2014 day 1 – Decisions, research and a Meetup quiz appeared first on Flax.

This year’s Enterprise Search Europe was held near Victoria train station in London and unfortunately coincided with a two day strike on the London Underground – worrying for the organisers, but apart from a few notable absences it didn’t seem to affect the attendance too much. We started with a keynote from Dale Roberts, whose book on Decision Sourcing inspired a talk about a ‘rational decision making model’. When examining traditional relational database applications Dale said ‘if you peer at it long enough you can see the rows and columns’ and his point was that modern consumer social networking applications don’t exhibit this old pattern – so this is where search application designers should look for inspiration. His co-presenter Rooven Pakkiri said that Enterprise Search should attempt to ‘release the information from inside our heads’, which of course social networking might help with, connecting you with colleagues. I’m not sure that one can easily take lessons learnt from consumer applications and apply them to business use, and some later speakers agreed with me, but this was a high-energy and thought-provoking start.

Next I chaired the Open Source track, where we started with Cedric Ulmer of France Labs, who talked about a search application they built for a consultancy business with around 40 employees. Using Apache Solr, Apache ManifoldCF and their own Datafari open source framework they turned this project around very quickly – interestingly, the end clients needed no training to use the new system, which implies a very well designed UI. Our second talk from Ronald Hobbs of Reed Business International described a project on a much larger scale: 100 million documents, 72 business units and up to 190 queries per second – this was originally served by the FAST ESP engine but they moved to an Apache Solr system, replacing the FAST processing pipeline with Search Technologies’ Aspire project. His five steps for an effective migration (Prepare, Get the right tools, Get the right team, Migrate in chunks, Clean up) I can only agree with from our own experience of such projects, including one from FAST ESP to Solr. I was amused by his description of the Apache Zookeeper project as ‘a bipolar manic depressive’, although it seemed this was eventually overcome with a successful deployment on Amazon EC2. Next was Galina Hinova of Intrafind on an aftersales search application for MAN Truck and Bus – again at serious scale (MAN have around 1 billion vehicles in existence with 100-150 documents related to each). Interestingly the Euro6 regulations for emissions and standardized EU terms for automobile parts were direct drivers of the project, with Apache Lucene as the base technology. No longer is open source search just for small-scale projects it seems!

After a short break during which I chatted to John Newton, founder of Documentum and Alfresco, and his team we returned to hear Dan Jackson give a description of how UCL had improved their website search – with a chaotic mix of low quality content and an ‘awful’ content management system, the challenges were myriad but with the help of experts such as our associate Tony Russell-Rose they have made significant improvements. Next was what was to prove a very popular talk from Nick Brown of AstraZeneca on a huge, well-funded project to build applications to support research and development – again, this was at large scale with 75 million documents (including ‘all the patents and all the research papers’). The key here was their creation of many well-targeted ‘apps’ to enable particular uses of the Sinequa search engine they chose for the back end, including mobile apps to help find others in the company (or external to it) who are also working on a particular drug or disease. This presentation showed just what can be achieved if companies really understand the potential of search technology – knowledge sharing and discovery of previously unknown information.

After a short drinks reception we retired to a nearby pub for the combined Cambridge and London Search Meetup – I’d prepared a short quiz (feel free to have a go!) which was won by Tony Russell-Rose’s team. Networking and chatting continued long into the evening, with some people from the wider UK search community also attending.

To be continued! You can see most of the slides here.

Three reasons why your search may be prehistoric
http://www.flax.co.uk/blog/2013/08/05/three-reasons-why-your-search-may-be-prehistoric/
Mon, 05 Aug 2013 10:18:02 +0000

The post Three reasons why your search may be prehistoric appeared first on Flax.

ArnoldIT wondered today why we were bothering to announce an upgrade to the venerable dtSearch engine, when they “weren’t aware of too many people still using that software”. Perhaps it’s time for a quick reality check here – we regularly see clients with search engines that many would consider prehistoric still in active use. Here are some reasons why that might be so:

  • Search isn’t seen as essential. If your accounting software goes down, nobody gets paid: but if the search engine has gradually degraded in accuracy, doesn’t always contain the most recent documents and is generally too hard to use then most of your users will try and find a way around it – they’ll Google for content on the corporate website, dig slowly through the filestores or call up a colleague to ask. Of course, all of this will take time and there’s the risk they won’t find anything useful (or worse, find something inaccurate or out-of-date), but time is only money, surely?
  • The magic has gone. The sharp-suited salesman who told you all the magical things your search engine could do – it could understand concepts, human language and the meaning of life – is a distant memory. Somehow those magical features were never implemented; perhaps the unexpected extra cost put you off (surely the magic came as standard? No?). You’ve also probably turned off a lot of the clever features of your engine as either no-one could understand how to use them, or they affected performance so much that search results took minutes to appear.
  • Upgrading search is hard and expensive. Small changes to the existing engine can cost huge consultancy fees, but if you change supplier you’ll have a whole new team of salesmen to meet, lots more buzzwords to learn and expensive new license fees to pay, and you’ll also have to overhaul your content management system, your metadata, your front ends… Better to leave everything alone, surely?

There are search engines out there, chugging away quietly behind a corporate firewall, whose antiquity would astonish. Any chance of a support contract has long gone as the supplier would prefer it if you upgraded to their latest-and-greatest version – that’s if the supplier still exists at all. However there is always a way to upgrade that reduces the risk and cost – an incremental, agile and open-source based approach will prevent future lock-in to a single supplier and give you more control of the code your search engine depends on. Recently we’ve used this approach to help clients successfully upgrade search applications based on dtSearch, FAST ESP and Oracle and in the near future we’ll be doing the same for clients with several other well-known engines – and a few lost in the mists of time!

Trading-up to open source – a safer route to effective search
http://www.flax.co.uk/blog/2012/12/05/trading-up-to-open-source-a-safer-route-to-effective-search/
Wed, 05 Dec 2012 12:13:42 +0000

The post Trading-up to open source – a safer route to effective search appeared first on Flax.

It hasn’t taken long for some of Autonomy’s rivals to attempt to capitalise on the recent bad PR around HP’s acquisition – OpenText has offered a ‘software trade-in’, Recommind has offered a ‘trade-up’ and Swiss company RSD has offered a free license for their governance software to Autonomy customers. No word yet from Exalead, Oracle (Endeca), Microsoft (FAST) or any of the other big commercial search companies but I’m sure their salespeople are making the most of the situation.

Migrating a search engine from one technology to another is rarely trouble-free: data must be re-indexed, query architectures rewritten, integration with external systems re-done, relevancy checked… However, with sufficient forethought it can be done successfully. We’ve just helped one client migrate from a commercial engine to Apache Solr in a matter of weeks: although at first glance Solr didn’t seem to support all of the features the commercial engine provided, it proved possible to simulate them using multiple queries, and with careful design for scalability the query performance is comparable.

Choosing one closed source engine to replace another doesn’t remove the risk that future corporate mergers & acquisitions will cause exactly the same lack of confidence that is no doubt affecting Autonomy customers – or huge increases in license fees, a drop in the quality of available support or the end of the product line altogether – and we’ve heard of all of these effects over the last few years. Moving to an open source search engine gives you freedom and control of the future of the technology your business is reliant upon, with a wealth of options for migration assistance, development and support.

So here’s our offer – we’d be happy to talk, for free (by phone or face-to-face for customers within reach of our Cambridge offices), to any Autonomy customers considering migration and to help them consider the open source options (some of these even have the Bayesian, probabilistic search features Autonomy IDOL provides) – and together with our partners we can also provide a level of ongoing support comparable to any closed source vendor. We don’t have salespeople, we don’t have a product to sell you and you’ll be talking directly to experts with decades of experience implementing search – and there’s no obligation to take things any further. We’d simply like to offer an alternative (and we believe, safer) route to effective search.

Google Search Appliance version 7 – too little too late?
http://www.flax.co.uk/blog/2012/10/10/google-search-appliance-version-7-too-little-too-late/
Wed, 10 Oct 2012 12:26:35 +0000

The post Google Search Appliance version 7 – too little too late? appeared first on Flax.

Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one though).

Although there’s a raft of new features, most of them have been introduced by the GSA’s competitors over the last few years or are available as open source (entity recognition or document preview for example). The GSA is also not a particularly cheap option, as commentators including Stephen Arnold have noticed: we’ve had clients tell us of six-figure license fees for reasonably sized collections of a few million documents – and that’s for two years, after which time you have to buy it again. Not surprisingly some people have migrated to other solutions.

However there’s another question that seems to have been missed by Google’s strategists: how a physical appliance can compete with cloud-based search. I can’t think of a single prospective client over the last year or so who hasn’t considered this latter option on both cost and scalability grounds (and we’ll shortly be able to talk about a very large client who have chosen this route). Although there may well be a hard core of GSA customers who want a real box in reassuring Google yellow, one wonders why Google haven’t considered a ‘virtual’ GSA to compete with Amazon’s CloudSearch amongst others.

It will be interesting to see if this version of the GSA is the last…

An open source replacement for the dtSearch closed source search engine
http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/
Tue, 24 Apr 2012 09:00:48 +0000

The post An open source replacement for the dtSearch closed source search engine appeared first on Flax.

We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to do as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).
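
To show what ‘positional information’ means in practice, here is a small Python sketch that reports where each keyword occurs in an article, as a token index plus character offsets. It is an illustration only and not the Lucene modification described here: the function `keyword_positions` and its simple word-character tokenisation are our own assumptions, and a real analyzer would tokenise far more carefully.

```python
import re
from typing import Dict, List, Tuple


def keyword_positions(text: str, keywords: List[str]) -> Dict[str, List[Tuple[int, int, int]]]:
    """Map each lowercased keyword to (token_index, start_offset, end_offset) per occurrence."""
    wanted = {k.lower() for k in keywords}
    hits: Dict[str, List[Tuple[int, int, int]]] = {k: [] for k in wanted}
    # Treat every run of word characters as one token, counted from zero.
    for index, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()
        if token in wanted:
            hits[token].append((index, match.start(), match.end()))
    return hits


if __name__ == "__main__":
    print(keyword_positions("Flax praised; flax again", ["Flax"]))
```

The character offsets are what a monitoring front end would use to highlight a keyword in the article, while the token index gives a rough sense of how early in the piece the mention falls.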

First, we developed a new Lucene Analyzer that speaks the same syntax as dtSearch, allowing us to index text input. On the search side we have a Lucene QueryParser that shares this syntax. To make it easier to use we’ve wrapped the whole lot in a modified Solr server. As we needed some features of very recent Lucene code, our modifications are based on a patch to Lucene trunk (and so the source code isn’t for the faint-hearted – if you need it let us know, but we’re not currently providing it for download).

We’re not sure if there’s anyone else out there who needs an open source alternative to dtSearch – but in case there is, please contact us.

UPDATE: We’ve had many people contact us in the 6 years since this article was written asking for the query parser code – I’m afraid the original code is very out of date and certainly wouldn’t work with modern Lucene or Lucene-based search engines like Solr/Elasticsearch. We’re thus not able to provide it. However, if you do have a dtSearch migration project we may be able to help you on a consultancy basis (we have carried out several similar projects for our clients) – do contact us for details.

More generally, what this project demonstrates is that even if you have significant investment in your existing search infrastructure it is entirely possible to move to an open source alternative, which may be faster and will almost certainly be more economically scalable. Does anyone else have a search engine they’d like to replace?

How not to make the same mistake twice
http://www.flax.co.uk/blog/2010/12/06/how-not-to-make-the-same-mistake-twice/
Mon, 06 Dec 2010 10:18:24 +0000

The post How not to make the same mistake twice appeared first on Flax.

We’ve been aware that some FAST customers will be considering migration for a while now – but Autonomy have finally caught up.

However, if you migrate from one closed source solution to another, how can you guarantee that the same sort of events that have led to the current situation won’t happen again? With open source, there’s no vendor lock-in, a wide choice of companies to assist you with development and integration, a wealth of different support options and of course no license fees to pay. Migrating from FAST is a common topic at conferences at the moment – read Jan Høydahl’s presentation, or see Michael McIntosh’s video. There are even open source document processing pipeline frameworks to replace the popular FAST one, and we’ve been evaluating some alternative language processing frameworks. Scaling isn’t an issue and in some cases you could significantly reduce your hardware budget.
