JSON has been the lingua franca of data exchange for many years. It’s human-readable, lightweight and widely supported. However, the JSON spec does not define what parsers should do when they encounter a duplicate key in an object, e.g.:
{ "foo": "spam", "foo": "eggs", ... }
Implementations are free to interpret this however they like, and when different systems interpret it differently this can cause problems.
We recently encountered this in an Elasticsearch project. The customer reported unusual search behaviour around a boolean field called draft. In particular, documents which were thought to contain a true value for draft were being excluded by the query clause
{ "query": { "bool": { "must_not": { "term": { "draft": false } }, ...
The version of Elasticsearch was 2.4.5 and we examined the index with Sense on Kibana 4.6.3. The documents in question did indeed appear to have the value
{ "draft": true, ... }
and therefore should not have been excluded by the must_not query clause.
To get to the bottom of it, we used Marple to examine the terms in the index. Under the bonnet, the boolean type is indexed as the term “T” for true and “F” for false. The documents which were behaving oddly had both “T” and “F” terms for the draft field, and were therefore being excluded by the must_not clause. But how did the extra “F” term get in there?
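To make the exclusion logic concrete, here is a toy model of what the index looked like. It is only a sketch: the docs array and the matchesTerm and mustNotTerm helpers are made up for illustration and are not part of Elasticsearch or Lucene. It assumes only what is described above, namely that each document carries a set of terms for draft and that must_not drops any document containing the given term.

```javascript
// Toy model of an inverted index: each document maps a field name to the
// set of terms indexed for it. All names here are illustrative only.
const docs = [
  { id: 1, terms: { draft: new Set(["T"]) } },      // draft: true
  { id: 2, terms: { draft: new Set(["F"]) } },      // draft: false
  { id: 3, terms: { draft: new Set(["F", "T"]) } }, // duplicate keys: false and true
];

// A term clause matches when the field's term set contains the term.
function matchesTerm(doc, field, term) {
  const terms = doc.terms[field];
  return terms !== undefined && terms.has(term);
}

// must_not keeps only the documents that do NOT match the clause.
function mustNotTerm(allDocs, field, term) {
  return allDocs.filter((doc) => !matchesTerm(doc, field, term));
}

const hits = mustNotTerm(docs, "draft", "F").map((doc) => doc.id);
console.log(hits); // [ 1 ]
```

Document 3 is excluded even though it also has a “T” term, which is the behaviour the customer was seeing.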
After some more experimentation we tracked it down to a bug in our indexer application, which under certain conditions was creating documents with duplicate draft keys:
{ "draft": false, "draft": true, ... }
So why was this not appearing in the Sense output? It turns out that Elasticsearch and Sense/Kibana interpret duplicate keys in different ways. When we used curl instead of Sense we could see both draft items in the _source field. Elasticsearch was behaving consistently, storing and indexing both draft fields. However, Sense/Kibana was quietly dropping the first instance of the field and displaying only the second, true, value.
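The difference between the two views can be reproduced in a few lines. This is a sketch under the assumption that a JavaScript-based console parses the response with JSON.parse and re-serialises it for display; I have not confirmed that this is what Sense does. The variable names are illustrative.

```javascript
// Raw document text, as an HTTP client such as curl would show it.
const rawSource = '{ "draft": false, "draft": true }';

// Assumed behaviour of a JavaScript-based viewer: parse, then re-serialise.
const parsed = JSON.parse(rawSource);
const displayed = JSON.stringify(parsed);

console.log(rawSource.match(/"draft"/g).length); // 2 occurrences in the raw text
console.log(displayed);                          // {"draft":true}
```

JSON.parse keeps the last value for a repeated key, so the first draft entry disappears without any warning.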
I’ve not looked at the Sense/Kibana source code, but I imagine this is just a consequence of being implemented in JavaScript. I tested this in Chrome (59.0.3071.115 on macOS) with the following script:
<!DOCTYPE html>
<html>
<head></head>
<body>
<script>
var o = {
  s: "this is some text",
  b: false,
  b: true
};
console.log("value of o.b", o.b);
console.log("value of o", JSON.stringify(o, null, 2));
</script>
</body>
</html>
which output (with no warnings)
value of o.b true
value of o {
  "s": "this is some text",
  "b": true
}
(Note that the order of the two b entries does matter: in a JavaScript object literal the last occurrence of a duplicate key wins, so swapping them gives false.)
Ultimately this wasn’t caused by any bugs in Elasticsearch, Kibana, Sense or JavaScript, but the different ways in which duplicate JSON keys were being handled made finding the ultimate source of the problem harder than it needed to be. If you are using the Kibana console (or Sense with older versions) for Elasticsearch development then this might be a useful thing to be aware of.
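One way to catch this kind of problem earlier is to check the raw JSON text for repeated keys before it is sent to the index. The function below is a minimal sketch rather than anything we ran in the project: findDuplicateKeys and its path format are invented for illustration, and it relies on JSON.parse to reject invalid input before scanning.

```javascript
// Returns the paths of keys that occur more than once within any object
// in the given JSON text. Hypothetical helper, for illustration only.
function findDuplicateKeys(text) {
  JSON.parse(text); // throws on invalid JSON, so the scanner can assume valid input
  const duplicates = [];
  let pos = 0;

  const skipWhitespace = () => {
    while (pos < text.length && /\s/.test(text[pos])) pos++;
  };

  // Reads a JSON string starting at its opening quote and returns its value.
  const readString = () => {
    const start = pos;
    pos++; // opening quote
    while (text[pos] !== '"') {
      pos += text[pos] === "\\" ? 2 : 1; // step over escaped characters
    }
    pos++; // closing quote
    return JSON.parse(text.slice(start, pos));
  };

  const readValue = (path) => {
    skipWhitespace();
    const ch = text[pos];
    if (ch === "{") {
      pos++;
      const seen = new Set(); // keys seen so far in this object only
      skipWhitespace();
      if (text[pos] === "}") { pos++; return; }
      while (true) {
        skipWhitespace();
        const key = readString();
        const keyPath = path ? path + "." + key : key;
        if (seen.has(key)) duplicates.push(keyPath);
        seen.add(key);
        skipWhitespace();
        pos++; // colon
        readValue(keyPath);
        skipWhitespace();
        if (text[pos++] === "}") return; // otherwise it was a comma
      }
    } else if (ch === "[") {
      pos++;
      skipWhitespace();
      if (text[pos] === "]") { pos++; return; }
      while (true) {
        readValue(path + "[]");
        skipWhitespace();
        if (text[pos++] === "]") return; // otherwise it was a comma
      }
    } else if (ch === '"') {
      readString();
    } else {
      // number, true, false or null: advance to the next delimiter
      while (pos < text.length && !/[,\]}\s]/.test(text[pos])) pos++;
    }
  };

  readValue("");
  return duplicates;
}

console.log(findDuplicateKeys('{ "draft": false, "draft": true }')); // [ 'draft' ]
```

An indexer could log or reject any document for which this returns a non-empty list.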
I haven’t tested Solr’s handling of duplicate JSON keys yet but that would probably be an interesting exercise.