I was recently helping one of our many FindTuner clients address an issue discovered after updating to Solr 6.2. Originally, a search for a product, with products being added to the results through merchandising strategies, would return 50 results (49 organic and 1 artificially added). After the upgrade, the same search returned nothing at all. The underlying query was very simple:
- q=shoes OR id:123
While the query for shoes originally returned 49 products, upon adding the "OR id:123" clause we received 0 results. Quite puzzling on the surface as any boolean OR clause should, at the very least, simply union more data into the results...certainly not reduce data. After deconstructing the query to the most basic components, it became apparent that the handling of Boolean behavior had changed.
Solr Boolean Queries pre-5.5
The Extended Dismax Query Parser (or edismax, as we all call it) has been a staple of the Solr distribution since Solr 3.1, which dates to March 2011. Designed to merge the best of both worlds from the Lucene Query Parser and the Dismax Query Parser, it has been the workhorse for many search solutions that rely on good relevancy, and a "Google-like" search experience.
One of the important features of edismax was the addition of the use of Boolean Query behavior to Dismax. Through a combination of both, it was possible to have a very simple query net precisely the right results. Take for example:
- q=blue OR category:shoes
In this case, blue is searched for in all of the fields defined by the qf parameter, while shoes is searched for only in the category field, and the two result sets are unioned together. More complex logic could be expressed by grouping clauses with parentheses to create very refined queries:
- q=shoes AND ((size:large OR size:medium) AND (color:red OR color:blue))
(Take these examples with a grain of salt...the approach is crude, but it does give a simple view of how it works.)
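For anyone who wants to send queries like these from a script rather than a browser, here is a minimal sketch of building the request URL. The host, the "products" collection name, and the qf field list are assumptions made up for illustration; only the q, qf, mm, and defType parameter names come from Solr itself.

```python
from urllib.parse import urlencode

# Assumed values for illustration only: the host, collection name, and
# qf field list are hypothetical and not taken from any real setup.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def edismax_url(q, qf="name description", mm=None):
    """Build a request URL that hands q to the edismax parser."""
    params = {"defType": "edismax", "q": q, "qf": qf}
    if mm is not None:
        # Only send mm when the caller overrides it explicitly.
        params["mm"] = mm
    # urlencode takes care of escaping spaces, colons, and parentheses.
    return SOLR_SELECT + "?" + urlencode(params)


url = edismax_url(
    "shoes AND ((size:large OR size:medium) AND (color:red OR color:blue))"
)
print(url)
```

Letting urlencode do the escaping keeps the boolean operators and parentheses intact on the wire, which matters once the query string gets this nested.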
Now, I've been keeping my eye on a very old Solr JIRA ticket for quite a while: SOLR-2649. One serious consequence of using Boolean query behavior in the edismax parser was always that the Minimum Should Match (mm) behavior of edismax would be ignored in the presence of a boolean operator. To understand this a bit better, consider our first query:
- q=blue OR category:shoes
This would normally have retrieved either "blue" things, or the category of shoes. But compare this to the following query:
- q=blue suede OR category:shoes
Prior to the resolution of SOLR-2649 this would be interpreted as: blue OR suede OR category:shoes. If you had set your default boolean operator to q.op=OR then you would have seen no issue...but this is not common. Most search solutions default to q.op=AND, and the results of this query would be highly incorrect. This could be worked around with either of:
- q=+blue +suede OR category:shoes
- q=(blue AND suede) OR category:shoes
In reality, this ignored the Minimum Should Match behavior entirely, but it was a workable solution for most queries. This is all well and good, but if the story ended there we wouldn't have much to talk about!
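The explicit-grouping workaround above can be sketched as a small query-rewriting helper. This is a hypothetical function of my own, not part of Solr or any client library, and it assumes the free-text part is plain whitespace-separated words with no quoted phrases or special characters that need escaping.

```python
def require_all_terms(free_text, fielded_clause):
    """Group the free-text terms with AND, then OR in the fielded clause.

    Hypothetical helper: it splits on whitespace only and does not
    escape Solr special characters or handle quoted phrases.
    """
    terms = free_text.split()
    if not terms:
        return fielded_clause
    if len(terms) == 1:
        grouped = terms[0]
    else:
        grouped = "(" + " AND ".join(terms) + ")"
    return f"{grouped} OR {fielded_clause}"


print(require_all_terms("blue suede", "category:shoes"))
# prints: (blue AND suede) OR category:shoes
```

Spelling out the AND inside the parentheses takes q.op and mm out of the picture for those terms, which is exactly why it worked as a stopgap.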
Solr Boolean Queries post-5.5
This is all up-ended now that SOLR-2649 has been resolved in Solr 5.5. To illustrate this, I grabbed the latest copies of the 5.4 and 5.5 branches (5.4.1 and 5.5.2) and ran queries side by side. If you'd like to follow along, here is what I did:
cd solr-5.4.1/bin
./solr start -e techproducts
Now that I have a very basic index up, I'll use the /browse handler, which uses edismax and a 100% mm setting. I really like using this very simple data set since it provides an uncluttered way of looking at Solr behavior. I've been using this sample set to train hundreds of students for over five years...it has had very little change over the years and is very reliable.
With our 32 products from this simple data set, I'll perform an OR query that should return both products:
- http://localhost:8983/solr/techproducts/browse?q=id:GB18030TEST+OR+id:SP2514N
All is well. Now I'll establish a second instance of Solr using the 5.5.2 distribution:
cd solr-5.5.2/bin
./solr start -e techproducts -p 8984
And perform the same base queries:
- http://localhost:8984/solr/techproducts/browse
- http://localhost:8984/solr/techproducts/browse?q=id:GB18030TEST+OR+id:SP2514N
Note that we get no products at all. There is no change to the underlying data, nor the underlying handler. We can, however, see the desired results simply by changing the mm parameter.
- http://localhost:8984/solr/techproducts/browse?q=id:GB18030TEST+OR+id:SP2514N&mm=1
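If you want to flip between the two instances without hand-editing URLs, a small sketch like the following generates the comparison set. It assumes the 5.4.1 instance listens on port 8983 and the 5.5.2 instance on port 8984, as in the URLs in this walkthrough; the helper name is my own.

```python
from urllib.parse import urlencode

# Assumed ports: 8983 for Solr 5.4.1 and 8984 for Solr 5.5.2.
INSTANCES = {"5.4.1": 8983, "5.5.2": 8984}


def browse_url(port, q, mm=None):
    """Build a /browse URL for the techproducts example, optionally overriding mm."""
    params = {"q": q}
    if mm is not None:
        params["mm"] = mm
    # safe=":" leaves the field:value colon readable instead of %3A.
    query_string = urlencode(params, safe=":")
    return f"http://localhost:{port}/solr/techproducts/browse?{query_string}"


query = "id:GB18030TEST OR id:SP2514N"
for version, port in INSTANCES.items():
    print(version, "handler default mm:", browse_url(port, query))
    print(version, "mm overridden to 1:", browse_url(port, query, mm="1"))
```

Opening the four printed URLs side by side makes it easy to see which combination of version and mm value changes the result count.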
To make sense of what's going on here, we can review the follow-up work done in SOLR-8812 to firm this up. More importantly, that ticket explains the inner workings of the new behavior, and Greg Pendlebury adds a very helpful comment describing what is really going on.
The most critical part of Greg's comment is this:
So now that SOLR-2649 has come along, it slightly muddies the water because:
- q.op is no longer hard coded to OR. Pre-patch the user could say q.op=AND, but it didn't do anything to the query
- The presence of an operator no longer turns off the mm feature
This actually brings the query behavior into a much more rational frame of view. The mm parameter now becomes the first-class citizen of the edismax behavior, as it should, and governs how boolean behavior (which is still fully supported) operates. Consider the following query (remembering that we are still using mm=100%):
- http://localhost:8984/solr/techproducts/browse?q=electronics+(6H500F0+OR+IW-02)
This is functionally the same as:
- http://localhost:8984/solr/techproducts/browse?q=electronics+OR+(6H500F0+OR+IW-02)
- http://localhost:8984/solr/techproducts/browse?q=electronics+AND+(6H500F0+OR+IW-02)
On the surface, the fact that the AND and OR following the word electronics seem to be ignored is a bit alarming, but this makes a great deal more sense when we compare the following:
- http://localhost:8984/solr/techproducts/browse?q=electronics+(ipod+OR+IW-02)
- http://localhost:8984/solr/techproducts/browse?q=electronics+(ipod+AND+IW-02)
These return 3 and 1 products respectively. We can dissect the behavior as much as we like by combining boolean operators with different mm parameter values. It really permits the mm parameter to shine while still giving us access to boolean logic in our queries when we need it.
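To build some intuition for why this happens, here is a toy model of the post-5.5 behavior. It is my simplified reading of what is described above, not Solr's implementation: it assumes mm counts only the top-level optional clauses, that a percentage is rounded down, and that operators inside parentheses behave as ordinary boolean logic. The documents are invented for the example and are not the techproducts data, and negative or conditional mm values are not handled.

```python
def required_matches(num_optional, mm):
    """How many top-level optional clauses must match under this toy model."""
    if isinstance(mm, str) and mm.endswith("%"):
        # Percentage of the optional clauses, rounded down.
        return num_optional * int(mm[:-1]) // 100
    # A plain integer is an absolute count, capped at the clause count.
    return min(int(mm), num_optional)


def term(t):
    return lambda doc: t in doc


def any_of(*clauses):
    return lambda doc: any(c(doc) for c in clauses)


def all_of(*clauses):
    return lambda doc: all(c(doc) for c in clauses)


def count_matches(docs, top_level_clauses, mm):
    """Count docs matching at least the required number of top-level clauses."""
    need = required_matches(len(top_level_clauses), mm)
    return sum(
        1
        for terms in docs.values()
        if sum(1 for clause in top_level_clauses if clause(terms)) >= need
    )


# Invented documents: each is just a set of matchable tokens.
id_docs = {"a": {"id:GB18030TEST"}, "b": {"id:SP2514N"}}
id_query = [term("id:GB18030TEST"), term("id:SP2514N")]
print(count_matches(id_docs, id_query, "100%"))  # no doc carries both ids
print(count_matches(id_docs, id_query, 1))       # either id is enough

toy_docs = {
    "a": {"electronics", "ipod"},
    "b": {"electronics", "IW-02"},
    "c": {"electronics", "ipod", "IW-02"},
    "d": {"electronics"},
    "e": {"ipod"},
}
nested_or = [term("electronics"), any_of(term("ipod"), term("IW-02"))]
nested_and = [term("electronics"), all_of(term("ipod"), term("IW-02"))]
print(count_matches(toy_docs, nested_or, "100%"))
print(count_matches(toy_docs, nested_and, "100%"))
```

Under these assumptions, two id clauses joined by OR are simply two optional clauses, and mm=100% demands that a single document satisfy both, which no document can. Dropping mm to 1 restores the union, while the operator inside the parentheses still decides how the nested group matches.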
But it's not a perfect solution, and it's not likely to ever be a perfect solution. The Disjunction Maximum concept is a bit sloppy, favoring recall over boolean logic's exacting precision. When mixing the two together, we really are attempting to create gold from search alchemy.
One final thought to consider from Greg's comment in SOLR-8812:
"If the user has a use case that includes both boolean parameters and mm logic... have fun."
Looks like I'm going to have a lot of fun!