Solr has multiple ways of dealing with parent-child relationships, and each has its own features and drawbacks. Classic Joins have well-documented performance issues, especially on larger datasets. Block-Joins require additional indexing work and still increase document count, which affects performance to a lesser degree. Full denormalization (having only parents as Solr documents) can be difficult to implement without losing functionality. Recently, we engaged with a client to move them from a Block-Join configuration to a fully denormalized solution. Here are some tips and tricks to consider if you are looking to make a similar move.
What is Denormalization?
Denormalization in Solr (and more commonly in databases) is the process of moving from multiple document types linked by joins to a single document type containing all relevant information. When dealing with parent-child relationships, this is usually done in one of two ways: copying parent information onto child documents, or incorporating child information into parent documents. Pushing parent information into child documents will result in more overall documents (assuming there are many more child than parent documents), which may affect performance. Facet calculations may also become more expensive or inaccurate if facets are meant to count parent documents. Incorporating child documents into parents can significantly improve performance, but it involves handling many complex scenarios and may be impossible in some cases.
For this post, we’ll focus on the latter case.
A well-tuned denormalized index using parent documents can use a fraction of the resources of a normalized index, or of an index denormalized to child documents. A parent-denormalized index can be stored in less than a quarter of the disk space (down to 1/20th in at least one example), depending on complexity. Reduced index size will also likely improve indexing time, and the reduced document count can eliminate the need for sharding, which simplifies architecture requirements. For one client, 4 shards at 20GB each were reduced to a single ~4GB shard, and indexing time went from 90 minutes (with many optimizations) to 7 minutes (without any optimization work). Overall query throughput and latency will also improve. Querying against only parent documents greatly decreases the footprint Solr needs to handle a request. Joins also require alternate query parsers, which can be difficult to combine with complex queries. In general, large improvements should be expected.
Handling Independent Child Attributes
The first and easiest case to handle is child attributes that are independent. Independence means that when searching for child attributes x and y, it’s important that some child document has each but not necessarily that the same child document have both. A GPS device may have adapter:usb, adapter:lighter, attachment:windshield, attachment:dashboard as part of its fit data, but filtering on adapter works the same regardless of whether you filter on attachment. This case can be handled by simply indexing each value from each child document into a single multi-valued field. These flattened fields can then be used in the same manner as the regular parent fields.
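As a rough illustration of this flattening step, the Python sketch below merges child values into multi-valued parent fields before the document is sent to Solr. The document shape, the field names (id, name, adapter, attachment) and the helper flatten_children are assumptions made for this example; they are not part of any Solr API.

```python
def flatten_children(parent, children, fields):
    """Copy each listed child field onto the parent as a multi-valued field.

    Values are de-duplicated while keeping first-seen order. The field
    names are hypothetical; use whatever your schema defines.
    """
    doc = dict(parent)
    for field in fields:
        values = [child[field] for child in children if field in child]
        doc[field] = list(dict.fromkeys(values))
    return doc


parent = {"id": "gps-100", "name": "GPS device"}
children = [
    {"adapter": "usb", "attachment": "windshield"},
    {"adapter": "lighter", "attachment": "dashboard"},
    {"adapter": "usb", "attachment": "dashboard"},
]

doc = flatten_children(parent, children, ["adapter", "attachment"])
print(doc)
```

The resulting document carries one adapter field and one attachment field, each holding every distinct value seen across the children, so filtering on one does not depend on the other.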
Handling Dependent Child Attributes
For fields that interact, more care needs to be taken. If a user wants a part matching a 2008 Ford F-150, showing parts that match a 2008 Dodge Durango and a 2010 Ford F-150 (but not a 2008 Ford F-150) will be a bad experience. The first step should be to isolate groups of dependent fields. Making sure a product matches a given use may involve several fields, but will likely be independent of price. Price itself may depend on store (more on that later) but not other fields.
A convenient tool for handling these scenarios is the Solr PathHierarchyTokenizerFactory. By indexing multiple fields combined with a separator (year/make/model/submodel) you can perform filtering and faceting on any incomplete specification (year/make/model, year/make). For this to work, you must be able to put the fields in a fixed order and always use them in that order. Allowing for a couple of variations is easy; simply create new path fields for them. However, trying to allow for every variation will result in a combinatorial explosion that quickly becomes unusable (see the When to Avoid Denormalizing section). Overall, the more you can break down requirements to use a few specific orders, the easier this process will be.
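Conceptually, the path tokenizer emits one token for each leading prefix of the indexed path. The Python sketch below mimics that behavior outside Solr, purely to show which values become filterable; it is not the Solr implementation, and the helper names fitment_path and path_prefixes are made up for this example.

```python
def fitment_path(year, make, model, submodel):
    """Join the dependent fields in one fixed order, lowercased."""
    return "/".join(str(part).lower() for part in (year, make, model, submodel))


def path_prefixes(path, delimiter="/"):
    """Return every leading prefix of a delimited path, shortest first."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


path = fitment_path(2008, "Ford", "F-150", "XLT")
tokens = path_prefixes(path)
print(tokens)
```

Because "2008/ford" and "2008/ford/f-150" are both indexed, a filter on either partial specification matches this product, while "2008/dodge" does not.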
Handling Semi-Dependent Attributes
Using PathHierarchyTokenizerFactory can also handle independent fields that share a common set of dependent fields. For example, shipping availability and price may both vary by store, but list price won’t change based on shipping availability. By indexing each “fieldname/fieldValue” pair at the end of the dependent chain (say “year/make/model”) these become effectively filterable, limited based on the initial chain, and independent of each other. You can search for a free shipping product from store 256 with “fq=availability_field:256/free_shipping” and then separately limit it by price (mentioned below).
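A minimal sketch of building such suffix values at index time follows, assuming each product has one row per store. The row layout, the "pickup" option and the helper availability_entries are hypothetical; only availability_field and the 256/free_shipping value come from the example above. The membership checks stand in for an exact-value filter query.

```python
def availability_entries(store_rows):
    """Build 'storeId/option' values for one product's availability field."""
    entries = []
    for row in store_rows:
        for option in row["shipping_options"]:
            entries.append(f"{row['store_id']}/{option}")
    return entries


rows = [
    {"store_id": 256, "shipping_options": ["free_shipping", "pickup"]},
    {"store_id": 300, "shipping_options": ["pickup"]},
]

entries = availability_entries(rows)

# Stand-in for fq=availability_field:256/free_shipping (exact value match).
matches_store_256 = "256/free_shipping" in entries
# The same option at a different store must not match.
matches_store_300 = "300/free_shipping" in entries
print(entries, matches_store_256, matches_store_300)
```

Price would go into its own store-prefixed field, so restricting by availability never constrains which price values are eligible.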
If you are moving to Solr from a database, the existing database design will likely tell you which fields are dependent on which based on table structure. It’s worth noting composite ids for child documents, as these will likely be the dependent paths that other attributes will follow as suffixes.
Special case: Price and other continuous attributes
If you have attributes that are continuous, ranged faceting can become difficult. One specific and common example of this is price by store. In order to use a path field or separator, you need to have a text field. However, this prevents numerical ranges from working correctly. In addition, you want to see all prices in a given range only for the store(s) the user selected. One solution to this dilemma is to use a path field containing storeId/price but with price left-padded with zeros to a fixed width. A $5.99 and a $199.99 product in store 256 would be “256/0000005.99” and “256/0000199.99” respectively. Ranged facets (using the [start TO end] syntax) will then operate as desired based on the default lexicographic filtering: fq=storePrice:[256/0000004.99 TO 256/0000009.99] will match products with a store 256 price between $4.99 and $9.99, but not other store prices in that range. This will require some UI work to interpret prices correctly, but it can save the index from a large number of bloating price documents.
Retrieving child fields
One difficulty of denormalization is retrieving information from a specific child document. While a given product will likely have one price for a given store, it may have hundreds of store-price field entries. Simply returning the entire list via the fl Solr property will involve a lot of serialization/deserialization and network traffic. Solr Highlighting can be used as an effective alternative method.
The highlighting feature in Solr is generally used for showing where searches match for a given document. By setting the highlight query to the value you want (2008/ford/f-150/* for vehicle attributes, or 256/* for store price) you can match only the correct value and avoid returning the irrelevant ones. It is recommended to use termVectors for increased performance. TermVectors will allow Solr to highlight the original text without having to re-analyze the entire contents of the field (which may be large depending on field).
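A request along these lines might be assembled as in the following Python sketch. The hl, hl.fl and hl.q parameters are standard Solr highlighting parameters; the field name storePrice, the fl=id choice and the helper highlight_params are assumptions for this example. The pattern is written unescaped, as in the text above; depending on your Solr version and query parser, the "/" may need escaping, so verify it against your own setup.

```python
from urllib.parse import urlencode


def highlight_params(main_query, field, prefix):
    """Build request parameters that highlight only values under one prefix."""
    return {
        "q": main_query,
        "fl": "id",                        # skip returning the full child list
        "hl": "true",
        "hl.fl": field,                    # field holding the path values
        "hl.q": f"{field}:{prefix}/*",     # may need '/' escaped in practice
    }


params = highlight_params("*:*", "storePrice", "256")
query_string = urlencode(params)
print(query_string)
```

The matching values then come back in the highlighting section of the response rather than in the stored field list.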
When to Avoid Denormalizing
All of this benefit is, of course, predicated on the parent-denormalized architecture working for your needs. Here are a few things to watch out for when considering denormalizing:
Child documents are searched directly
If users search for child documents (meaning a list of child documents is returned), that may be incredibly difficult to replicate using parent-denormalized documents. Consider using faceting to support this, but otherwise an alternative design may be necessary.
Child document fields can’t be ordered
When users may specify dependent child attributes by any of several fields, combinatorial explosion will likely make this method untenable. Up to three elements (with every possible order, A/B/C, A/C/B, B/A/C, B/C/A, C/A/B, C/B/A) can be reasonably supported but going beyond that quickly becomes difficult to manage (24 fields for every order of 4 elements).
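The growth is factorial, which a short Python sketch makes concrete. The helper path_orders is hypothetical and simply enumerates the path fields you would need to support every ordering.

```python
from itertools import permutations
from math import factorial


def path_orders(fields):
    """List every field ordering that would need its own path field."""
    return ["/".join(order) for order in permutations(fields)]


orders = path_orders(["A", "B", "C"])
print(orders)

# Number of path fields needed to support every order of n elements.
counts = {n: factorial(n) for n in range(2, 7)}
print(counts)
```

Three elements need 6 path fields and four need 24; five would need 120, which is why restricting users to a few fixed orders matters.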
Child document structure changes frequently
Getting parent-denormalization to function well usually requires some careful setup. If the essential structure of the child documents changes significantly, much of this work may need to be re-done. Frequent structure changes could mean more time spent on re-engineering than on new features. Some of this will be inevitable, with significant changes required in existing systems to handle updates, but a parent-denormalized architecture is particularly sensitive to them. For most datasets this will not be an issue, as extremely volatile datasets are uncommon and difficult to work with in any case.
Small index size
If your index is very small, ~100K-800K documents or so counting child documents, then many of the performance improvements will be modest. It may not be worth the additional effort to set up a parent-denormalized index. However, if you expect your index size to scale out of that range then it would be worth considering denormalization before scale becomes a problem.
Denormalization is a powerful tool to increase performance and get more out of Solr. While it isn’t appropriate for all datasets and scenarios, careful analysis and index setup can help this model work in a surprising array of situations. Even if you don’t have parent-child relationships, some of these techniques may help you with unrelated problems. Learning advanced techniques like these may be intimidating, but they can help you get the most out of your Solr installation.