Pivot Faceting (Decision Trees) in Solr 1.4.
Solr Pivot Facets
Solr faceting breaks down searches for terms, phrases, and fields in a Solr index into aggregated counts by matched fields or queries. Facets are a great way to “preview” further searches, as well as a powerful aggregation tool in their own right.
Before Solr 4.0, facets were only available at one level, meaning something like “counts for field ‘foo’” for a given query. Solr 4.0 introduced pivot facets (also called decision trees) which enable facet queries to return “counts for field ‘foo’ for each different field ‘bar’” – a multi-level facet across separate Solr fields.
Decision trees come up a lot, and at work we need results along multiple axes, typically “field/query by year” for a time series. However, we use Solr 1.4.1 and are unlikely to migrate to Solr 4.0 anytime soon. Our existing approach was to query for the top “n” field values in a first query, then perform a second-level facet query by year for each of those results. So, for the top 20 results, we would perform 1 + 20 queries, which is clearly not optimal when all of this has to happen within a blocking HTTP request in our underlying web application.
Hoping to get something better than our 1 + n separate queries approach, I began researching the somewhat more obscure facet features present in Solr 1.4.1. And after some investigation, experimentation and a good amount of hackery, I was able to come up with a “faux” pivot facet scheme that mostly approximates true pivot faceting using Solr 1.4.1.
We’ll start by examining some real pivot facets in Solr 4.0, then look at the components and full technique for simulated pivot facets in Solr 1.4.1.
Pivot Faceting in Solr 4.0
Pivot facets were added to Solr in SOLR-792. A good introductory article is available on the Solr.pl site. To see the basic operation in action, let’s just use the “example” setup that comes with the Solr 4.0 distribution (located at “solr_4.0_path/solr/example”).
Let’s start the Solr process:
Next, we want to upload a series of documents. We’ll take the provided “exampledocs/books.csv” file, tweak it slightly and update via a CSV handler. The CSV format is: first line is field names, other lines are data. Note that I have changed some field names in the original sample “exampledocs/books.csv” file. The following should be written out to a new file, which I am calling “sample_books.csv”.
Note that we use _s fields for simplicity, forcing string fields for what would ordinarily be text fields. Solr facets return counts over indexed terms, not stored values, and for string fields the two are identical. In a real deployment, you would use copyField directives to copy text fields to string fields for faceting.
We’ll upload this file to Solr using curl:
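As a rough Python equivalent of the curl upload, the sketch below only constructs the HTTP request. The `/update/csv` path, the `commit=true` parameter, and the content type are assumptions based on the stock example configuration, and `csv_update_request` is a hypothetical helper name.

```python
from urllib.request import Request

def csv_update_request(csv_bytes, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request for Solr's CSV update handler."""
    return Request(
        base_url + "/update/csv?commit=true",  # assumed handler path and commit flag
        data=csv_bytes,                        # a request body makes this a POST
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )

def upload_csv(path, base_url="http://localhost:8983/solr"):
    """Send the file at `path` (e.g. sample_books.csv) to a running server."""
    from urllib.request import urlopen
    with open(path, "rb") as handle:
        request = csv_update_request(handle.read(), base_url)
    with urlopen(request) as response:
        return response.status
```

Calling `upload_csv("sample_books.csv")` would perform the actual upload, provided the server is running and the file exists.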
You should now be able to query the 10 sample documents at: http://localhost:8983/solr/admin/form.jsp
Now that we have some documents to work with, let’s do a simple pivot query on price by genre:
(Note that I’ve added line breaks and escapes to show the parameters more clearly).
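For readers who prefer to build the request in code, here is a minimal Python sketch of a price-by-genre pivot query. The `facet.pivot` parameter and field names come from this article. The `/solr/select` path, `q=*:*`, `rows=0`, and `wt=json` are assumptions, and `pivot_params` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed base URL: the default select handler of the Solr 4.0 example server.
SOLR_40_SELECT = "http://localhost:8983/solr/select"

def pivot_params(pivot_fields):
    """Return query parameters for a pivot facet over the ordered fields."""
    return [
        ("q", "*:*"),      # match all documents (assumption)
        ("rows", "0"),     # we only want facet counts (assumption)
        ("wt", "json"),    # JSON response writer (assumption)
        ("facet", "true"),
        ("facet.pivot", ",".join(pivot_fields)),
    ]

params = pivot_params(["price_f", "genre_s"])
url = SOLR_40_SELECT + "?" + urlencode(params)
print(url)
```

Because `urlencode` percent-encodes the values, no manual escaping is needed.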
This gives us decision tree results for the facet_pivot field:
Nice intuitive results, for a fairly straightforward facet query. However, now to the bigger question: can we approximate this in Solr 1.4.1, which doesn’t have the facet.pivot query option?
Pivot Faceting in Solr 1.4.1
Solr 1.4.1 has much more limited facet support than 4.0. The building blocks that we will use to cobble together a “faux” pivot query are:
- facet.field: The normal facet field option.
- facet.query: The normal facet query option.
- fq: Filter queries (for restrictions).
- Local Params: We use a couple of Solr local parameters.
  - tag: Tags an fq with an arbitrary name.
  - key: Labels a facet field with an arbitrary name (instead of the field name).
  - ex: Excludes tagged fq’s from being operative on a given facet field/query.
Note that either facet fields or facet queries can be used with this technique – I’ll only show fields, but everything applies equally to queries.
Setup
At this point, you should take a Solr 1.4.1 distribution and set it up exactly as we did above for Solr 4.0 and upload our simple 10-document CSV file to the running server. For simplicity here (and to keep my head on straight), I ended up running my Solr 1.4.1 server on port 8984, so that I could also keep the Solr 4.0 server running on port 8983. Here’s what I did:
From here on, it is assumed you now have a populated Solr 1.4.1 server running on port 8984 (switch addresses / ports as appropriate for your actual setup).
Excluding Restrictions from Facets
The starting point for our pivot facets is excluding certain query restrictions for facets. A basic example is provided for tagging and excluding facets on the Solr wiki.
Let’s do a simple facet query on prices with a restriction of genre_s:scifi:
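A minimal Python sketch of such a restricted facet query follows. Port 8984 is the Solr 1.4.1 server from the setup above. The `/solr/select` path and `q=*:*` are assumptions, and `restricted_facet_params` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Solr 1.4.1 server on port 8984; the select path is an assumption.
SOLR_14_SELECT = "http://localhost:8984/solr/select"

def restricted_facet_params(facet_field, restriction):
    """Facet on one field while an fq restricts the matched documents."""
    return [
        ("q", "*:*"),            # match all documents (assumption)
        ("facet", "true"),
        ("facet.field", facet_field),
        ("fq", restriction),     # applies to both the results and the facets
    ]

params = restricted_facet_params("price_f", "genre_s:scifi")
url = SOLR_14_SELECT + "?" + urlencode(params)
print(url)
```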
Looking to our results in facet_fields, we see that we only have 2 hits (numFound), and the facet counts also add up to 2 (which represent our 2 SciFi books).
For situations like a drill down, Solr developers often want to run a basic query with full restrictions for the returned records, but get more information for facets. In this case, Solr allows tagging of fq’s, and keys / excludes on facets to conditionally remove fq’s from a given facet only.
So, let’s tag our fq as “SCIFI_FQ” and exclude it from our facet counts with ex, and then rename the facet results to “PRICE_KEY” using the key option:
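The local-params syntax is easier to see when the strings are built in code. This Python sketch assembles the tagged fq and the keyed, excluding facet field described above. The helper names are hypothetical, and `q=*:*` is an assumption.

```python
from urllib.parse import urlencode

def tagged_fq(tag, query):
    """Prefix a filter query with a {!tag=...} local param."""
    return "{!tag=%s}%s" % (tag, query)

def keyed_facet_field(field, key, exclude_tags):
    """Facet on `field`, ignoring the tagged fq's and renaming the output."""
    return "{!ex=%s key=%s}%s" % (",".join(exclude_tags), key, field)

params = [
    ("q", "*:*"),  # match all documents (assumption)
    ("facet", "true"),
    ("fq", tagged_fq("SCIFI_FQ", "genre_s:scifi")),
    ("facet.field", keyed_facet_field("price_f", "PRICE_KEY", ["SCIFI_FQ"])),
]
query_string = urlencode(params)  # percent-encodes "{", "!", "=", and "}"
print(query_string)
```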
Note that I have to escape the exclamation points and other special characters when issuing this query from the command line. Now, let’s look at the results:
We can first see that the exclusion of the tagged “SCIFI_FQ” filter query did not affect the overall numFound, which is still 2. However, for the facet field we applied the exclusion to, we now have facet results for records in the whole set (which is the effective query after the exclusion). Finally, our facet field has been renamed “PRICE_KEY” instead of the field name (“price_f”).
Constructing a Pivot Query
With the basic tag/key/exclude technique in mind, let’s now return to our original goal – create a pivot query on price by genre using Solr 1.4.1. We will do this by performing two queries:
- Perform a facet query for the top price results ordered by index.
- Create fq tagged exclusions for each facet result, then create multiple keyed facets on genre to give us each of our decision tree “leaf” results.
The first query is a very basic facet field query:
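A minimal sketch of that first-level request in Python: `facet.sort=index` reflects the “ordered by index” step above, while `q=*:*`, `rows=0`, and `wt=json` are assumptions, and `first_level_params` is a hypothetical name.

```python
from urllib.parse import urlencode

def first_level_params(field):
    """Parameters for the first-level facet query, sorted by indexed term."""
    return [
        ("q", "*:*"),              # match all documents (assumption)
        ("rows", "0"),             # facet counts only (assumption)
        ("wt", "json"),            # JSON response writer (assumption)
        ("facet", "true"),
        ("facet.field", field),
        ("facet.sort", "index"),   # order facet values by term, not by count
    ]

params = first_level_params("price_f")
print("http://localhost:8984/solr/select?" + urlencode(params))
```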
Which gives us four individual facet results: “5.99”, “6.99”, “7.95”, “7.99”.
We take each of those results and create specific fq tagged restrictions:
Each fq restricts results to one of the first-level values that we want next-level (genre) facet results for. To then get the pivot facet result for each of our four first-level values, we exclude all of the fq’s above except the one matching that value. Translating into facet parameters, this is:
The key is that we can specify multiple exclusions using a comma. Thus, looking to the facet key “5.99_GENRE”, we exclude all the fq restrictions except “FQ5.99”, which means that the facet results for that facet field key will be the facet counts for “fq=price_f:5.99” only. It’s kind of a twisted-double-negative logic, but it all works out.
Let’s put everything into our second-level query now:
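Generating these parameters by hand is error-prone, so here is a hedged Python sketch that derives the whole second-level query from the first-level values. The “FQ” tag prefix and the “_GENRE” key suffix follow the names used in this article. `q=*:*` and `rows=0` are assumptions, and `faux_pivot_params` is a hypothetical helper.

```python
from urllib.parse import urlencode

def faux_pivot_params(first_field, first_values, second_field, suffix):
    """Build one tagged fq per first-level value, plus one keyed facet field
    per value that excludes every tagged fq except its own."""
    tags = {value: "FQ" + value for value in first_values}
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]  # assumptions
    for value in first_values:
        # Values are inserted verbatim; real data may need query escaping.
        params.append(("fq", "{!tag=%s}%s:%s" % (tags[value], first_field, value)))
    for value in first_values:
        others = [tags[v] for v in first_values if v != value]
        local = "{!key=%s_%s ex=%s}" % (value, suffix, ",".join(others))
        params.append(("facet.field", local + second_field))
    return params

prices = ["5.99", "6.99", "7.95", "7.99"]
params = faux_pivot_params("price_f", prices, "genre_s", "GENRE")
for name, value in params:
    print(name, "=", value)
print(urlencode(params))
```

Each keyed facet ends up governed by exactly one remaining fq, which is the double-negative logic described above.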
Which gives us the “leaves” of the decision tree with our result keys: “5.99_GENRE”, “6.99_GENRE”, “7.95_GENRE”, and “7.99_GENRE”.
Looking at our original Solr 4.0 pivot query, we can cobble together our two Solr 1.4.1 queries to get an equivalent result. In the end, both produce the following decision tree for price by genre:
- 5.99: 2
  - fantasy: 2
- 6.99: 3
  - fantasy: 2
  - scifi: 1
- 7.95: 1
  - fantasy: 1
- 7.99: 4
  - fantasy: 3
  - scifi: 1
Victory!
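To illustrate the “cobbling together” step, the Python sketch below merges the two result sets into one nested structure. It assumes Solr’s default JSON facet layout, a flat `[term, count, term, count, ...]` list. The sample lists are hand-written from the tree above rather than captured server output, and the function names are hypothetical.

```python
def pairs(flat):
    """Turn a flat [term, count, term, count, ...] list into (term, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))

def assemble_tree(first_level, second_level, suffix="GENRE"):
    """Combine first-level counts with the keyed second-level facet results."""
    tree = {}
    for value, count in pairs(first_level):
        leaves = second_level.get("%s_%s" % (value, suffix), [])
        tree[value] = {
            "count": count,
            # Drop zero-count terms, which may be returned when no
            # facet.mincount is set.
            "children": {term: n for term, n in pairs(leaves) if n > 0},
        }
    return tree

# Hand-written sample data mirroring the decision tree in this article.
first_level = ["5.99", 2, "6.99", 3, "7.95", 1, "7.99", 4]
second_level = {
    "5.99_GENRE": ["fantasy", 2, "scifi", 0],
    "6.99_GENRE": ["fantasy", 2, "scifi", 1],
    "7.95_GENRE": ["fantasy", 1, "scifi", 0],
    "7.99_GENRE": ["fantasy", 3, "scifi", 1],
}
tree = assemble_tree(first_level, second_level)
print(tree["6.99"])
```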
Discussion and Practical Implications
Our “price by genre” example is a bit simplistic in that we can mostly get the same results with two standard Solr 1.4.1 facet field queries. But the faux pivot facet technique really shines for a “foo by bar”-type query where there are large numbers of first-level (“foo”) facet results. Say the first level has 10 results: the naive approach needs 11 queries (one for the top 10 “foo” values, then one second-level “bar” query for each of the 10). The faux pivot facet technique cuts this down to 2 queries total.
The method is generally applicable too. Although our examples here only use facet fields, the technique equally works for facet queries. And distributed search supports the approach as well.
Looking further, the technique can be applied to additional decision tree levels. In the Solr 4.0 world, this simply means adding another field like facet.pivot=price_f,genre_s,inStock_b to get further breakdowns for the “in stock” boolean field. For Solr 1.4.1, we would do a third query, with permutations of our previous tagged fq’s, as well as second-level fq’s. Then, we would have third-level keyed facet fields like: “6.99_fantasy_INSTOCK”, “6.99_scifi_INSTOCK”, etc. At this level, it certainly wouldn’t be pretty and would result in a beastly query, but shows that the technique only adds 1 more actual Solr query for each additional level in the faux decision tree.
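As a rough illustration of what the third-level facet fields could look like, this Python sketch enumerates every price/genre combination and excludes all tagged fq’s except the matching pair. The key format follows the examples above. The “FQ” tag prefix for genre values and the function name are assumptions.

```python
from itertools import product

def third_level_facet_fields(first_values, second_values, third_field, suffix):
    """One keyed facet field per (first, second) combination; each excludes
    every tagged fq except the two that define its branch."""
    first_tags = {v: "FQ" + v for v in first_values}
    second_tags = {v: "FQ" + v for v in second_values}  # assumed tag naming
    fields = []
    for a, b in product(first_values, second_values):
        excluded = [first_tags[v] for v in first_values if v != a]
        excluded += [second_tags[v] for v in second_values if v != b]
        local = "{!key=%s_%s_%s ex=%s}" % (a, b, suffix, ",".join(excluded))
        fields.append(local + third_field)
    return fields

fields = third_level_facet_fields(
    ["5.99", "6.99", "7.95", "7.99"], ["fantasy", "scifi"], "inStock_b", "INSTOCK"
)
for field in fields:
    print(field)
```

The matching query would also need one tagged fq for each price and each genre value, built the same way as at the second level.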
However, on the topic of query complexity, it is fair to point out that this type of query hackery really should be done programmatically to ensure correctness, and definitely not via the manual queries I provided above using curl. It’s tough keeping track of just the 4 first-level pivots in our example above, let alone a larger first level group, or more than 2 levels deep of pivots. Another benefit is that you can collapse your tag / key names to integer or other simple keys, and then have your program match things up later for the final assembled decision tree result.
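One possible way to collapse the names, sketched in Python: tags become “T0”, “T1”, ... and keys become “K0”, “K1”, ..., with a lookup table for translating keys back when assembling the tree. The naming scheme and the function name are purely illustrative assumptions.

```python
def compact_pivot_params(first_field, first_values, second_field):
    """Build fq / facet.field parameters using short integer-based names."""
    tags = ["T%d" % i for i in range(len(first_values))]
    key_to_value = {}
    params = []
    for i, value in enumerate(first_values):
        params.append(("fq", "{!tag=%s}%s:%s" % (tags[i], first_field, value)))
    for i, value in enumerate(first_values):
        key = "K%d" % i
        key_to_value[key] = value  # remembered for decoding the response
        others = [t for j, t in enumerate(tags) if j != i]
        local = "{!key=%s ex=%s}" % (key, ",".join(others))
        params.append(("facet.field", local + second_field))
    return params, key_to_value

params, key_to_value = compact_pivot_params(
    "price_f", ["5.99", "6.99", "7.95", "7.99"], "genre_s"
)
print(params[4])
print(key_to_value)
```

Short names also keep awkward characters in field values out of the tag and key names.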
As a final performance note, the faux pivot facet approach doesn’t really lighten the Solr server load, it just collapses what would otherwise be multiple queries into one query.
Conclusion
Reflecting on the above method, pivot facets are possible in Solr 1.4.1 at a cost of n separate queries, where n is the number of levels in the decision tree. So, if reducing the number of round trips between a web application and Solr is the goal, and you need pivot facets in a pre-4.0 Solr, this may well be the ticket.