luceneutil
Index Wikipedia's hierarchical categories and sub-categories as a FacetField
@rmuir observed in this issue that Wikipedia already has labels/categories per page, and these labels have sub-categories, etc.
This would be another great way of testing facet labels, in addition to the high-cardinality but flat RandomLabel we recently added.
another alternative would be to just do a simpler facet bench on geonames. maybe it is easier than wrestling wikipedia.
I'll work on this issue as I need it to benchmark the SSDV faceting changes I made. I think using the geonames dataset also makes more sense as I believe we no longer can generate new wiki docs. Also @rmuir pointed out that there is an obvious hierarchy in the geonames dataset.
I took a look at several different datasets that we could use for this benchmark. I first looked at the geonames allCountries.txt file and experimented with a hierarchy of country-code/admin1/admin2/admin3/admin4. Unfortunately, although this file has 12M rows, there were only 385688 unique hierarchies, with an average of 1 null value in each hierarchy. I also looked at OpenStreetMap, but extracting hierarchies from this dataset seemed like a hard and inefficient process, as detailed in this post. I finally looked at the National Address Database (NAD), which is maintained by the US DOT (it can be found here). It contains 65M rows of addresses in the United States. I got the following data for the NAD using a hierarchy of state/county/street_name/address_number:
Lines with nulls: 119359
Total lines processed: 65460372
Number of unique hierarchies: 53937396
I think based on this the NAD would be a good candidate for this benchmark as it is very high cardinality and the data is populated for the most part.
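For anyone who wants to reproduce this kind of measurement on another dataset, here is a minimal sketch of how the three numbers above could be computed. It is not the script used for the figures in this thread: the function name hierarchy_stats, the column indices, and the rule that a "null" is an empty or missing field are all assumptions made for illustration.

```python
import csv
import io


def hierarchy_stats(rows, columns):
    """Return (total rows, rows with an empty level, unique hierarchy paths).

    `rows` is any iterable of already-split records; `columns` lists the
    indices of the fields that make up the hierarchy, top level first.
    Both the indices and the "empty means null" rule are assumptions.
    """
    total = 0
    with_nulls = 0
    unique = set()
    for row in rows:
        total += 1
        # Treat a missing column the same as an empty one.
        path = tuple(row[c].strip() if c < len(row) else "" for c in columns)
        if any(not part for part in path):
            with_nulls += 1
        unique.add(path)
    return total, with_nulls, len(unique)


# Tiny made-up, tab-separated sample standing in for a real dump.
sample = "US\tCA\t037\nUS\tCA\t037\nUS\tNY\t\nDE\t\t\n"
rows = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
stats = hierarchy_stats(rows, columns=[0, 1, 2])
print(stats)
```

Keeping every path in an in-memory set is the simplest approach, but with tens of millions of unique hierarchies it needs a lot of RAM; hashing each path or sorting the file externally would be alternatives.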
ok, but i wasn't aware "very high cardinality" is the goal. Maybe that is what amazon has, but that's not what all users have. Some users have low-cardinality fields too, and faceting should be fast on those?
That is true, we can benchmark facets from both datasets separately then as a low and high cardinality test.
I think we have pretty good low cardinality facet fields. We have two flat ones (dayOfYear, with 365 unique values, and weekday, with seven values), and one hierarchical YYYY/MM/DD.
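To make the shape of those three fields concrete, here is a rough sketch that derives them from a single date. The field names dayOfYear and weekday come from the comment above; the key date_path, the function name date_facets, and the exact value formatting are hypothetical and may not match what luceneutil actually indexes.

```python
from datetime import date

# Fixed English names so the result does not depend on the locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_facets(d):
    """Build two flat facet values and one hierarchical path from a date."""
    return {
        # Flat, low cardinality: day number within the year.
        "dayOfYear": str(d.timetuple().tm_yday),
        # Flat, seven possible values; date.weekday() counts Monday as 0.
        "weekday": WEEKDAYS[d.weekday()],
        # Hierarchical: one path component per level (YYYY/MM/DD).
        "date_path": [f"{d.year:04d}", f"{d.month:02d}", f"{d.day:02d}"],
    }


facets = date_facets(date(2021, 3, 14))
print(facets)
```

The hierarchical value is kept as a list of components because a hierarchical facet is indexed as a path, one element per level, rather than as a single slash-joined string.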