luceneutil

Index Wikipedia's hierarchical categories and sub-categories as a FacetField

Open mikemccand opened this issue 3 years ago • 6 comments

@rmuir observed in this issue that Wikipedia already has labels/categories per page, and these labels have sub-categories, etc.

This would be another (in addition to the high cardinality but flat RandomLabel we recently added) great way of testing facet labels.

mikemccand avatar Dec 03 '21 17:12 mikemccand

another alternative would be to just do a simpler facet bench on geonames. maybe it is easier than wrestling wikipedia.

rmuir avatar Dec 03 '21 17:12 rmuir

I'll work on this issue, as I need it to benchmark the SSDV faceting changes I made. I think using the geonames dataset also makes more sense, as I believe we can no longer generate new wiki docs. Also, @rmuir pointed out that there is an obvious hierarchy in the geonames dataset.

mdmarshmallow avatar Dec 03 '21 19:12 mdmarshmallow

I took a look at several different datasets that we could use for this benchmark. I first looked at the geonames allCountries.txt file and experimented with a hierarchy of country-code/admin1/admin2/admin3/admin4. Unfortunately, although this file has 12M rows, there were only 385688 unique hierarchies and an average of 1 null value in each hierarchy. I also looked at OpenStreetMap, but extracting hierarchies from this dataset seemed like a hard and inefficient process, as detailed in this post. Finally, I looked at the National Address Database (NAD), which is maintained by the US DOT (it can be found here). It contains 65M rows of addresses in the United States. I got the following data for the NAD using a hierarchy of state/county/street_name/address_number:

Lines with nulls:  119359
Total lines processed:  65460372
Number of unique hierarchies:  53937396

Based on this, I think the NAD would be a good candidate for this benchmark, as it has very high cardinality and the data is populated for the most part.
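For reference, here is a minimal sketch of how counts like the ones above could be gathered. It is not the script actually used for these numbers. It assumes a CSV file with a header row, that the column names match the hierarchy levels (`state`, `county`, `street_name`, `address_number` are used here only for illustration), and that an empty field counts as a null. The helper name `hierarchy_stats` and the sample rows are made up.

```python
import csv
import io


def hierarchy_stats(rows, columns):
    """Count total rows, rows with an empty level, and unique hierarchy paths.

    `rows` is any iterable of dicts (e.g. a csv.DictReader);
    `columns` lists the hierarchy levels from top to bottom.
    """
    total = 0
    with_nulls = 0
    unique_paths = set()
    for row in rows:
        total += 1
        # Treat missing or blank fields as nulls (an assumption).
        path = tuple((row.get(col) or "").strip() for col in columns)
        if any(not level for level in path):
            with_nulls += 1
        unique_paths.add(path)
    return {"total": total, "with_nulls": with_nulls, "unique": len(unique_paths)}


# Tiny made-up sample, not real NAD data.
SAMPLE = """state,county,street_name,address_number
TX,Travis,Main St,100
TX,Travis,Main St,100
TX,Travis,Main St,102
CA,,Oak Ave,7
"""

LEVELS = ["state", "county", "street_name", "address_number"]
stats = hierarchy_stats(csv.DictReader(io.StringIO(SAMPLE)), LEVELS)
print("Lines with nulls: ", stats["with_nulls"])
print("Total lines processed: ", stats["total"])
print("Number of unique hierarchies: ", stats["unique"])
```

Holding every path in a set is fine for a small sample, but for tens of millions of rows it may need a lot of memory. Storing a fixed-size hash of each path instead is one possible trade-off.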

mdmarshmallow avatar Dec 08 '21 01:12 mdmarshmallow

ok, but i wasn't aware "very high cardinality" is the goal. Maybe that is what amazon has, but that's not what all users have. Some users have low-cardinality fields too, and faceting should be fast on those?

rmuir avatar Dec 08 '21 12:12 rmuir

That is true. We can then benchmark facets from both datasets separately, as a low-cardinality and a high-cardinality test.

mdmarshmallow avatar Dec 08 '21 17:12 mdmarshmallow

I think we have pretty good low cardinality facet fields. We have two flat ones (dayOfYear, with 365 unique values, and weekday with seven values), and one hierarchical YYYY/MM/DD.
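To illustrate the difference between these flat and hierarchical fields, here is a small sketch that derives all three values from a single date. It is not luceneutil's indexing code. The function name `date_facets`, the weekday labels, and the zero-padded path components are assumptions made for the example.

```python
from datetime import date

# Hypothetical labels; the benchmark may label weekdays differently.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def date_facets(d):
    """Return two flat facet values and one hierarchical path for a date."""
    return {
        # Flat: a single label per document.
        "dayOfYear": str(d.timetuple().tm_yday),
        "weekday": WEEKDAYS[d.weekday()],
        # Hierarchical: one path component per level (year, month, day).
        "date": [f"{d.year:04d}", f"{d.month:02d}", f"{d.day:02d}"],
    }


facets = date_facets(date(2021, 12, 19))
print(facets["dayOfYear"], facets["weekday"], "/".join(facets["date"]))
```

A flat field contributes one label per document, while the hierarchical field contributes a path whose prefixes (`2021`, `2021/12`) can each be counted as well.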

mikemccand avatar Dec 19 '21 22:12 mikemccand