
Chapter 9, removing 'fiModelDescriptor' instead of

Open richiethomas opened this issue 4 years ago • 0 comments

I wrote a post about this in the FastAI forums, and someone suggested that this might be an erratum in the book. If that is indeed the case, I'd be happy to take a crack at a PR to fix it; otherwise, I'm happy to stand corrected. :-)

Chapter 9 of the book, specifically the section dealing with the "Blue Book For Bulldozers" problem, mentions the multiple, potentially overlapping, high-cardinality categorical variables pertaining to the "model" concept for the tractor equipment to be auctioned. A list of several columns is displayed, which includes two columns with especially high cardinality, ModelID and fiModelDesc, both on the order of 5,000 discrete values. This is said to be potentially sub-optimal, since categorical columns with this many levels take up a relatively large amount of compute resources, and removing redundant columns is mentioned as one performance optimization we can attempt.

However, when we attempt to test the relative error rate when removing a "duplicate" column, the column we select for removal is not one of those two high-cardinality columns I mentioned. Instead, it's a relatively low-cardinality column called fiModelDescriptor, whose name sounds semantically similar to fiModelDesc but which is actually a separate column entirely with a cardinality of only 140, again according to the aforementioned list.
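To make the comparison concrete, here is a minimal sketch of how the cardinality of each "model" column could be checked with pandas. The tiny DataFrame and its values are made up for illustration and are not the real Bulldozers data; only the three column names come from the book's list.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented, only the column names
# match the ones discussed in this issue.
df = pd.DataFrame({
    "ModelID": [101, 102, 103, 104, 105, 106],
    "fiModelDesc": ["310G", "416C", "D6H", "950F", "D3B", "580K"],
    "fiModelDescriptor": ["LGP", "LGP", "XL", None, "XL", "LGP"],
})

candidates = ["ModelID", "fiModelDesc", "fiModelDescriptor"]

# nunique() counts distinct non-missing values per column, i.e. the cardinality.
cardinality = df[candidates].nunique().sort_values(ascending=False)
print(cardinality)
```

On the real dataset, the same `nunique()` call should reproduce the numbers in the book's list, with fiModelDescriptor landing far below the other two.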

My question is, was the removal of fiModelDescriptor instead of fiModelDesc intentional or accidental? If intentional, I'm unclear on the intuition here. I agree that fiModelDescriptor could also be a candidate for removal given my layperson's interpretation of the column names, but the 5k size of fiModelDesc makes it seem like relatively low-hanging fruit, if we're prioritizing columns to remove in order of their compute resource cost.
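One way to settle this empirically would be to drop each candidate column in turn and compare the out-of-bag (OOB) score against a baseline. The following is only a sketch under stated assumptions: the data is synthetic, the `get_oob` helper is my own stand-in rather than the book's exact function, and the hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def get_oob(xs, y):
    # Hypothetical helper: fit a random forest and return its OOB R^2 score.
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=5,
                              max_features=0.5, oob_score=True, random_state=0)
    m.fit(xs, y)
    return m.oob_score_

# Synthetic stand-in data (assumption): fiModelDesc duplicates ModelID,
# and fiModelDescriptor is a coarser, low-cardinality view of it.
rng = np.random.default_rng(0)
n = 2000
model_id = rng.integers(0, 200, size=n)
xs = pd.DataFrame({
    "ModelID": model_id,
    "fiModelDesc": model_id,
    "fiModelDescriptor": model_id % 10,
    "YearMade": rng.integers(1980, 2012, size=n),
})
y = 0.01 * model_id + 0.05 * (xs["YearMade"] - 1980) + rng.normal(0, 0.1, size=n)

candidates = ["ModelID", "fiModelDesc", "fiModelDescriptor"]

# Baseline first, then the score with each candidate column removed.
oob_scores = {"baseline": get_oob(xs, y)}
for col in candidates:
    oob_scores[col] = get_oob(xs.drop(columns=col), y)

for name, score in oob_scores.items():
    print(f"{name:>18}: {score:.4f}")
```

If dropping fiModelDesc leaves the score on the real data roughly unchanged, it would seem to be the more valuable removal given its size; if the score drops noticeably, that might explain why fiModelDescriptor was chosen instead.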

richiethomas, Jun 09 '21 18:06