Chapter 12: Alexis' sidebar about the "thousand" token is confusing

Open ElteHupkes opened this issue 4 years ago • 0 comments

This is a silly issue, but this insert from Alexis broke my brain while reading the book (I'm reading the version made of dead trees, but it's still there on master):

> A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at tokens reminded me that large numbers are written with many words, so on the way to 10,000 you write "thousand" a lot: five thousand, five thousand and one, five thousand and two, etc. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones.

In fact, Alexis' initial guess about the separator was correct: there is at most one instance of "thousand" in each of the 10,000 numbers (and none in the first 999), but there are 9999 separators. It is true that "thousand" is the most common token if you look only at the validation set, because the validation set consists of the numbers above 8000, so every one of its 1999 numbers includes "thousand", while there are only 1998 separators. Still, the sidebar's suggestion that looking at the data clearly rules out the separator is misleading.
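For anyone who wants to check the arithmetic, here is a rough sketch that does not use the actual dataset files. It assumes the numbers run from 1 to 9999, are joined with " . " as the separator, and are spelled without "and"; `to_words` and `token_counts` are helper names I made up for this illustration. If the real range differs by one, the totals shift by one, but the comparison comes out the same.

```python
from collections import Counter

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n):
    """Spell out 1 <= n <= 9999 as a list of word tokens (no 'and')."""
    words = []
    if n >= 1000:
        words += [ONES[n // 1000], "thousand"]
        n %= 1000
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n:
        words.append(ONES[n])
    return words

def token_counts(first, last):
    """Count tokens in the numbers first..last joined by a ' . ' separator."""
    text = " . ".join(" ".join(to_words(n)) for n in range(first, last + 1))
    return Counter(text.split())

# Assumed ranges: the whole set, and the numbers above 8000 as validation.
full = token_counts(1, 9999)
valid = token_counts(8001, 9999)

print(full["."], full["thousand"])    # 9998 9000
print(valid["."], valid["thousand"])  # 1998 1999
```

Under these assumptions the separator wins over the whole range, and "thousand" wins by exactly one in the validation range, because a run of N numbers only has N - 1 separators between them.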

Alright, so much for this tangent, I'm going back to reading now ;).

ElteHupkes avatar Feb 10 '21 11:02 ElteHupkes