Unexplained oddities in Human Numbers
Looking through 12_nlp_dive.ipynb, I noticed a couple of oddities that I think are worth explaining.
Missing numbers
First, why does a dataset containing "the first 10,000 numbers written out in English" have only 9,998 items?
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]
Looking at the files, it turns out the numbers "eight thousand" and "ten thousand" are missing:
$ wc -l train.txt
7999 train.txt
$ tail -1 train.txt
seven thousand nine hundred ninety nine
$ wc -l valid.txt
1999 valid.txt
$ head -1 valid.txt
eight thousand one
$ tail -1 valid.txt
nine thousand nine hundred ninety nine
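A quick sanity check in Python confirms this (a sketch, assuming train.txt and valid.txt are in the working directory, as downloaded by untar_data(URLs.HUMAN_NUMBERS)):
from pathlib import Path

lines = []
for name in ('train.txt', 'valid.txt'):
    lines += [l.strip() for l in Path(name).read_text().splitlines()]

len(lines)                   # 9998, not 10000
'eight thousand' in lines    # False: missing at the train/valid boundary
'ten thousand' in lines      # False: the final number is also missing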
Most common token intuition
Second, this explanation for thousand being the most common token is a bit misleading:
A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at tokens reminded me that large numbers are written with many words, so on the way to 10,000 you write "thousand" a lot: five thousand, five thousand and one, five thousand and two, etc. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones.
When looking at the full dataset, the separator is indeed the most common token:
> from collections import Counter
> Counter(tokens).most_common(5)
[('.', 9997),
('hundred', 9000),
('thousand', 8999),
('one', 2900),
('two', 2900)]
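The separator count is 9,997 rather than 9,998 because of how tokens is built: the lines are joined with ' . ', so a separator goes between numbers but not after the last one. Roughly (a sketch of the notebook's construction, assuming the standard untar_data(URLs.HUMAN_NUMBERS) path):
from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

text = ' . '.join([l.strip() for l in lines])   # '.' between numbers, none after the last
tokens = text.split(' ')
tokens.count('.'), len(lines)                   # (9997, 9998)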
The reason thousand ends up as the most common token when running the following cell is an artefact of two things: (1) the training-validation split, where the validation set covers the numbers 8,001-9,999, all of which contain thousand; and (2) the fact that no separator is added after the last number in the dataset.
n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    # tally how many times each vocab index appears as a target
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n
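The arithmetic behind (1) and (2) can be checked directly on valid.txt, which covers the same 8,001-9,999 range (a sketch, again assuming the file is in the working directory):
from collections import Counter

valid_lines = [l.strip() for l in open('valid.txt')]
valid_tokens = ' . '.join(valid_lines).split(' ')   # separator between numbers only
c = Counter(valid_tokens)
c['thousand'], c['.']   # (1999, 1998): thousand beats the separator by exactly one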
If the separator were added after the last number, it would tie thousand as the most common token in the validation set; and if the training-validation split were random, the separator would be more common than thousand. This may be worth mentioning in the notebook, since none of it falls into the "embarrassingly obvious" category. :slightly_smiling_face: