cnn-dailymail
Incorrectly formatted line in vocabulary file
Why is, for example, 0800 555 111 356 included in the generated vocab file? This example is at line 23163. Or is it just me who has this problem?
>>> with open('data/cnn-dailymail/vocab', 'r') as vocab_f:
...     for line in vocab_f:
...         pieces = line.split()
...         if len(pieces) != 2:
...             print(pieces)
...
['0800', '555', '111', '356']
['1800', '333', '000', '139']
['2', '1/2', '124']
['3', '1/2', '86']
['1', '1/2', '68']
['0800', '555111', '59']
['4', '1/2', '47']
['0844', '472', '4157', '41']
['5', '1/2', '39']
['7', '1/2', '25']
['6', '1/2', '24']
['9', '1/2', '21']
['020', '7629', '9161', '19']
['8', '1/2', '19']
['0300', '123', '8018', '19']
['0808', '800', '5000', '19']
['11', '1/2', '18']
['0844', '493', '0787', '14']
['1300', '659', '467', '13']
['16', '1/2', '12']
['13', '1/2', '12']
['1800', '273', '8255', '11']
['18', '1/2', '10']
['0300', '1234', '999', '10']
['0845', '790', '9090', '10']
['0845', '634', '1414', '9']
['14', '1/2', '8']
['0207', '938', '6364', '8']
['0207', '938', '6683', '8']
['310', '642', '2317', '7']
['at', 'uefa.com', '7']
['0207', '386', '0868', '7']
['0808', '800', '2222', '6']
['0800', '789', '321', '6']
['0800', '854', '440', '6']
That's intended, see normalizeSpace at https://nlp.stanford.edu/software/tokenizer.html. The tokenizer emits phone numbers (such as 0800 555 111) and numbers with fractions (such as 2 1/2) as a single token with non-breaking spaces in between. Not sure why at uefa.com is joined as well, but I get the same result as you.
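
If the extra pieces get in the way when reading the vocab, one possible workaround is to split each line on the last ordinary space only, since a bare split() in Python 3 also breaks on non-breaking spaces. This is a minimal sketch, not part of the repo: parse_vocab_line is a hypothetical helper, and it assumes each line is "<token> <count>" with one ASCII space before the count and U+00A0 inside multi-part tokens.

```python
NBSP = "\u00a0"  # non-breaking space, assumed to be what joins multi-part tokens


def parse_vocab_line(line):
    """Hypothetical helper: return (token, count) for one vocab line.

    Splits on the last ASCII space only, so non-breaking spaces inside
    the token are left alone.
    """
    token, count = line.rstrip("\n").rsplit(" ", 1)
    return token, int(count)


# Example line built by hand to mimic the phone-number entry above.
sample = "0800" + NBSP + "555" + NBSP + "111 356\n"

# A bare split() treats U+00A0 as whitespace, which gives four pieces.
print(sample.split())

# Splitting on the last ASCII space keeps the token in one piece.
token, count = parse_vocab_line(sample)
print(repr(token), count)
```

With this approach the phone number stays a single vocabulary entry, which matches how the tokenizer emitted it.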