Incorrectly formatted line in vocabulary file

Open: bwang482 opened this issue 6 years ago • 1 comment

Why is, for example, 0800 555 111 356 included in the generated vocab file? This entry is at line 23163. Or am I the only one who has this problem?

>>> with open('data/cnn-dailymail/vocab', 'r') as vocab_f:
...     for line in vocab_f:
...         pieces = line.split()
...         if len(pieces) != 2:
...             print(pieces)
... 
['0800', '555', '111', '356']
['1800', '333', '000', '139']
['2', '1/2', '124']
['3', '1/2', '86']
['1', '1/2', '68']
['0800', '555111', '59']
['4', '1/2', '47']
['0844', '472', '4157', '41']
['5', '1/2', '39']
['7', '1/2', '25']
['6', '1/2', '24']
['9', '1/2', '21']
['020', '7629', '9161', '19']
['8', '1/2', '19']
['0300', '123', '8018', '19']
['0808', '800', '5000', '19']
['11', '1/2', '18']
['0844', '493', '0787', '14']
['1300', '659', '467', '13']
['16', '1/2', '12']
['13', '1/2', '12']
['1800', '273', '8255', '11']
['18', '1/2', '10']
['0300', '1234', '999', '10']
['0845', '790', '9090', '10']
['0845', '634', '1414', '9']
['14', '1/2', '8']
['0207', '938', '6364', '8']
['0207', '938', '6683', '8']
['310', '642', '2317', '7']
['at', 'uefa.com', '7']
['0207', '386', '0868', '7']
['0808', '800', '2222', '6']
['0800', '789', '321', '6']
['0800', '854', '440', '6']

bwang482, Oct 05 '17 00:10

That's intended; see the normalizeSpace option at https://nlp.stanford.edu/software/tokenizer.html. The tokenizer emits phone numbers (such as 0800 555 111) and numbers with fractions (such as 2 1/2) as a single token with non-breaking spaces in between. I'm not sure why at uefa.com is joined as well, but I get the same result as you.
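
If the goal is to read each line back as one token plus one count, a minimal sketch follows. It assumes Python 3, where a bare split() treats the non-breaking space (U+00A0) as whitespace, and it assumes the token and its count are separated by a single ordinary ASCII space at the end of the line. The helper name parse_vocab_line and the sample line are made up for illustration and are not part of the repository.

```python
def parse_vocab_line(line):
    """Return (token, count), splitting only on the last ASCII space.

    Hypothetical helper: assumes the count is the final field and is
    separated from the token by a plain ' ' (U+0020), not by U+00A0.
    """
    token, count = line.rstrip("\n").rsplit(" ", 1)
    return token, int(count)


# Illustrative line: a phone number joined by non-breaking spaces, then a count.
sample = "0800\u00a0555\u00a0111 356\n"

# A bare split() breaks on U+00A0 as well, so it yields four pieces.
print(sample.split())

# Splitting on the last ASCII space keeps the multi-part token intact.
print(parse_vocab_line(sample))
```

With this approach the non-breaking spaces stay inside the token, so lines like the ones listed in the issue would each yield exactly two fields.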

f0k, Oct 20 '17 16:10