
Meta tokenization and tokenization differ for strings of the form "SPACY_EXCEPTION-<anything>"

Open bhoov opened this issue 5 years ago • 0 comments

Tokenization is perfectly aligned for many English sentences, but breaks whenever a SPACY_EXCEPTION is part of a larger, hyphenated word.

For example, "whatve-you-dont" produces two different tokenizations:

alnr = BertAligner.from_pretrained('bert-base-uncased')
s = "whatve-you-dont"
alnr.tokenize(s) # => ['what', '##ve', '-', 'you', '-', 'don', '##t']
[t.token for t in alnr.meta_tokenize(s)] # => ['what', 'have', '-', 'you', '-', 'do', 'not']
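To make the divergence easier to spot, here is a minimal sketch that compares the two outputs position by position. It does not call the library; it works on the two lists copied from the example above. The helper names `strip_wordpiece` and `find_mismatches` are hypothetical and only for illustration.

```python
def strip_wordpiece(token: str) -> str:
    """Drop BERT's '##' continuation prefix so pieces compare as plain text."""
    return token[2:] if token.startswith("##") else token


def find_mismatches(tokens, meta_tokens):
    """Return (index, token, meta_token) for each position where the lists disagree.

    Hypothetical helper: assumes both lists have the same length, as they do
    in this example. zip() silently stops at the shorter list otherwise.
    """
    mismatches = []
    for i, (tok, meta) in enumerate(zip(tokens, meta_tokens)):
        if strip_wordpiece(tok) != strip_wordpiece(meta):
            mismatches.append((i, tok, meta))
    return mismatches


# Outputs copied from the example above
tokens = ['what', '##ve', '-', 'you', '-', 'don', '##t']
meta_tokens = ['what', 'have', '-', 'you', '-', 'do', 'not']

print(find_mismatches(tokens, meta_tokens))
# => [(1, '##ve', 'have'), (5, 'don', 'do'), (6, '##t', 'not')]
```

The mismatches fall exactly on the pieces that come from "whatve" and "dont", which are the parts that get expanded in the meta tokenization.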

In practice, this is not a huge problem, but it is worth acknowledging.

This is caused by (and can therefore be fixed in) the doc_to_fixed_tokens function at aligner.doc_to_fixed_tokens#L18

bhoov avatar Jan 20 '20 13:01 bhoov