spacyface
Meta tokenization and tokenization differ for strings of the form "SPACY_EXCEPTION-<anything>"
Tokenization is perfectly aligned for many English sentences, but breaks whenever a spaCy tokenizer exception is part of a larger, hyphenated word.
For example, "whatve-you-dont" would produce two different tokenizations:
from spacyface import BertAligner

alnr = BertAligner.from_pretrained('bert-base-uncased')
s = "whatve-you-dont"
alnr.tokenize(s) # => ['what', '##ve', '-', 'you', '-', 'don', '##t']
[t.token for t in alnr.meta_tokenize(s)] # => ['what', 'have', '-', 'you', '-', 'do', 'not']
In practice, this is not a huge problem, but it is worth acknowledging.
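The divergence can be reproduced in isolation with a toy sketch (this is not spacyface's actual code; the exception map and helper names below are hypothetical, illustrative stand-ins for spaCy's tokenizer exceptions):

```python
# Hypothetical exception map in the spirit of spaCy's tokenizer exceptions.
EXCEPTIONS = {"whatve": ["what", "have"], "dont": ["do", "not"]}

def hyphen_split(s):
    """Split a hyphenated word into its parts, keeping the hyphens."""
    out = []
    for chunk in s.split("-"):
        out.append(chunk)
        out.append("-")
    return out[:-1]  # drop the trailing hyphen

def expand(tokens):
    """Expand any token that is a known exception into its full words."""
    expanded = []
    for t in tokens:
        expanded.extend(EXCEPTIONS.get(t, [t]))
    return expanded

surface = hyphen_split("whatve-you-dont")
# ['whatve', '-', 'you', '-', 'dont']
meta = expand(surface)
# ['what', 'have', '-', 'you', '-', 'do', 'not']

# The expansion changes the surface forms ('whatve' -> 'what', 'have'),
# so the meta tokens no longer line up 1:1 with the subword tokens
# ['what', '##ve', ...], which are produced from the raw string instead.
```

The sketch shows why the two views drift apart: the subword tokenizer works on the raw hyphen-split text, while the meta tokenizer rewrites exception tokens into their expanded word forms first.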
This is caused by (and can therefore be fixed in) the doc_to_fixed_tokens
function at aligner.doc_to_fixed_tokens#L18