spacy-transformers
Support offset mapping alignment for fast tokenizers
Switch to offset mapping-based alignment for fast tokenizers. With this change, slow vs. fast tokenizers will not give identical results with spacy-transformers.
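To make the offset-mapping idea concrete, here is a minimal sketch (not the actual implementation in this PR) of aligning fast-tokenizer wordpieces to spaCy tokens by overlapping character spans; the helper name `align_by_offsets` and the overlap logic are purely illustrative.

```python
# Minimal sketch, not this PR's code: align fast-tokenizer wordpieces to
# spaCy tokens by overlapping character spans from the offset mapping.
# `align_by_offsets` is an illustrative name.
from typing import List, Tuple

import spacy
from transformers import AutoTokenizer


def align_by_offsets(doc, offsets: List[Tuple[int, int]]) -> List[List[int]]:
    """For each spaCy token, collect the indices of wordpieces whose
    character spans overlap the token's character span."""
    alignment = [[] for _ in doc]
    for wp_idx, (start, end) in enumerate(offsets):
        if start == end:  # special tokens like <s> have empty (0, 0) spans
            continue
        for token in doc:
            tok_start, tok_end = token.idx, token.idx + len(token.text)
            if start < tok_end and end > tok_start:  # character spans overlap
                alignment[token.i].append(wp_idx)
    return alignment


nlp = spacy.blank("en")
doc = nlp("Offset mappings only exist for fast tokenizers.")
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
encoded = tokenizer(doc.text, return_offsets_mapping=True)
print(align_by_offsets(doc, encoded["offset_mapping"]))
```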
Additional modifications:
- Update package setup for Cython
- Update CI for compiled package
Concerns:
- ~~`use_fast` should be saved correctly in the tokenizer settings (as a separate PR, #339)~~
- ~~naming of `_align` and new methods~~
- whether we (initially) want an automatic backoff to the old alignment if the new alignment fails in some cases (see the sketch after this list)
- further speed improvements
  - in particular I suspect this should be improved: https://github.com/explosion/spacy-transformers/pull/338/files#diff-f767b104e773bafeb21b7870c06043e8cebea5c4fce193b171b737192d589f62R90
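For the backoff concern, a rough sketch of what a fallback could look like, reusing the hypothetical `align_by_offsets` helper from the sketch above and the character-based alignment from `spacy-alignments`; the function names and exception handling are assumptions, not what this PR does.

```python
# Rough sketch of the "automatic backoff" idea: try the offset-based
# alignment first, fall back to character-based spacy-alignments if it
# fails. Names and error handling are illustrative only.
from spacy_alignments import get_alignments


def align_with_backoff(doc, offsets, wordpiece_texts):
    try:
        # hypothetical offset-based helper from the earlier sketch
        return align_by_offsets(doc, offsets)
    except Exception:
        # get_alignments(a, b) returns (a2b, b2a) index mappings computed
        # from the token strings rather than from character offsets
        spacy2wp, _ = get_alignments([t.text for t in doc], wordpiece_texts)
        return spacy2wp
```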
Yes, this would need to be v1.2 because it will change the behavior of trained models.
In terms of model performance and speed, this implementation appears to be on par with the existing spacy-alignments alignment. In general I think it should be able to be a little bit faster; my first guess (as mentioned above) is now here in the renamed version:
https://github.com/explosion/spacy-transformers/pull/338/files#diff-b326f7a5839c4589ffcac094a207251f14e6235d150b35a7ae07459776bcc19aR104
I'm trying to find a reasonable test case where the actual alignments would make a difference, but most things I've tried so far were a wash. The main known cases are roberta/gpt2 with the bytes_to_unicode mapping. At least users looking at the alignments would be less concerned about it?
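For reference, a small snippet illustrating the roberta/gpt2 case: the byte-level BPE token strings are in the bytes_to_unicode alphabet and don't match the surface text, while the fast tokenizer's offset mapping still points at the correct character spans. This is illustrative only, not part of the PR.

```python
# Illustration only: with GPT-2/RoBERTa byte-level BPE, the token strings
# are in the bytes_to_unicode alphabet and don't match the surface text,
# but the fast tokenizer's offset mapping still gives correct char spans.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
text = "naïve café"
encoded = tokenizer(text, return_offsets_mapping=True)
for tok, (start, end) in zip(
    tokenizer.convert_ids_to_tokens(encoded["input_ids"]),
    encoded["offset_mapping"],
):
    # a token string like "Ã¯" may map back to the surface character "ï"
    print(repr(tok), "->", repr(text[start:end]))
```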