
Support offset mapping alignment for fast tokenizers

Open · adrianeboyd opened this issue 3 years ago

Switch to offset-mapping-based alignment for fast tokenizers. With this change, slow vs. fast tokenizers will no longer give identical results with spacy-transformers.
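
As a rough sketch of the idea (not the actual implementation in the PR), a fast Hugging Face tokenizer can return a character offset for each wordpiece via `return_offsets_mapping=True`, and those offsets can be matched against the character spans of the spaCy tokens. The model name and the simple overlap test below are assumptions for illustration only:

```python
# Minimal sketch of offset-mapping-based alignment, assuming a fast tokenizer
# and a simple span-overlap heuristic; this is not the spacy-transformers code.
import spacy
from transformers import AutoTokenizer

nlp = spacy.blank("en")
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

doc = nlp("spacy-transformers aligns wordpieces to tokens")
encoded = tokenizer(doc.text, return_offsets_mapping=True)

# For each spaCy token, collect the wordpieces whose character spans overlap it.
alignment = []
for token in doc:
    t_start, t_end = token.idx, token.idx + len(token)
    wp_idxs = [
        i
        for i, (w_start, w_end) in enumerate(encoded["offset_mapping"])
        if w_start != w_end  # special tokens have empty (0, 0) spans
        and max(t_start, w_start) < min(t_end, w_end)  # character spans overlap
    ]
    alignment.append(wp_idxs)

print(list(zip([t.text for t in doc], alignment)))
```

A real implementation also has to handle special tokens, truncation, and overlapping strided spans, which this sketch ignores.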

Additional modifications:

  • Update package setup for Cython
  • Update CI for the compiled package

adrianeboyd · Jul 26 '22

Concerns:

  • ~~use_fast should be saved correctly in the tokenizer settings (as a separate PR, #339)~~
  • ~~naming of _align and new methods~~
  • whether we (initially) want an automatic backoff to the old alignment if the new alignment fails in some cases (a rough sketch of this idea follows the list below)
  • further speed improvements
    • in particular I suspect this should be improved: https://github.com/explosion/spacy-transformers/pull/338/files#diff-f767b104e773bafeb21b7870c06043e8cebea5c4fce193b171b737192d589f62R90
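
On the backoff question, one hypothetical shape for it is sketched below; the function name, its arguments, and the failure check are placeholders rather than anything from the PR, and the fallback path uses `spacy_alignments.get_alignments`, which the existing alignment is based on:

```python
# Hypothetical sketch of an automatic backoff: try the offset-based alignment
# first and fall back to the existing spacy-alignments alignment if it fails.
# align_with_backoff and the failure condition are placeholders, not PR internals.
from spacy_alignments import get_alignments


def align_with_backoff(spacy_tokens, spacy_offsets, wordpieces, wp_offsets):
    """Map each spaCy token to a list of wordpiece indices."""
    try:
        alignment = []
        for t_start, t_end in spacy_offsets:
            wp_idxs = [
                i
                for i, (w_start, w_end) in enumerate(wp_offsets)
                if w_start != w_end and max(t_start, w_start) < min(t_end, w_end)
            ]
            if not wp_idxs and t_start != t_end:
                # A non-empty token with no overlapping wordpieces: treat this
                # as an offset-alignment failure and trigger the backoff.
                raise ValueError("offset-based alignment failed")
            alignment.append(wp_idxs)
        return alignment
    except ValueError:
        # Old-style alignment: compare the two token string sequences directly.
        spacy2wp, _ = get_alignments(spacy_tokens, wordpieces)
        return spacy2wp
```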

adrianeboyd · Jul 26 '22

Yes, this would need to be v1.2 because it will change the behavior of trained models.

adrianeboyd · Aug 12 '22

In terms of model performance and speed, this implementation appears to be on par with the existing spacy-alignments alignment. In general I think it should be able to be a little bit faster; my first guess (as mentioned above) is now here in the renamed version:

https://github.com/explosion/spacy-transformers/pull/338/files#diff-b326f7a5839c4589ffcac094a207251f14e6235d150b35a7ae07459776bcc19aR104

I'm trying to find a reasonable test case where the actual alignments would make a difference, but most things I've tried so far were a wash. The main known cases are roberta/gpt2 with the bytes_to_unicode mapping. At least users looking at the alignments would be less concerned about it?
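
For context on the roberta/gpt2 case: the byte-level BPE first maps every UTF-8 byte to a printable stand-in character, so the wordpiece strings for non-ASCII text are longer than the original characters, which is exactly where string-based and offset-based alignments can diverge. A small illustration (the import path assumes a recent transformers release):

```python
# Illustration of the bytes_to_unicode mapping used by the gpt2/roberta
# tokenizers; the import path assumes a recent version of transformers.
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_encoder = bytes_to_unicode()  # dict: byte value (0-255) -> stand-in unicode char

text = "naïve"
mapped = "".join(byte_encoder[b] for b in text.encode("utf-8"))

# "ï" is two UTF-8 bytes, so it becomes two stand-in characters: the mapped
# string no longer lines up one-to-one with the original characters.
print(len(text), len(mapped))  # 5 6
print(mapped)                  # e.g. "naÃ¯ve"
```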

adrianeboyd · Aug 25 '22