transformers icon indicating copy to clipboard operation
transformers copied to clipboard

:rotating_light: :rotating_light: :rotating_light: Fixing BPE spm converter.

Open Narsil opened this issue 1 year ago • 2 comments

What does this PR do?

The spm BPE converter seemed to have been wrong (for quite a while if true). The merges are recreated from the vocab, but where ordered by their vocab id instead of the score within spm vocab. It seems to be wrong for Llama.

This PR fixes it.

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Narsil avatar Mar 23 '23 23:03 Narsil

The documentation is not available anymore as the PR was closed or merged.

@sgugger

At least reformer and camembert are concerned (some test failed when writing bogus code here.)

Narsil avatar Mar 23 '23 23:03 Narsil

IT's breaking and breaking reformer and xlnet, I removed the breaking part of it for Llama.

Narsil avatar Apr 04 '23 14:04 Narsil