bumblebee
Add M2M100
WIP: Fixing some tests
Looks like M2M100 doesn't have a fast tokenizer variant (i.e. one that delegates to huggingface/tokenizers).
Hey guys, how is it going? Any progress on this PR? I would love to help with this feature, as I want to use it in my project. Please let me know if there is some way I could help 😄
Hey @marinac-dev! It's been a while, so we need to rewrite the PR, but that's the easy part. The blocker is the lack of a fast tokenizer for M2M100 in huggingface/tokenizers, which currently doesn't support it. According to https://github.com/huggingface/transformers/pull/10236#issuecomment-791986885 it was planned, but it seems they didn't get to it.
Not sure if this helps: there's also https://github.com/guillaume-be/rust-tokenizers, which has Rust tokenizer implementations for both M2M100 and Marian.
There's still no fast tokenizer in hf/transformers. We can always revisit if applicable, but I'm going to close this. There are most likely better models already anyway.