Add M2M100

Open seanmor5 opened this issue 2 years ago • 1 comments

WIP: Fixing some tests

seanmor5 avatar Aug 18 '22 15:08 seanmor5

Looks like M2M100 doesn't have a fast variant (i.e. delegating to huggingface/tokenizers).

jonatanklosko avatar Sep 05 '22 14:09 jonatanklosko

Hey guys, how is it going? Any progress on this PR? I would love to help with this feature, as I want to use it in my project. Please let me know if there is a way I could help 😄

marinac-dev avatar Mar 06 '23 16:03 marinac-dev

Hey @marinac-dev! It's been a while, so we need to rewrite the PR, but that's the easy part. The blocker is the fast tokenizer for M2M100 in huggingface/tokenizers, which they currently don't support. According to https://github.com/huggingface/transformers/pull/10236#issuecomment-791986885 it was planned, but it seems they didn't get to it.

jonatanklosko avatar Mar 06 '23 19:03 jonatanklosko

Not sure if this helps: there's also https://github.com/guillaume-be/rust-tokenizers, which has a Rust tokenizer implementation for both M2M100 and Marian.

SteffenDE avatar Jul 25 '23 11:07 SteffenDE

There's still no fast tokenizer in hf/transformers. We can always revisit if applicable, but I'm going to close this. There are most likely better models already anyway.

jonatanklosko avatar Feb 27 '24 11:02 jonatanklosko