Anthony MOI

Results: 33 comments by Anthony MOI

Indeed, there is no support for binary data. "Byte-level" here means that Unicode text is processed as its encoded bytes rather than as Unicode code points.
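To illustrate the distinction between code points and bytes, here is a minimal sketch in plain Python. It does not use the library itself; the helper names `code_points` and `utf8_bytes` are hypothetical, and UTF-8 is assumed as the encoding.

```python
def code_points(text: str) -> list[int]:
    """Return the Unicode code point of each character in the text."""
    return [ord(ch) for ch in text]


def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 encoded form of the text as integers in 0..255."""
    return list(text.encode("utf-8"))


# "héllo 🙂" written with escapes so the exact code points are unambiguous.
sample = "h\u00e9llo \U0001F642"

print(len(code_points(sample)))  # 7 code points
print(len(utf8_bytes(sample)))   # 11 bytes: "é" takes 2 bytes, the emoji takes 4
```

The input is still text in both cases. Only the unit of processing changes, which is why arbitrary binary data is a separate matter.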

Hi @bfelbo You are right about this. The `WordPiece` training algorithm has never been publicly released though, only the tokenization part of it. In the various papers talking about it,...

Hi @mandubian. Unfortunately, I'm not sure I entirely understand what you would like to do. Can you be more specific, and provide an example of what you are trying to...

Hey @seyyaw, @taesiri. TL;DR: This is how the byte-level BPE works. Main advantages are:
- Smaller vocabularies
- No unknown token

This is totally expected behavior. The byte-level BPE converts...
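The "no unknown token" point can be sketched in plain Python. This is only an illustration of the base alphabet, not the library's BPE implementation (no merges are learned here); the helper `unknown_symbols` and the toy corpus are hypothetical.

```python
def unknown_symbols(text, alphabet, as_bytes):
    """List the symbols of `text` that are missing from `alphabet`.

    With as_bytes=True the symbols are UTF-8 byte values (ints),
    otherwise they are individual characters.
    """
    symbols = text.encode("utf-8") if as_bytes else text
    return [s for s in symbols if s not in alphabet]


corpus = "hello world"
char_alphabet = set(corpus)      # only the characters seen in the corpus
byte_alphabet = set(range(256))  # every possible byte value

unseen = "h\u00e9llo"  # contains "é", which never appears in the corpus

print(unknown_symbols(unseen, char_alphabet, as_bytes=False))  # ['é']
print(unknown_symbols(unseen, byte_alphabet, as_bytes=True))   # []
```

A base alphabet of 256 byte values covers any input text, so nothing has to fall back to an unknown token, and the base vocabulary stays small compared to one that lists every Unicode character.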

I'd say the main reason is that the main path we imagined for truncation and padding is to use `with_truncation` and `with_padding` on the `Tokenizer` directly. By doing so, the...
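For readers unfamiliar with the two operations, here is a minimal plain-Python sketch of what truncation and padding do to a batch of token ids. It is not the library's implementation and does not reproduce the parameters of `with_truncation` or `with_padding`; the helper `truncate_and_pad`, the `max_length` argument, and the `pad_id` value are all hypothetical.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Cut each sequence to `max_length`, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    # Pad to the longest sequence in the batch (0 if the batch is empty).
    target = max((len(ids) for ids in truncated), default=0)
    return [ids + [pad_id] * (target - len(ids)) for ids in truncated]


batch = [[1, 2, 3, 4, 5], [6, 7]]
print(truncate_and_pad(batch, max_length=4))  # [[1, 2, 3, 4], [6, 7, 0, 0]]
```

Configuring both steps once on the tokenizer means every encoded batch comes out with a uniform shape, without a separate post-processing pass like the one above.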

Another thing to consider is that the README is now completely out of sync with the latest version on crates.io. We haven't released a Rust version for quite some time...

Sure, being able to test what's in the README would help a lot! Everything that is related to the Rust documentation can fit in this issue, but feel free to...

Indeed, this is a known limitation of the library we use to show progress. Unfortunately, it won't show anything if not attached to a "real" terminal, so it does not...

Well, that's really weird. Such an error originating in `enable_truncation` seems very unlikely, so I'm confused. Having a way to reproduce this would be ideal, but otherwise, if you can provide...

Thank you very much @severinsimmler, this is very helpful. We can keep the issue open here since it is mostly related to this project, no worries! I was not able...