### Current state
The `AddedVocabulary` adds new tokens on top of the `Model`, making the following assumption: "The Model will never change". So, this makes a few things impossible:
- ...
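
For context, here is a minimal sketch of how added tokens currently sit on top of the model's vocabulary, using the Python bindings (the tokenizer file name is borrowed from the training snippet further down and is purely illustrative):

```python
from tokenizers import Tokenizer

# Load an existing tokenizer (illustrative file name)
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# Added tokens receive ids stacked on top of the model's vocabulary,
# under the assumption that the model's vocabulary never changes.
num_added = tokenizer.add_tokens(["[NEW_TOKEN]"])
print(num_added)                             # 1
print(tokenizer.token_to_id("[NEW_TOKEN]"))  # id right after the model's vocab
```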
It is currently impossible to serialize custom Python components, so if a `Tokenizer` embeds some of them, the user can't save it. I didn't really dig into this so I don't...
When there is a `` in the trigger, it seems to trigger even when the `VARIABLE` is actually `undefined`. So when we are using something like this: ``` + [*]...
The main README is completely out-of-date. We also want to provide documentation in order to prepare for the crate release. This covers the Rust documentation only, not the bindings.
### Current state
When we want to train a Tokenizer, we need to give it a `Trainer` initialized with a set of custom parameters:

```python
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
# We need...
```
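
Since the snippet above is cut off, here is a rough sketch of what the current flow looks like end to end (the trainer parameters and file paths are assumptions made for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# We need to rebuild a Trainer by hand, with parameters matching the tokenizer
trainer = BpeTrainer(
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
```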
This issue is here to keep track of the different subjects around training (a sketch of training from memory follows the list).
- [x] Ability to train from memory (#198)
- [ ] Ability to re-train a Tokenizer with...
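
For the "train from memory" item, recent versions of the Python bindings allow something along these lines (the corpus contents and trainer parameters are made up for the example):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# In-memory corpus instead of files on disk
corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1_000, special_tokens=["[UNK]"])

# Train directly from any iterator of strings
tokenizer.train_from_iterator(corpus, trainer=trainer)
```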
Currently, the deployment script generates the documentation for the right commit/tag, but it uses the `tokenizers` package built from the latest version of `master`. This actually generates the wrong Python...
If we don't set a `PreTokenizer`, the BPE algorithm will run on non-segmented text. This usually works fine (just slower) when processing files with short lines since this gives some...
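
As an illustration, attaching a pre-tokenizer so that BPE only ever sees word-level segments looks roughly like this (the choice of `Whitespace` is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Without this, the BPE model and trainer operate on whole, non-segmented lines
tokenizer.pre_tokenizer = Whitespace()
```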
Whenever we instantiate a new `PostProcessor` (as of writing this issue, one of `BertProcessing`, `RobertaProcessing` or `TemplateProcessing`), we specify pairs of `token: String` and `id: u32` for the special tokens that...
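
For reference, this is roughly what specifying those pairs looks like with `TemplateProcessing` in the Python bindings (the template and the hard-coded ids are purely illustrative):

```python
from tokenizers.processors import TemplateProcessing

# Each special token is given as a (token, id) pair, matching the
# `token: String` / `id: u32` pairs mentioned above
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # illustrative ids
)
```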