### Current state
The `AddedVocabulary` adds new tokens on top of the `Model`, making the following assumption: "The Model will never change". So, this makes a few things impossible:
- ...
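
For context, here is a minimal sketch of how added tokens currently sit on top of the model's vocabulary, using the Python bindings (the tokenizer file name is borrowed from the training snippet further down and is purely illustrative):

```python
from tokenizers import Tokenizer

# Load an existing tokenizer (illustrative file name)
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# Added tokens receive ids stacked on top of the model's vocabulary,
# under the assumption that the model's vocabulary never changes.
num_added = tokenizer.add_tokens(["[NEW_TOKEN]"])
print(num_added)                             # 1
print(tokenizer.token_to_id("[NEW_TOKEN]"))  # id right after the model's vocab
```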
It is currently impossible to serialize custom Python components, so if a `Tokenizer` embeds some of them, the user can't save it. I didn't really dig into this so I don't...
When there is a `` in the trigger, it seems to trigger even when the `VARIABLE` is actually `undefined`. So when we are using something like this: ``` + [*]...
The main README is completely out-of-date. We also want to provide documentation in order to prepare for the crate release. This covers the Rust documentation only, not the bindings.
### Current state
When we want to train a Tokenizer, we need to give it a `Trainer` initialized with a set of custom parameters:

```python
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
# We need...
```
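
Since the snippet above is cut off, here is a rough sketch of what the current flow looks like end to end (the trainer parameters and file paths are assumptions made for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# We need to rebuild a Trainer by hand, with parameters matching the tokenizer
trainer = BpeTrainer(
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
```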
This issue is here to keep track of the different subjects around training (a sketch of training from memory follows the list).
- [x] Ability to train from memory (#198)
- [ ] Ability to re-train a Tokenizer with...
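
For the "train from memory" item, recent versions of the Python bindings allow something along these lines (the corpus contents and trainer parameters are made up for the example):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# In-memory corpus instead of files on disk
corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1_000, special_tokens=["[UNK]"])

# Train directly from any iterator of strings
tokenizer.train_from_iterator(corpus, trainer=trainer)
```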
Currently, the deployment script generates the documentation for the right commit/tag, but it uses the `tokenizers` package built from the latest version of `master`. This actually generates the wrong Python...
If we don't set a `PreTokenizer`, the BPE algorithm will run on non-segmented text. This usually works fine (just slower) when processing files with short lines since this gives some...
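
As an illustration, attaching a pre-tokenizer so that BPE only ever sees word-level segments looks roughly like this (the choice of `Whitespace` is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Without this, the BPE model and trainer operate on whole, non-segmented lines
tokenizer.pre_tokenizer = Whitespace()
```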
Whenever we instantiate a new `PostProcessor` (as of writing this issue, one of `BertProcessing`, `RobertaProcessing` or `TemplateProcessing`), we specify pairs of `token: String` and `id: u32` for the special tokens that...
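
For reference, this is roughly what specifying those pairs looks like with `TemplateProcessing` in the Python bindings (the template and the hard-coded ids are purely illustrative):

```python
from tokenizers.processors import TemplateProcessing

# Each special token is given as a (token, id) pair, matching the
# `token: String` / `id: u32` pairs mentioned above
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # illustrative ids
)
```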