Patrice Lopez

Results 601 comments of Patrice Lopez

Hello! AFAIK RoBERTa and its BPE tokenizer are working well in my tests with transformers **4.25.1**, but I think not anymore with version 4.15.0 (but it used to...

I tested with the current DeLFT version and `transformers==4.25.1`; as expected, no issue:

```
> python3 delft/applications/nerTagger.py train_eval --dataset-type conll2003 --architecture BERT --transformer roberta-base
Loading CoNLL 2003 data...
14041 train sequences...
```

> If the pair does not start, the code is unclear; I don't understand why adding ``

We assign an empty label to the subtokens added by the tokenizer. This...
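The label-alignment step mentioned above can be sketched in plain Python (a sketch only, not DeLFT's actual code; the `word_ids` list is a hypothetical example of what a HuggingFace fast tokenizer would return, and `"<PAD>"` is a placeholder for the empty label):

```python
def align_labels(word_labels, word_ids, empty_label="<PAD>"):
    """Keep the real label on the first subtoken of each word;
    give the empty label to special tokens and continuation subtokens."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:                 # special tokens like <s>, </s>
            aligned.append(empty_label)
        elif wid != previous:           # first subtoken of a word
            aligned.append(word_labels[wid])
        else:                           # continuation subtoken added by the tokenizer
            aligned.append(empty_label)
        previous = wid
    return aligned

# hypothetical word_ids: 'Troadec' split into two subtokens (word id 0 repeated)
word_ids = [None, 0, 0, 1, 2, 3, None]
labels = ["B-PER", "O", "I-PER", "O"]
print(align_labels(labels, word_ids))
# ['<PAD>', 'B-PER', '<PAD>', 'O', 'I-PER', 'O', '<PAD>']
```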

```
Should we rather use is_split_into_words=False and a concatenated input, as you suggest?
bla bla
But it could work I suppose with more motivation :)
```

Actually I just...

@lfoppiano This RoBERTa model raises encoding issues because its tokenizer is not loaded properly. I could reproduce the error with:

```console
(env) lopez@trainer:~/delft$ python3 delft/applications/grobidTagger.py citation train_eval --architecture BERT_CRF --transformer...
```

By adding the model path info in the `resource_registry.json` file:

```json
"transformers": [
    {
        "name": "matbert-pedro-scicorpus-20000-vocab_100k",
        "model_dir": "/media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k"
    }
],
```

the model is, I think, loaded correctly with...

So, to summarize, there are still two problems:

1) Apparently the vocabulary of the model `matbert-pedro-scicorpus-20000-vocab_100k` is not loaded/considered, because "É" and "ĠÉ" are present in `tokenizer.json` but their encoding...
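Problem 1 can be checked by hand by reading the vocabulary straight out of `tokenizer.json`, which in the standard HuggingFace fast-tokenizer layout sits under `{"model": {"vocab": {token: id}}}`. A minimal sketch, with a tiny inline stand-in replacing the real file (the ids 136 and 70434 are the ones observed above):

```python
import json

# Stand-in for the real tokenizer.json of the model; only the two
# entries of interest are kept here for illustration.
tokenizer_json = json.loads("""
{"model": {"vocab": {"É": 136, "ĠÉ": 70434}}}
""")

vocab = tokenizer_json["model"]["vocab"]
for tok in ("É", "ĠÉ"):
    print(tok, vocab.get(tok, "<not in vocab>"))
# É 136
# ĠÉ 70434
```

If both forms are present with valid ids, the vocabulary file itself is fine and the mismatch must come from how the tokenizer is loaded or how the input is pre-tokenized.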

I believe I fixed problem 2 above with https://github.com/kermitt2/delft/pull/154. This will fix the error for out-of-vocabulary characters that can appear from time to time, also for other sentencepiece...

Indeed it seems correctly loaded, I also have:

```
self.tokenizer.vocab['É']
136
self.tokenizer.vocab['ĠÉ']
70434
```

But then still when tokenizing:

```
input: ['Troadec', ',', 'É', '.']
encode: [(0, 0), (0, 3),...
```

If we pass `['Troadec', ',', 'aī', '.']`, the token is `Ġaī`, which is also not in the vocabulary, so we fall back to the lower byte level to match the...
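The byte-level fallback mentioned above can be illustrated with a minimal re-implementation of the GPT-2/RoBERTa byte-to-unicode table (a sketch of the mechanism, not DeLFT code): every UTF-8 byte is mapped to a printable unicode character, so any out-of-vocabulary character can still be represented as a sequence of byte tokens.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    following the GPT-2 byte-level BPE scheme."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable bytes get shifted above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()

# 'É' is two UTF-8 bytes (0xC3 0x89); at the byte level it becomes 'Ãī',
# which is what the BPE merges are then matched against.
print("".join(table[b] for b in "É".encode("utf-8")))  # Ãī
# a leading space byte becomes 'Ġ', hence the 'Ġ'-prefixed tokens like 'ĠÉ'
print(table[ord(" ")])  # Ġ
```

This is why a character that looks present in the vocabulary can still end up matched at the byte level: what the BPE actually sees is the byte-mapped form, not the raw character.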