Patrice Lopez

Results 601 comments of Patrice Lopez

Hello! AFAIK RoBERTa and its BPE tokenizer are working well in my tests with transformers **4.25.1**, but I think not anymore with version 4.15.0 (but it used to...

I tested with the current DeLFT version and `transformers==4.25.1`; as expected, no issue:

```
> python3 delft/applications/nerTagger.py train_eval --dataset-type conll2003 --architecture BERT --transformer roberta-base
Loading CoNLL 2003 data...
14041 train sequences...
```

> If the pair does not start, the code is unclear; I don't understand why adding ``

We assign an empty label to the subtokens added by the tokenizer. This...
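The label-alignment step mentioned above can be sketched in plain Python (a sketch only, not DeLFT's actual code; the `word_ids` list is a hypothetical example of what a HuggingFace fast tokenizer would return, and `"<PAD>"` is a placeholder for the empty label):

```python
def align_labels(word_labels, word_ids, empty_label="<PAD>"):
    """Keep the real label on the first subtoken of each word;
    give the empty label to special tokens and continuation subtokens."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:                 # special tokens like <s>, </s>
            aligned.append(empty_label)
        elif wid != previous:           # first subtoken of a word
            aligned.append(word_labels[wid])
        else:                           # continuation subtoken added by the tokenizer
            aligned.append(empty_label)
        previous = wid
    return aligned

# hypothetical word_ids: 'Troadec' split into two subtokens (word id 0 repeated)
word_ids = [None, 0, 0, 1, 2, 3, None]
labels = ["B-PER", "O", "I-PER", "O"]
print(align_labels(labels, word_ids))
# ['<PAD>', 'B-PER', '<PAD>', 'O', 'I-PER', 'O', '<PAD>']
```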

```
Should we rather use is_split_into_words=False and a concatenated input, as you suggest?
bla bla
But it could work I suppose with more motivation :)
```

Actually I just...

@lfoppiano This RoBERTa model raises encoding issues because its tokenizer is not loaded properly. I could reproduce the error with:

```console
(env) lopez@trainer:~/delft$ python3 delft/applications/grobidTagger.py citation train_eval --architecture BERT_CRF --transformer...
```

By adding the model path info in the `resource_registry.json` file:

```json
"transformers": [
    {
        "name": "matbert-pedro-scicorpus-20000-vocab_100k",
        "model_dir": "/media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k"
    }
],
```

the model is, I think, loaded correctly with...

So, to summarize, there are still two problems:

1) Apparently the vocabulary of the model `matbert-pedro-scicorpus-20000-vocab_100k` is not loaded/considered, because "É" and "ĠÉ" are present in `tokenizer.json` but their encoding...
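Problem 1 can be checked by hand by reading the vocabulary straight out of `tokenizer.json`, which in the standard HuggingFace fast-tokenizer layout sits under `{"model": {"vocab": {token: id}}}`. A minimal sketch, with a tiny inline stand-in replacing the real file (the ids 136 and 70434 are the ones observed above):

```python
import json

# Stand-in for the real tokenizer.json of the model; only the two
# entries of interest are kept here for illustration.
tokenizer_json = json.loads("""
{"model": {"vocab": {"É": 136, "ĠÉ": 70434}}}
""")

vocab = tokenizer_json["model"]["vocab"]
for tok in ("É", "ĠÉ"):
    print(tok, vocab.get(tok, "<not in vocab>"))
# É 136
# ĠÉ 70434
```

If both forms are present with valid ids, the vocabulary file itself is fine and the mismatch must come from how the tokenizer is loaded or how the input is pre-tokenized.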

I believe I fixed problem 2 above with https://github.com/kermitt2/delft/pull/154. This will fix the error for out-of-vocabulary characters that can appear from time to time, also for other sentencepiece...

Indeed it seems correctly loaded, I also have:

```
self.tokenizer.vocab['É']
136
self.tokenizer.vocab['ĠÉ']
70434
```

But then still when tokenizing:

```
input: ['Troadec', ',', 'É', '.']
encode: [(0, 0), (0, 3),...
```

If we pass `['Troadec', ',', 'aī', '.']`, the token is `Ġaī`, which is also not in the vocabulary, so we fall back to the lower byte level to match the...
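The byte-level fallback mentioned above can be illustrated with a minimal re-implementation of the GPT-2/RoBERTa byte-to-unicode table (a sketch of the mechanism, not DeLFT code): every UTF-8 byte is mapped to a printable unicode character, so any out-of-vocabulary character can still be represented as a sequence of byte tokens.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    following the GPT-2 byte-level BPE scheme."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable bytes get shifted above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()

# 'É' is two UTF-8 bytes (0xC3 0x89); at the byte level it becomes 'Ãī',
# which is what the BPE merges are then matched against.
print("".join(table[b] for b in "É".encode("utf-8")))  # Ãī
# a leading space byte becomes 'Ġ', hence the 'Ġ'-prefixed tokens like 'ĠÉ'
print(table[ord(" ")])  # Ġ
```

This is why a character that looks present in the vocabulary can still end up matched at the byte level: what the BPE actually sees is the byte-mapped form, not the raw character.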