
Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

8 PL-BERT issues

I'm having trouble running the [preprocess jupyter notebook](https://github.com/yl4579/PL-BERT/blob/main/preprocess.ipynb) you provided. I was trying to create PL-BERT for the Slovak language, but even when I run the code as provided,...

I'm trying to train PL-BERT for Vietnamese (using a multilingual BERT-based model with the wiki-vi dataset), but the vocab loss is 0.0 (from the very first step). Is that okay? @yl4579 Step [19920/1000000], Loss: 0.33009,...

TL;DR:

* Encountering frequent NaN values, mainly for the loss, during training with [a large JPN dataset](https://huggingface.co/datasets/oshizo/japanese-wikipedia-paragraphs) (10.5 million rows).
* No such issues with another, albeit [smaller dataset](https://huggingface.co/datasets/range3/wiki40b-ja) (800,000...
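One common guard against occasional NaN losses on a large, noisy corpus is to check the loss before the optimizer step and skip the offending batch. The sketch below is illustrative and not from the PL-BERT training loop; `safe_training_step` is a hypothetical helper, and in a real loop the value would come from `loss.item()` before `backward()`.

```python
import math

def safe_training_step(loss_value, skip_log):
    """Return True if it is safe to backprop this batch.

    If the loss is NaN or Inf, record it and tell the caller to skip
    optimizer.step() for this batch instead of corrupting the weights.
    """
    if not math.isfinite(loss_value):
        skip_log.append(loss_value)
        return False
    return True
```

A usage pattern would be: if `safe_training_step(loss.item(), skipped)` returns `False`, `continue` to the next batch; inspecting `skipped` afterwards helps tell occasional bad samples from a diverging run.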

Fix for https://github.com/yl4579/PL-BERT/issues/29 and support for Malayalam.

```
text = 'hello (1200 - 1230)'
out = normalize_text(text)
print(out)
# hello (one thousand two hundred to one thousand two hundred thirty)
```
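The range expansion shown above can be sketched with a small self-contained number-to-words helper plus a regex substitution. This is not the repo's `normalize_text`; `number_to_words` and `expand_number_ranges` are hypothetical names, and the converter only handles integers below one million.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer 0..999999 in English words (no 'and', no commas)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
    if n < 1000:
        s = ONES[n // 100] + " hundred"
        return s if n % 100 == 0 else s + " " + number_to_words(n % 100)
    s = number_to_words(n // 1000) + " thousand"
    return s if n % 1000 == 0 else s + " " + number_to_words(n % 1000)

def expand_number_ranges(text):
    """Rewrite 'A - B' number ranges as '<words> to <words>'."""
    return re.sub(
        r"(\d+)\s*-\s*(\d+)",
        lambda m: f"{number_to_words(int(m.group(1)))} to "
                  f"{number_to_words(int(m.group(2)))}",
        text,
    )
```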

I saw issues about this error (#28), but I don't know how to solve it. I don't know how to write code that skips the error. Can you...
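A common way to "skip the error" during preprocessing is to wrap the failing call in try/except and log the offending samples instead of crashing. This is only a sketch under the assumption that the error is raised per sentence; `phonemize_safe` and `phonemize_fn` are hypothetical names standing in for whatever G2P/phonemization call fails.

```python
def phonemize_safe(sentences, phonemize_fn):
    """Apply phonemize_fn to each sentence, skipping any that raise.

    Returns (processed, skipped) where skipped holds (index, error) pairs
    so the dropped samples can be inspected later.
    """
    processed, skipped = [], []
    for i, sentence in enumerate(sentences):
        try:
            processed.append(phonemize_fn(sentence))
        except Exception as exc:  # noqa: BLE001 - deliberately broad here
            skipped.append((i, repr(exc)))
    return processed, skipped
```

Checking `len(skipped)` against the corpus size tells you whether the failures are rare outliers (safe to drop) or a systematic bug worth fixing upstream.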

https://github.com/yl4579/PL-BERT/blob/592293aabcb21096eb7f5bffad95a3d38ba4ae6c/dataloader.py#L83 Hi, why is masked_index extended for 15% of tokens? If I understand correctly, the extension should be placed inside the else statement at line #80, right?
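For context, one common reading is that this follows standard BERT masking: 15% of positions are selected as prediction targets, and of those 80% become [MASK], 10% become a random token, and 10% stay unchanged. All selected positions are still targets, which is why extending masked_index outside the 80% branch can be intentional. The sketch below is illustrative, not the repo's dataloader; MASK_ID and the vocabulary size are made-up values.

```python
import random

MASK_ID = 103       # hypothetical [MASK] token id
VOCAB_SIZE = 1000   # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """BERT-style masking sketch.

    Every selected position goes into masked_index (it is a prediction
    target) even when the token is replaced randomly or left unchanged.
    """
    rng = random.Random(seed)
    ids = list(token_ids)
    masked_index = []
    for i in range(len(ids)):
        if rng.random() < mask_prob:
            masked_index.append(i)          # target regardless of branch below
            r = rng.random()
            if r < 0.8:
                ids[i] = MASK_ID            # 80%: replace with [MASK]
            elif r < 0.9:
                ids[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return ids, masked_index
```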

Do you have any suggestions for Chinese data preprocessing? For example, text normalization, g2p, etc. From your experience, will the accuracy of the g2p model have a great impact on the...