file not found
I can't find the file /utility/WikiExtractor.py that is used in initialize.sh. The file seems to be important for synthetic pre-training.
Hi, you can get the file here, for example: https://github.com/nawnoes/data-preprocess/blob/master/WikiExtractor.py
Note that you actually don't have to download, extract and process the wiki dumps -- we have also released the processed dumps used to train our system here: https://github.com/ufal/multilexnorm2021/releases/tag/v1.0.0
Thanks a lot for your help. I have another question.
After synthetic pre-training, I need to load the saved checkpoint and fine-tune it on the hand-annotated training data.
Is this procedure right? Right now I fine-tune the ByT5 model directly on the hand-annotated training data, and I only get an ERR of 70.15 on English.
That sounds alright. I'm not sure what validation dataset you use, but reducing the error by 70% seems good to me :)
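For reference, here is a minimal sketch of how ERR (error reduction rate) is commonly computed for lexical normalization: the system's word-level accuracy is compared against a leave-as-is baseline that copies the input unchanged. The function name and the toy data are hypothetical, and this is not the official evaluation script, which may differ in details such as casing or tokenization.

```python
def error_reduction_rate(raw, gold, pred):
    """Word-level ERR: (acc_system - acc_baseline) / (1 - acc_baseline).

    raw  -- original (unnormalized) tokens
    gold -- reference normalizations
    pred -- system normalizations
    The baseline is "leave as is", i.e. it predicts the raw token.
    """
    if not (len(raw) == len(gold) == len(pred)):
        raise ValueError("raw, gold and pred must have the same length")
    if not gold:
        raise ValueError("empty input")

    # Tokens the leave-as-is baseline already gets right.
    baseline_correct = sum(r == g for r, g in zip(raw, gold))
    # Tokens the system gets right.
    system_correct = sum(p == g for p, g in zip(pred, gold))

    # Tokens that actually need normalization (baseline errors).
    needs_normalization = len(gold) - baseline_correct
    if needs_normalization == 0:
        raise ValueError("nothing to normalize; ERR is undefined")

    # Equivalent to (acc_system - acc_baseline) / (1 - acc_baseline).
    return (system_correct - baseline_correct) / needs_normalization


# Toy example (hypothetical data): two tokens need normalization,
# the system fixes one of them and leaves the other untouched.
raw = ["u", "are", "gr8", "lol"]
gold = ["you", "are", "great", "lol"]
pred = ["you", "are", "gr8", "lol"]
print(error_reduction_rate(raw, gold, pred))  # 0.5
```

Under this definition, a score of 70.15 would mean the system removes about 70% of the errors that the leave-as-is baseline makes. Note that ERR can be negative if the system breaks more tokens than it fixes.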
As for the validation dataset, I simply use the test file at (/data/multilexnorm/test/eval/test/intrinsic_evaluation/en/test.norm.masked), and I am trying to reach the performance reported in the paper (73.8 on English).