multilexnorm2021

file not found

Open amdongyang opened this issue 4 years ago • 4 comments

I can't find the file /utility/WikiExtractor.py that is used in initialize.sh. The file seems to be important for synthetic pre-training.

amdongyang • Dec 30 '21 01:12

Hi, you can get the file here, for example: https://github.com/nawnoes/data-preprocess/blob/master/WikiExtractor.py

Note that you actually don't have to download, extract and process the wiki dumps -- we have also released the processed dumps used to train our system here: https://github.com/ufal/multilexnorm2021/releases/tag/v1.0.0

davda54 • Dec 30 '21 18:12

Thanks a lot for your help. I have another question.

After synthetic pre-training, I need to load the saved checkpoint and fine-tune it on the hand-annotated training data.

Is this procedure correct? Right now I fine-tune the ByT5 model directly on the hand-annotated training data, and I only get an ERR of 70.15 on English.

amdongyang • Dec 31 '21 03:12

That sounds alright. I'm not sure what validation dataset you use, but reducing the error by 70% seems good to me :)
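
For reference, ERR (error reduction rate) is usually defined as the word-level accuracy of the system normalized against the "leave every word as-is" baseline: ERR = (acc_system - acc_baseline) / (1 - acc_baseline). Below is a minimal sketch of that computation, assuming the raw, gold, and predicted words are already aligned one-to-one. The function name `error_reduction_rate` and the toy data are made up for illustration, and the official evaluation script may differ in details such as case handling.

```python
def error_reduction_rate(raw, gold, pred):
    """Word-level ERR: (acc_system - acc_baseline) / (1 - acc_baseline).

    raw  -- original (unnormalized) words
    gold -- hand-annotated normalizations
    pred -- system output
    All three are assumed to be aligned lists of equal length.
    """
    if not (len(raw) == len(gold) == len(pred)):
        raise ValueError("raw, gold and pred must have the same length")
    total = len(raw)
    if total == 0:
        raise ValueError("empty input")

    # Baseline: copy the raw word unchanged.
    baseline_correct = sum(r == g for r, g in zip(raw, gold))
    system_correct = sum(p == g for p, g in zip(pred, gold))

    if baseline_correct == total:
        raise ValueError("no word needs normalization; ERR is undefined")

    # Same formula as above, with both accuracies multiplied by `total`.
    return (system_correct - baseline_correct) / (total - baseline_correct)


# Toy example (hypothetical data): 3 of 5 words need normalization,
# and the system fixes 2 of them without breaking the other words.
raw = ["u", "r", "gr8", "today", "lol"]
gold = ["you", "are", "great", "today", "lol"]
pred = ["you", "are", "gr8", "today", "lol"]
print(f"ERR = {100 * error_reduction_rate(raw, gold, pred):.2f}")
```

An ERR of 0 means the system is no better than leaving the text untouched, and a negative value means it introduces more errors than it fixes.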

davda54 • Jan 03 '22 10:01

As for the validation dataset, I simply use the test file at /data/multilexnorm/test/eval/test/intrinsic_evaluation/en/test.norm.masked, and I am trying to reach the performance reported in the paper (73.8 on English).

amdongyang • Jan 04 '22 02:01