Adan
processing data for BERT experiment
The following steps are modified from Fairseq-Roberta. For completeness, we list some key steps here.
I would like to ask why you modified the dataset settings? In the original fairseq, it seems we can just download the raw data.
https://github.com/sail-sg/Adan/tree/main/NLP/BERT#ii-generate-raw-data Can you share the code for generating the raw data?
@kenoharada Thanks for your interest. In fact, Fairseq-Roberta just uses the WikiText-103 dataset to show the steps. For other data, we need to do some preprocessing to match the requirements of GPT-2 BPE. If your data already satisfies the pattern I described in https://github.com/sail-sg/Adan/tree/main/NLP/BERT#ii-generate-raw-data, you can skip that step.
As for the code for generating the raw data, I may reply to you in a few days, since I just copied the data from another project. I need to ask the author of that project to share the code. BTW, I could also provide you with the data directly if you need it.
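As a rough illustration of what such a preprocessing step could look like (this is not the project's actual script), the sketch below writes a collection of documents into a single raw text file. It assumes the language-modeling layout described in the fairseq RoBERTa pretraining instructions, where documents are separated by an empty line; the exact pattern required here is the one in the README section linked above. The names `to_raw_text` and `write_raw_file` are hypothetical helpers introduced only for this example.

```python
from pathlib import Path
from typing import Iterable, List


def to_raw_text(documents: Iterable[str]) -> str:
    """Join documents into one text stream, one blank line between documents.

    Assumption: blank lines are reserved for document boundaries, so any
    empty lines inside a document are dropped.
    """
    cleaned: List[str] = []
    for doc in documents:
        # Keep only non-empty lines, with surrounding whitespace removed.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if lines:
            cleaned.append("\n".join(lines))
    # Separate documents with exactly one empty line and end with a newline.
    return "\n\n".join(cleaned) + "\n"


def write_raw_file(documents: Iterable[str], path: str) -> None:
    """Write the joined text to `path` (hypothetical output location)."""
    Path(path).write_text(to_raw_text(documents), encoding="utf-8")
```

The resulting file could then be passed to the GPT-2 BPE encoding and binarization steps listed in the README.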
@XingyuXie Hi, thank you for the reply!
For other data, we need to do some preprocessing to match the requirements of GPT-2 BPE.
Thank you, I understand the necessity of the modification.
As for the code for generating the raw data, I may reply to you in a few days, since I just copied the data from another project. I need to ask the author of that project to share the code.
If it is possible, I would like to ask you to share the code.
BTW, I could also provide you with the data directly if you need it.
Thank you very much!! I would really appreciate it if you could share the processed raw data so that I can run the experiment.
@kenoharada Here is the code for downloading and processing the data: download_data.py.zip
To share the data itself, I need to submit a simple application. Once it is approved, I will post the link here.
@XingyuXie Thank you very much!! I will try it!