processing data for BERT experiment

Open kenoharada opened this issue 1 year ago • 4 comments

The following steps are modified from Fairseq-Roberta. For completeness, we list some key steps here.

Why did you modify the dataset setup? In the original fairseq example, it seems the raw data can simply be downloaded.

https://github.com/sail-sg/Adan/tree/main/NLP/BERT#ii-generate-raw-data Could you share the code for generating the raw data?

kenoharada avatar Apr 29 '23 05:04 kenoharada

@kenoharada Thanks for your interest. In fact, Fairseq-Roberta just uses the WikiText-103 dataset to demonstrate the steps. For other data, some preprocessing is needed to match the requirements of GPT-2 BPE. If your data already satisfies the pattern I described here, you can skip that step: https://github.com/sail-sg/Adan/tree/main/NLP/BERT#ii-generate-raw-data

As for the code that generates the raw data, I may need a few days to reply, since I just copied the data from another project and need to ask that project's author to share the code. BTW, I could also provide the data directly if you need it.

XingyuXie avatar Apr 29 '23 06:04 XingyuXie
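
For readers following along, the kind of preprocessing discussed above can be sketched as follows. This is a minimal sketch, not the code from the Adan repository or from the project the data was copied from. It assumes the target layout is the fairseq language-modeling convention (plain text, documents separated by a single empty line); the exact pattern required by the Adan README may differ. The function names `normalize_document` and `to_raw_text` are hypothetical.

```python
import re
from typing import List


def normalize_document(text: str) -> List[str]:
    """Split one document into non-empty lines with collapsed whitespace."""
    lines = []
    for raw_line in text.splitlines():
        # Collapse runs of whitespace (tabs, double spaces) into one space.
        line = re.sub(r"\s+", " ", raw_line).strip()
        if line:
            lines.append(line)
    return lines


def to_raw_text(documents: List[str]) -> str:
    """Join documents into one string, separated by a single empty line.

    Assumption: the downstream BPE step expects blank-line-separated
    documents, as in the fairseq language-modeling format.
    """
    blocks = []
    for doc in documents:
        lines = normalize_document(doc)
        if lines:  # skip documents that are empty after cleaning
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    docs = ["First  document,\tline one.\nLine two.", "Second document."]
    # Print the normalized text; redirect to a file to feed the BPE step.
    print(to_raw_text(docs), end="")
```

In the fairseq RoBERTa recipe, text in this layout is then encoded with the GPT-2 BPE encoder and binarized with `fairseq-preprocess`; whether the Adan pipeline needs further cleaning beyond this is not stated in the thread.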

@XingyuXie Hi, thank you for the reply!

For other data, some preprocessing is needed to match the requirements of GPT-2 BPE.

Thank you, I understand the necessity of the modification.

As for the code that generates the raw data, I may need a few days to reply, since I just copied the data from another project and need to ask that project's author to share the code.

If it is possible, I would like to ask you to share the code.

BTW, I could also provide the data directly if you need it.

Thank you very much!! I would really appreciate it if you could share the processed raw data so that I can run the experiment.

kenoharada avatar Apr 29 '23 06:04 kenoharada

@kenoharada Here is the code for downloading and processing the data: download_data.py.zip

To share the data, I need to submit a simple application; once it is approved, I will post the link here.

XingyuXie avatar Apr 29 '23 07:04 XingyuXie

@XingyuXie Thank you very much!! I will try it!

kenoharada avatar Apr 29 '23 12:04 kenoharada