Need help with English datasets
Hi, amazing work, highly inspirational. Thanks a lot for making it open source.
Which datasets did you use for pre-training the English-only model? It is mentioned that you used the following for pre-training: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md
Pre-training data:
English: Pile, Wikipedia, and MS MARCO
Can you let us know whether you used the entire Pile + Wikipedia + MS MARCO corpora for this pre-training stage, or only a sample of them?
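For reference, this is roughly how I'm trying to pull those three corpora myself. It's only a sketch: the Hugging Face dataset IDs and the one-JSON-object-per-line `{"text": ...}` layout are my own assumptions about what your pre-training script expects, so please correct me if they're wrong.

```python
# Rough reproduction sketch -- the dataset IDs below are my own guesses for
# public mirrors of Pile, Wikipedia, and MS MARCO, not necessarily the dumps you used.
import json
from datasets import load_dataset

def dump_text(rows, text_field, out_path, limit=1000):
    """Write one {"text": ...} JSON object per line (the layout I assume the
    pre-training script expects); `limit` keeps this a small dry run."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            if i >= limit:
                break
            f.write(json.dumps({"text": row[text_field]}) + "\n")

# Streaming avoids downloading hundreds of GB up front.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
msmarco = load_dataset("ms_marco", "v2.1", split="train", streaming=True)

dump_text(pile, "text", "pile.jsonl")
dump_text(wiki, "text", "wikipedia.jsonl")
dump_text(msmarco, "query", "msmarco.jsonl")  # MS MARCO rows expose the query text under "query"
```

Even a pointer to the exact dumps/splits you used would be enough for me to line this up.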
Also, for fine-tuning, you mentioned 200 million English text pairs (634 GB): https://data.baai.ac.cn/details/BAAI-MTP
sentence-transformers data, Wikipedia, CC-Net, StackExchange, Reddit, S2ORC
Again, can you let us know whether you used these corpora in their entirety or only a portion of them?
Thanks
Hi, thanks for your interest in our work! We used the entire corpora for both pre-training and fine-tuning.
@staoxiao, is it possible to share the training data you collected from the various sources as a single zip file? I'm trying to reproduce your training process with a few tweaks, and having the exact dataset would make comparisons easier. Looking forward to your response.
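In the meantime I'm converting my own pair data into the jsonl layout the fine-tuning README appears to expect (one object per line with `query`, `pos`, `neg`). The example records below are made up; only the field names come from the README, so please say so if I've misread it.

```python
# Made-up example records in what I understand to be the fine-tuning jsonl layout:
# one JSON object per line with a query string, a list of positive passages,
# and a list of (hard) negative passages.
import json

examples = [
    {
        "query": "what is the capital of france",
        "pos": ["Paris is the capital and most populous city of France."],
        "neg": [
            "Berlin is the capital of Germany.",
            "Lyon is a large city in southeastern France.",
        ],
    },
]

with open("finetune_pairs.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

If the released data already follows this layout, even a small public sample would let me sanity-check my converter.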
Hi, also curious whether it would be possible to release the formatted English training dataset. And do you happen to know how many tokens were used for pre-training in total?