Need help with English datasets
Hi, amazing work, highly inspirational. Thanks a lot for making it open source.
Which datasets did you use for pre-training the English-only model? It is mentioned that you used the following for pre-training: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md
Pre-training data:
English: Pile, Wikipedia, and MS MARCO
Can you let us know whether you used the entire Pile + Wikipedia + MS MARCO corpora for this pre-training stage, or only a sample of them?
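For reference, this is roughly how I'm trying to pull those three corpora myself. It's only a sketch: the Hugging Face dataset IDs and the one-JSON-object-per-line `{"text": ...}` layout are my own assumptions about what your pre-training script expects, so please correct me if they're wrong.

```python
# Rough reproduction sketch -- the dataset IDs below are my own guesses for
# public mirrors of Pile, Wikipedia, and MS MARCO, not necessarily the dumps you used.
import json
from datasets import load_dataset

def dump_text(rows, text_field, out_path, limit=1000):
    """Write one {"text": ...} JSON object per line (the layout I assume the
    pre-training script expects); `limit` keeps this a small dry run."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            if i >= limit:
                break
            f.write(json.dumps({"text": row[text_field]}) + "\n")

# Streaming avoids downloading hundreds of GB up front.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
msmarco = load_dataset("ms_marco", "v2.1", split="train", streaming=True)

dump_text(pile, "text", "pile.jsonl")
dump_text(wiki, "text", "wikipedia.jsonl")
dump_text(msmarco, "query", "msmarco.jsonl")  # MS MARCO rows expose the query text under "query"
```

Even a pointer to the exact dumps/splits you used would be enough for me to line this up.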
Also, for fine-tuning, you mentioned 200 million English text pairs (634 GB): https://data.baai.ac.cn/details/BAAI-MTP
sentence-transformers data, Wikipedia, CC-Net, StackExchange, Reddit, S2ORC
Again, can you let us know whether you used these corpora in their entirety or only a portion of them?
Thanks
Hi, thanks for your interest in our work! We used the entire corpora for both pre-training and fine-tuning.
@staoxiao, is it possible to share the training data you collected from the various sources as a single zip file? I'm trying to reproduce your training process with a few tweaks, and having the exact dataset would make comparisons easier. Looking forward to your response.
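In the meantime I'm converting my own pair data into the jsonl layout the fine-tuning README appears to expect (one object per line with `query`, `pos`, `neg`). The example records below are made up; only the field names come from the README, so please say so if I've misread it.

```python
# Made-up example records in what I understand to be the fine-tuning jsonl layout:
# one JSON object per line with a query string, a list of positive passages,
# and a list of (hard) negative passages.
import json

examples = [
    {
        "query": "what is the capital of france",
        "pos": ["Paris is the capital and most populous city of France."],
        "neg": [
            "Berlin is the capital of Germany.",
            "Lyon is a large city in southeastern France.",
        ],
    },
]

with open("finetune_pairs.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

If the released data already follows this layout, even a small public sample would let me sanity-check my converter.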
Hi, also curious whether it would be possible to release the formatted English training dataset. And do you happen to know how many tokens were used for pre-training in total?