
Need help with English datasets

bharadwajyadati opened this issue 2 years ago · 2 comments

Hi, amazing work, highly inspirational. Thanks a lot for making it open source.

Which datasets did you use for pre-training the English-only model? It is mentioned that you used the following for pre-training: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md

Pre-training data:

English: Pile, wikipedia, and msmarco

Can you let us know whether you used the entire Pile + Wikipedia + MS MARCO for pre-training, or only a sample of each?
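For anyone trying to assemble a similar pre-training mix, here is a minimal sketch using the Hugging Face `datasets` library. The dataset IDs and snapshot dates below are illustrative assumptions, not the exact dumps the authors used:

```python
# Sketch: streaming the three English pre-training corpora mentioned above.
# Dataset IDs are assumptions for illustration; the authors have not
# confirmed which exact dumps or mirrors they trained on.
from datasets import load_dataset

# English Wikipedia (a public parquet dump; snapshot date is a guess)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# MS MARCO (config version is a guess)
msmarco = load_dataset("ms_marco", "v2.1", split="train", streaming=True)

# The Pile (original hosting has changed over time; this mirror is an assumption)
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

# Peek at a few Wikipedia documents without downloading the full corpus
for example in wiki.take(3):
    print(example["text"][:200])
```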

Also, for fine-tuning, you mentioned 200 million English text pairs (634 GB): https://data.baai.ac.cn/details/BAAI-MTP

sentence-transformers data, Wikipedia, CC-Net, StackExchange, Reddit, S2ORC

Again, can you let us know whether you used the entire corpora or just a portion of them?
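For context, the repository's fine-tuning examples expect text pairs in a JSONL shape with `query`, `pos`, and `neg` fields. A minimal sketch of writing data in that shape (the example strings are invented placeholders):

```python
# Sketch: the JSONL text-pair format used by FlagEmbedding's fine-tuning
# scripts ({"query": ..., "pos": [...], "neg": [...]}); the strings here
# are made-up placeholders, not real training data.
import json

pairs = [
    {
        "query": "what is dense retrieval?",
        "pos": ["Dense retrieval encodes queries and passages into vectors ..."],
        "neg": ["A recipe for sourdough bread ..."],
    },
]

with open("toy_finetune_data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```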

Thanks

bharadwajyadati commented Oct 12, 2023

Hi, thanks for your interest in our work! We used the entire corpus for both pre-training and fine-tuning.

staoxiao commented Oct 13, 2023

@staoxiao, is it possible to share the training data you collected from the various sources as a single zip file? I'm trying to reproduce your training process with a few tweaks, and having the exact dataset would help in making comparisons. Looking forward to your response.

ashokrajab commented Jan 20, 2024

Hi, also curious whether it would be possible to release the formatted English training dataset. And do you happen to know how many tokens were pre-trained on in total?
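Absent an official number, one rough way to estimate the token count yourself is to stream a corpus through the model's tokenizer and extrapolate from a sample. The checkpoint and dataset IDs below are assumptions:

```python
# Sketch: estimating total pre-training tokens by tokenizing a streamed
# sample of one corpus. Model and dataset IDs are assumptions; a small
# sample keeps this fast, at the cost of a rough estimate.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")  # assumed checkpoint
stream = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

sample_tokens, sample_docs = 0, 0
for example in stream.take(1000):  # count a sample, then extrapolate
    sample_tokens += len(tokenizer(example["text"]).input_ids)
    sample_docs += 1

avg = sample_tokens / sample_docs
print(f"avg tokens/doc over {sample_docs} docs: {avg:.1f}")
# total tokens ~= avg * total document count (from the dataset card)
```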

austinmw commented Mar 28, 2024