Could I fine tune this model for Chinese datasets?

Open asenasen123 opened this issue 2 years ago • 11 comments

Could you please tell me how i can fine tune for my custom Chinese datasets?

asenasen123 avatar Aug 18 '23 09:08 asenasen123

Sure, if you want to finetune, you can follow some of what is outlined in this issue: https://github.com/Muennighoff/sgpt/issues/2

For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco which has seen lots of Chinese during pretraining & might be good enough
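The asymmetric-search setup mentioned above boils down to embedding a query and candidate documents, then ranking documents by cosine similarity. A minimal sketch with dummy vectors (in practice the embeddings would come from a model such as sgpt-bloom-7b1-msmarco; the function names here are illustrative, not part of any library):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_emb, doc_embs):
    """Return document indices sorted by similarity to the query."""
    scores = [cosine_similarity(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)

# Dummy 4-dimensional embeddings standing in for model output.
query = [0.1, 0.9, 0.2, 0.0]
docs = [
    [0.0, 1.0, 0.1, 0.0],  # close to the query direction
    [1.0, 0.0, 0.0, 0.5],  # far from the query direction
]
print(rank_documents(query, docs))  # doc 0 ranks first
```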

Muennighoff avatar Aug 18 '23 09:08 Muennighoff

Do many sgpt models on Hugging Face support Chinese?

asenasen123 avatar Aug 21 '23 01:08 asenasen123

If I want to fine-tune the sgpt model, do I just change the dataset?

asenasen123 avatar Aug 21 '23 01:08 asenasen123

I think only the bloom ones perform well for Chinese. Yes, you can just change the dataset.
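"Just changing the dataset" means writing your custom Chinese examples in the same layout the training code expects. A hedged sketch using a (query, positive, negative) triple format commonly used for contrastive fine-tuning; the field names are assumptions, not a format SGPT's training scripts are guaranteed to expect:

```python
import json

# Hypothetical Chinese training triples: a query, a relevant passage,
# and an irrelevant passage.
triples = [
    {
        "query": "什么是向量检索?",
        "pos": "向量检索通过比较嵌入向量来查找相关文档。",
        "neg": "今天的天气很好。",
    },
]

# Write one JSON object per line (JSONL), keeping Chinese readable.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for t in triples:
        f.write(json.dumps(t, ensure_ascii=False) + "\n")

# Read it back to verify the layout round-trips.
with open("train.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 1
```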

Muennighoff avatar Aug 21 '23 06:08 Muennighoff

Which Chinese dataset should I evaluate the fine-tuned model on?

asenasen123 avatar Aug 21 '23 07:08 asenasen123

I would evaluate on the Chinese datasets in MTEB. If you train a Retrieval model, you can try the Chinese Retrieval datasets from C-MTEB: https://huggingface.co/spaces/mteb/leaderboard

Also see https://github.com/embeddings-benchmark/mteb/pull/134

Muennighoff avatar Aug 21 '23 07:08 Muennighoff

Are the evaluation metrics also Pearson and Spearman correlation?

asenasen123 avatar Aug 21 '23 07:08 asenasen123

For retrieval datasets it's nDCG@10. But don't worry about the evaluation - if you use MTEB, it automatically calculates the scores for you.
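For intuition, nDCG@10 can be sketched in a few lines: it sums the graded relevance of the top-10 ranked documents with a logarithmic position discount, then normalizes by the ideal ranking. MTEB computes this for you; this is only an illustration:

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; ranking a relevant doc too low lowers it.
print(ndcg([3, 2, 1, 0]))        # 1.0
print(ndcg([0, 2, 1, 3]) < 1.0)  # True
```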

Muennighoff avatar Aug 21 '23 07:08 Muennighoff

Thank you very much!

asenasen123 avatar Aug 21 '23 07:08 asenasen123

What about fine-tuning for Spanish?

wilfoderek avatar Nov 13 '23 14:11 wilfoderek

Sure, you can do that too. https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco has also seen a lot of Spanish, so it may work well for you.

Muennighoff avatar Nov 13 '23 16:11 Muennighoff