mistral-inference

Missing model card / data sheet with info on pretraining and RLHF datasets

Open mdingemanse opened this issue 1 year ago • 4 comments

At opening-up-chatgpt.github.io we're documenting data sources and degrees of openness along several dimensions for instruction-tuned LLMs. I am looking for information about (1) the pretraining dataset and (2) the RLHF datasets, but have not found any details. The HuggingFace model card says:

For full details of this model please read our release blog post

The release blog post provides no information on this at present.

mdingemanse, Sep 28 '23 09:09

Information on the language composition of the pretraining dataset would also be welcome, as there is no mention of the model's multilingual capabilities in the linked blog post.

aakosm, Sep 28 '23 10:09

I would like to work on this project!

149189, Sep 29 '23 18:09

Upvote thread

AlexWortega, Oct 02 '23 16:10

FWIW, Mistral currently sits in the bottom 5 of the live tracker of LLM openness.

mdingemanse, Oct 02 '23 17:10