
How to train a model based on the v1_6-sample dataset using a local training set

Open scalaboy opened this issue 10 months ago • 1 comments

❓ The question

train.py is based on a cloud-hosted dataset. However, I have downloaded the v1_6-sample dataset from https://huggingface.co/datasets/allenai/dolma, and I just want to try training a model for fun on this local dataset. Could you help me with how to do it? In the config, the data is .npy files, whereas in v1_6-sample the data type is JSON, so they do not match.

scalaboy avatar Apr 03 '24 04:04 scalaboy

Hey @scalaboy, you can use the tools in Dolma to tokenize the JSON files. See https://github.com/allenai/dolma/blob/main/docs/tokenize.md, for example.

epwalsh avatar Apr 03 '24 16:04 epwalsh
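
For intuition about what that tokenization step produces, here is a minimal Python sketch of a JSON-to-token-array conversion. It is not the Dolma tokenizer: `toy_tokenize`, `EOS_TOKEN_ID`, `iter_documents`, and `jsonl_to_token_file` are hypothetical names, the byte-level tokenizer is only a stand-in for a real one, and the sketch assumes that each line of a sample file is a JSON object with a `text` field and that the training config expects flat arrays of 16-bit token IDs. For actual training, use the Dolma tool linked above.

```python
import gzip
import json
from array import array

# Hypothetical end-of-document id: one past the byte range used below.
EOS_TOKEN_ID = 256


def toy_tokenize(text):
    """Placeholder tokenizer: one token id per UTF-8 byte (0..255)."""
    return list(text.encode("utf-8"))


def iter_documents(path):
    """Yield one JSON object per non-empty line; handles optional gzip."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def jsonl_to_token_file(src, dst):
    """Tokenize every document in src and write the ids to dst.

    The output is a flat binary file of unsigned 16-bit integers, with
    EOS_TOKEN_ID appended after each document. Returns the token count.
    """
    tokens = array("H")  # "H" = unsigned short (2 bytes on common platforms)
    for doc in iter_documents(src):
        tokens.extend(toy_tokenize(doc["text"]))
        tokens.append(EOS_TOKEN_ID)
    with open(dst, "wb") as f:
        tokens.tofile(f)
    return len(tokens)
```

Calling `jsonl_to_token_file("sample.json.gz", "sample.npy")` (both file names are placeholders) would write one flat token file. A real pipeline would swap in the tokenizer and special-token ids that match the model being trained.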