OLMo
How to train a model on a local copy of the v1_6-sample dataset
❓ The question
train.py is based on the cloud dataset. However, I have downloaded the v1_6-sample dataset from https://huggingface.co/datasets/allenai/dolma. I just want to try training a model for fun on this local dataset. Could you help me with how to do it? In the config, the data is .npy files, whereas in v1_6-sample the data is JSON, so they do not match.
Hey @scalaboy, you can use the tools in Dolma to tokenize the JSON files. See https://github.com/allenai/dolma/blob/main/docs/tokenize.md, for example.
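To give a rough idea of what that conversion step does, here is a minimal sketch using only the Python standard library. It is not the Dolma tokenizer: `toy_tokenize`, `jsonl_to_token_file`, `EOS_ID`, and the file names are hypothetical, and it assumes that each sample file is gzipped JSON Lines with a `text` field and that the training data is a flat array of unsigned 16-bit token IDs. Please check both assumptions against the Dolma docs linked above, and use the real tokenizer for actual training.

```python
import array
import gzip
import json
import os
import tempfile

EOS_ID = 0  # hypothetical end-of-document token ID


def toy_tokenize(text, vocab):
    """Hypothetical stand-in tokenizer: one ID per whitespace-separated word."""
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # ID 0 is reserved for EOS
        ids.append(vocab[word])
    return ids


def jsonl_to_token_file(src_path, dst_path, vocab):
    """Tokenize the "text" field of each JSON line and write flat 16-bit IDs."""
    tokens = array.array("H")  # unsigned short, 2 bytes on common platforms
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            tokens.extend(toy_tokenize(doc["text"], vocab))
            tokens.append(EOS_ID)  # mark the document boundary
    with open(dst_path, "wb") as dst:
        tokens.tofile(dst)
    return len(tokens)


# Build a tiny stand-in for one downloaded sample file (assumed layout).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "sample.json.gz")
dst_path = os.path.join(workdir, "sample.npy")
with gzip.open(src_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"id": "a", "text": "hello world"}) + "\n")
    f.write(json.dumps({"id": "b", "text": "hello again"}) + "\n")

vocab = {}
n_tokens = jsonl_to_token_file(src_path, dst_path, vocab)
print(n_tokens, "tokens written to", dst_path)
```

Once the real tokenized files exist, the data paths in the training config would point at those local files instead of the cloud URLs.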