data_prep format

Open tsebaka opened this issue 6 months ago • 0 comments

Hello! A question about data_prep: if I use --concat_tokens k, all of my data is concatenated and split into chunks of k tokens. What if I instead want to take each sample from my data separately and either truncate it to max_tokens or pad it up to max_tokens? How can this be done in llm-foundry?

With --concat_tokens 2: ["some", "text"] -> ["so", "me", "te", "xt"]

What I want, with max_len=3: ["some", "text", "h"] -> ["som", "tex", "h"]
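To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours. It does not use llm-foundry at all: `concat_and_chunk` and `truncate_or_pad` are hypothetical helper names, characters stand in for tokens as in the example, and dropping the incomplete final chunk is an assumption of this sketch rather than a statement about what `--concat_tokens` does.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")  # a token: an int id, or a character in this toy example


def concat_and_chunk(samples: Sequence[Sequence[T]], k: int) -> List[List[T]]:
    """Join all samples into one token stream and cut it into chunks of k tokens."""
    stream = [tok for sample in samples for tok in sample]
    # Assumption of this sketch: a trailing remainder shorter than k is dropped.
    return [stream[i:i + k] for i in range(0, len(stream) - k + 1, k)]


def truncate_or_pad(samples: Sequence[Sequence[T]], max_len: int, pad: T) -> List[List[T]]:
    """Handle each sample on its own: cut it to max_len, then pad it up to max_len."""
    out = []
    for sample in samples:
        toks = list(sample[:max_len])            # truncate long samples
        toks += [pad] * (max_len - len(toks))    # pad short samples
        out.append(toks)
    return out


# Characters play the role of tokens here.
chunks = concat_and_chunk([list("some"), list("text")], k=2)
per_sample = truncate_or_pad([list("some"), list("text"), list("h")], max_len=3, pad="_")
```

With these inputs, `chunks` mixes tokens across sample boundaries, while `per_sample` keeps one row per input sample, which is the behaviour I am asking about. Whether the short sample should actually be padded or left as is depends on how the batches are collated later.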

I know this is useless when pretraining LLMs, but I also can't find this kind of data_prep for SFT in llm-foundry.

tsebaka, Apr 10 '25 12:04