llm-foundry
data_prep format
Hello! Question: in data_prep, if I use --concat_tokens k, it splits all of my data into chunks of k tokens. But what if I just want to take each sample from my data and either truncate it to max_tokens or pad it with pad tokens up to max_tokens (per sample)? How can this be done in llm-foundry?
--concat_tokens 2
["some", "text"] -> ["so", "me", "te", "xt"]
I want:
max_len=3
["some", "text", "h"] -> ["som", "tex", "h
I know this is not very useful for pretraining LLMs, but for SFT I also can't find a data_prep option like this in llm-foundry.
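
To make it concrete, here is a minimal sketch of the per-sample truncation/padding I have in mind, done directly with a Hugging Face tokenizer (the model name is just an example, and this is not an existing llm-foundry option as far as I can tell):

```python
from transformers import AutoTokenizer

# Illustration only: truncate or pad each sample independently to max_len tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

samples = ["some", "text", "h"]
max_len = 3

encoded = tokenizer(
    samples,
    truncation=True,       # cut each sample down to max_len tokens
    padding="max_length",  # pad shorter samples up to max_len tokens
    max_length=max_len,
)
for ids in encoded["input_ids"]:
    print(ids)  # every sample is exactly max_len token ids
```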