llm-foundry
Update StreamingTextDataset to support truncation that yields multiple truncated items.
🚀 Feature Request
The current StreamingTextDataset truncates the text/tokens to max_seq_len and throws away everything past that point. Would it be possible to instead split the text/tokens into multiple items, each of at most max_seq_len (see the sketch below)? That way, when an input item is longer than max_seq_len, the excess isn't wasted. If this is not easy to support, could you briefly explain why?
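For illustration, here is a minimal sketch of the chunking behavior I have in mind. The `chunk_tokens` helper is hypothetical, not part of the current llm-foundry API; it just shows the split-instead-of-discard semantics:

```python
from typing import List

def chunk_tokens(token_ids: List[int], max_seq_len: int) -> List[List[int]]:
    """Hypothetical helper: split a long token sequence into multiple
    chunks of at most max_seq_len tokens, instead of truncating once
    and discarding the tail."""
    return [
        token_ids[i : i + max_seq_len]
        for i in range(0, len(token_ids), max_seq_len)
    ]

# Example: with max_seq_len=4, a 10-token sample yields three items
# instead of one truncated item that discards 6 tokens.
tokens = list(range(10))
print(chunk_tokens(tokens, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The dataset could then yield each chunk as a separate sample (possibly dropping or padding a short final chunk, depending on the desired behavior).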
Thank you!