
Update StreamingTextDataset to support truncation with multiple truncated items out.

Open · LingxiaoShawn opened this issue 1 year ago · 1 comment

🚀 Feature Request

The current StreamingTextDataset truncates the text/tokens to max_seq_len and discards everything beyond that point. Would it be possible to instead split a long input into multiple items, each of length max_seq_len, so that the remainder of a long sequence is not wasted? If this is not easy to support, could you briefly explain why?
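For illustration, here is a minimal sketch of the requested behavior, assuming the input is already a flat list of token ids; the function name `chunk_tokens` and its signature are hypothetical and not part of llm-foundry:

```python
from typing import List

def chunk_tokens(token_ids: List[int], max_seq_len: int,
                 drop_last: bool = True) -> List[List[int]]:
    """Split a long token sequence into multiple items of max_seq_len
    instead of truncating and discarding the remainder."""
    chunks = [
        token_ids[i:i + max_seq_len]
        for i in range(0, len(token_ids), max_seq_len)
    ]
    # Optionally drop the final short chunk so every item is full length.
    if drop_last and chunks and len(chunks[-1]) < max_seq_len:
        chunks = chunks[:-1]
    return chunks

# Example: a 10-token document with max_seq_len=4 yields two full items
# ([0..3], [4..7]) and drops the 2-token remainder.
print(chunk_tokens(list(range(10)), max_seq_len=4))
```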

Thank you!

LingxiaoShawn · Jul 16 '24 17:07