Can long text be split into short texts?

Open CoinCheung opened this issue 1 year ago • 0 comments

❓ The question

I generate training samples with dolma, and I found that some of the texts are really long, up to 8k tokens, while my max_seq_len is only 2k. In this case, will the OLMo dataset split the 8k sample into 4 parts (each 2k long), or will only the first 2k tokens be kept and the rest dropped?
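To make the two possibilities concrete, here is a minimal sketch of what I mean. It is not taken from the OLMo code; `split_into_chunks` and `truncate` are hypothetical helper names used only for illustration, and the token ids are dummy values.

```python
from typing import List


def split_into_chunks(token_ids: List[int], max_seq_len: int) -> List[List[int]]:
    """Option A (hypothetical): cut one long sample into consecutive pieces."""
    return [
        token_ids[start:start + max_seq_len]
        for start in range(0, len(token_ids), max_seq_len)
    ]


def truncate(token_ids: List[int], max_seq_len: int) -> List[int]:
    """Option B (hypothetical): keep the first max_seq_len tokens, drop the rest."""
    return token_ids[:max_seq_len]


# Dummy 8k-token sample and a 2k context length, as in the question.
tokens = list(range(8192))
max_seq_len = 2048

chunks = split_into_chunks(tokens, max_seq_len)  # every token is kept
kept = truncate(tokens, max_seq_len)             # only the first 2048 tokens survive
```

With option A the 8k sample yields 4 training sequences and no tokens are lost; with option B it yields a single sequence and the remaining 6k tokens are discarded.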

CoinCheung, Jul 12 '24 08:07