datatrove Support int32 in substring dedup

Open jordane95 opened this issue 11 months ago • 4 comments

I'm using a tokenizer with a vocabulary of more than 100k tokens, so the token ids overflow because they are stored as uint16 (maximum value 65,535). Could we add support for int32? Is it enough to simply change the type, or are there other places that need to be changed?
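To make the problem concrete, here is a minimal sketch of the overflow and of one possible way to pick a wider type from the vocabulary size. It uses only Python's standard `struct` module. The helpers `token_format`, `encode_tokens`, and `decode_tokens` are hypothetical names made up for illustration and are not part of datatrove; how datatrove actually serializes token ids is not shown here.

```python
import struct

UINT16_MAX = 2**16 - 1  # 65535, the largest id a 2-byte unsigned field can hold
UINT32_MAX = 2**32 - 1


def token_format(vocab_size: int) -> str:
    """Hypothetical helper: narrowest little-endian unsigned struct format
    that can hold every id in [0, vocab_size)."""
    max_id = vocab_size - 1
    if max_id <= UINT16_MAX:
        return "<H"  # 2 bytes per token
    if max_id <= UINT32_MAX:
        return "<I"  # 4 bytes per token
    raise ValueError(f"vocab_size {vocab_size} does not fit in 32 bits")


def encode_tokens(token_ids, vocab_size):
    """Hypothetical helper: pack ids with the width chosen for this vocab."""
    fmt = token_format(vocab_size)
    return b"".join(struct.pack(fmt, t) for t in token_ids)


def decode_tokens(data, vocab_size):
    """Hypothetical helper: inverse of encode_tokens."""
    fmt = token_format(vocab_size)
    return [t[0] for t in struct.iter_unpack(fmt, data)]


ids = [5, 70_000, 120_000]

# What a silent 16-bit truncation would store: ids above 65535 wrap around.
wrapped = [t & 0xFFFF for t in ids]

# With a 4-byte format the ids survive a round trip unchanged.
roundtrip = decode_tokens(encode_tokens(ids, 128_000), 128_000)
```

Under these assumptions, widening the type also doubles the bytes per token, so any code that computes offsets or lengths from a fixed 2-byte token size would need to change as well. A signed int32 would also be wide enough for a 100k vocabulary; the sketch uses an unsigned type only because ids are never negative.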

Mar 08 '24 09:03 jordane95
