datatrove

Support int32 in substring dedup

Open jordane95 opened this issue 11 months ago • 4 comments

I'm using a tokenizer with a vocabulary of more than 100k tokens, so token ids overflow when they are stored as uint16 (which tops out at 65,535). Could we add support for int32? Is it enough to simply change the dtype, or are there other places that need to change as well?
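To make the problem concrete, here is a minimal standalone sketch of picking a token width from the vocabulary size. It is not datatrove's actual code: `token_format`, `encode_tokens`, and `decode_tokens` are hypothetical names, and it uses only Python's `struct` module to illustrate the idea.

```python
import struct


def token_format(vocab_size: int) -> str:
    """Return a struct format code wide enough for ids in [0, vocab_size)."""
    max_id = vocab_size - 1
    if max_id <= 0xFFFF:
        return "H"  # unsigned 16-bit, 2 bytes per token
    if max_id <= 0xFFFFFFFF:
        return "I"  # unsigned 32-bit, 4 bytes per token
    raise ValueError(f"vocab size {vocab_size} does not fit in 32 bits")


def encode_tokens(token_ids: list[int], vocab_size: int) -> bytes:
    """Pack token ids as little-endian integers of the chosen width."""
    fmt = token_format(vocab_size)
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)


def decode_tokens(data: bytes, vocab_size: int) -> list[int]:
    """Unpack bytes written by encode_tokens; the width must match the writer."""
    fmt = token_format(vocab_size)
    item_size = struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{len(data) // item_size}{fmt}", data))


# Hypothetical usage with a 128k vocabulary: ids above 65,535 survive the round trip.
ids = [0, 65_535, 65_536, 127_999]
packed = encode_tokens(ids, vocab_size=128_000)
restored = decode_tokens(packed, vocab_size=128_000)
```

The point of the sketch is that the width has to agree everywhere: whatever writes the tokens and whatever reads them back (including any code that computes offsets from a fixed bytes-per-token value) would need the same setting, so changing the type in one place alone is probably not enough.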

jordane95, Mar 08 '24 09:03