datatrove
Support int32 in substring dedup
I'm using a tokenizer with a vocabulary of more than 100k tokens, so token IDs overflow because they are stored as uint16 (maximum value 65,535). Could we add support for int32? Is it enough to simply change the dtype, or are there other places that would need to change as well?
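To illustrate the problem, here is a minimal NumPy sketch of the overflow and of choosing a wider dtype from the vocabulary size. It is not datatrove code: `pick_token_dtype` is a hypothetical helper and the token IDs are made-up values, so it only shows the dtype side of the question, not which parts of the dedup pipeline would need to change.

```python
import numpy as np


def pick_token_dtype(vocab_size: int) -> np.dtype:
    """Hypothetical helper: smallest unsigned dtype that can hold ids 0..vocab_size-1."""
    max_id = vocab_size - 1
    for candidate in (np.uint16, np.uint32, np.uint64):
        if max_id <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError(f"vocab_size {vocab_size} is too large for uint64")


# Made-up token ids; two of them exceed the uint16 maximum of 65535.
token_ids = np.array([5, 70_000, 100_000], dtype=np.int64)

# Casting to uint16 silently wraps modulo 65536, corrupting the large ids.
wrapped = token_ids.astype(np.uint16)
print(wrapped.tolist())  # [5, 4464, 34464]

# With a dtype chosen from the vocab size, the ids survive a bytes round trip.
dtype = pick_token_dtype(100_001)
raw = token_ids.astype(dtype).tobytes()
restored = np.frombuffer(raw, dtype=dtype)
print(dtype, restored.tolist())  # uint32 [5, 70000, 100000]
```

Note that each token takes 4 bytes instead of 2 with a 32-bit dtype, so any code that assumes a fixed 2-byte token width when reading or writing the sequence files would presumably need to use the dtype's `itemsize` instead.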