tokenizers
Support for binary data
I'm trying to train a transformer on a binary dataset, but `ByteLevelBPETokenizer` raises `Exception: stream did not contain valid UTF-8`. It seems to me that a byte-level tokenizer could operate agnostically on raw bytes rather than trying to decode Unicode strings. Is there no support for binary data?
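Indeed, there is no support for binary data. "Byte-level" here means that the tokenizer processes Unicode text as its UTF-8 bytes rather than as Unicode code points. The input still has to be valid UTF-8 text, which is why arbitrary binary data raises this error.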
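One possible workaround, which is my own suggestion and not something the library documents for this purpose, is to map each byte to a Unicode character before handing the data to the tokenizer. Decoding with Latin-1 does this losslessly, because it assigns every byte value 0-255 to the code point with the same number. The helper names below (`bytes_to_text`, `text_to_bytes`) are hypothetical, and the sketch uses only the Python standard library:

```python
def bytes_to_text(data: bytes) -> str:
    """Map each byte to the code point with the same value (U+0000..U+00FF)."""
    return data.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse mapping; raises UnicodeEncodeError for code points above U+00FF."""
    return text.encode("latin-1")


# Arbitrary binary data; 0xFF can never appear in valid UTF-8.
raw = bytes([0x00, 0xFF, 0x80, 0x41, 0xC3, 0x28])

try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

text = bytes_to_text(raw)          # a str with one character per input byte
restored = text_to_bytes(text)     # round-trips back to the original bytes
```

The resulting strings are valid Unicode, so they can be passed to the tokenizer in place of the raw data. Two caveats: bytes 0x80-0xFF become code points that take two bytes each in UTF-8, so the byte-level alphabet the tokenizer sees is not identical to your original bytes, and any normalization or whitespace handling in the tokenizer pipeline may alter control characters, so it is worth checking that decoding the tokens reproduces your input exactly.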
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.