
Support for binary data

Open pdeblanc opened this issue 5 years ago • 2 comments

I'm trying to train a transformer on a binary dataset, but ByteLevelBPETokenizer raises Exception: stream did not contain valid UTF-8. It seems to me that a byte-level tokenizer could operate agnostically on raw bytes rather than trying to decode Unicode strings -- is there no support for binary data?

pdeblanc avatar Oct 20 '20 00:10 pdeblanc
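A minimal sketch of the reported failure, assuming the Python tokenizers package and a hypothetical file data.bin containing arbitrary (non-UTF-8) bytes:

```python
from tokenizers import ByteLevelBPETokenizer

# Write some arbitrary bytes that do not form valid UTF-8.
with open("data.bin", "wb") as f:
    f.write(bytes(range(256)))

tokenizer = ByteLevelBPETokenizer()
# Raises: Exception: stream did not contain valid UTF-8
tokenizer.train(files=["data.bin"], vocab_size=1000)
```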

Indeed, there is no support for binary data. The "byte-level" here refers to handling Unicode text at the level of its UTF-8 bytes, as opposed to Unicode code points, so the input still has to be valid UTF-8.

n1t0 avatar Oct 20 '20 19:10 n1t0
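One possible workaround (not an official feature of the library, just a sketch): decode each chunk of binary data with latin-1, which maps every byte value 0-255 to a single Unicode code point without ever failing, and train on the resulting strings. The tokenizer then operates on the UTF-8 bytes of that decoded text rather than on the original bytes directly, but the mapping is lossless, so the original data can be recovered by re-encoding decoded tokens with latin-1. The file name data.bin is a placeholder.

```python
from tokenizers import ByteLevelBPETokenizer

def binary_chunks_as_text(path, chunk_size=4096):
    # latin-1 decoding never raises, so every chunk of raw bytes
    # becomes a valid Python string of the same length.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk.decode("latin-1")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(binary_chunks_as_text("data.bin"), vocab_size=1000)

# Encode a few raw bytes by first passing them through the same latin-1 mapping.
ids = tokenizer.encode(b"\x00\xff\x10".decode("latin-1")).ids
```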

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

github-actions[bot] avatar May 10 '24 01:05 github-actions[bot]