Changing decoding method in StreamReader

Open is-jlehrer opened this issue 1 year ago • 2 comments

🐛 Describe the bug

Hi,

When decoding from a file stream in StreamReader, torchdata automatically assumes the incoming bytes are UTF-8. With any other encoding this raises an error (in my case, UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 3: invalid continuation byte). How do we change the decoding method to fit the particular data stream?

Versions

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.0
[pip3] pytorch-lightning==1.6.4
[pip3] torch==1.11.0
[pip3] torchdata==0.3.0
[pip3] torchmetrics==0.9.1
[pip3] torchvision==0.12.0
[conda] numpy                     1.23.0                   pypi_0    pypi
[conda] pytorch-lightning         1.6.4                    pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchdata                 0.3.0                    pypi_0    pypi
[conda] torchmetrics              0.9.1                    pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi

is-jlehrer, Jul 27 '22 00:07

To be more specific, is there no way to read from StreamReader as bytes?

is-jlehrer, Jul 27 '22 00:07

It depends on how you open the file, rather than on StreamReader. If you use FileOpener (functional API: open_files), you can set the mode to b to open the file in binary mode, so that StreamReader reads bytes instead of decoded text.
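
The point above, that decoding is decided when the file is opened and not when it is read, can be sketched with the standard library alone. This is only an illustration under assumptions: the sample bytes, the temporary file, and the choice of Latin-1 are hypothetical, picked to reproduce an error like the one reported. With torchdata, the analogous step would presumably be passing mode="b" to FileOpener before StreamReader and decoding the returned bytes yourself; check the FileOpener documentation for your version.

```python
import os
import tempfile

# Hypothetical sample data: 0xEC is "ì" in Latin-1, but in UTF-8 it starts a
# multi-byte sequence, and the following "d" is not a valid continuation byte.
raw = b"abc\xecd\n"

# Write the bytes to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode with UTF-8: the decode happens inside read() and fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Binary mode: read() returns the raw bytes untouched, no decoding at all.
    with open(path, "rb") as f:
        data = f.read()

    # Decode explicitly with the encoding the data actually uses
    # (Latin-1 here is an assumption made for this example).
    text = data.decode("latin-1")
finally:
    os.remove(path)
```

Reading bytes first keeps the choice of encoding in your own code, which helps when different files in the same pipeline use different encodings.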

ejguan, Jul 27 '22 13:07