
Zipped csv file cannot be parsed using torchdata’s CSVParser datapipe

Open · seunggs opened this issue 2 years ago · 1 comment

🐛 Describe the bug

I’m trying to parse a single zipped CSV file stored in AWS S3, but I get the following error:

Exception when executing new-line character seen in unquoted field - do you need to open the file in universal-newline mode?\nThis exception is thrown by __iter__ of CSVParserIterDataPipe(fmtparams={'delimiter': ','}, source_datapipe=ZipArchiveLoaderIterDataPipe)

Here’s my code:

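from torchdata.datapipes.iter import IterableWrapper

# List the objects under the S3 path, then open each one as a binary stream.
dp = IterableWrapper(["s3://..."]).list_files_by_fsspec()
dp = dp.open_files_by_fsspec(mode="rb")
# Extract the members of each zip archive.
dp = dp.load_from_zip()
# Parse each extracted member as comma-separated text.
dp = dp.parse_csv(delimiter=",")

for _, row in dp:
    print(row)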

It looks like this error is due to how the zip file was created (I downloaded it from Kaggle), but I’m not sure what the solution is. Changing the mode to “rU” throws an Unimplemented error.

Using a CSV file directly (rather than a zipped CSV) works as expected.
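One possible cause, not confirmed anywhere in this thread, is that the CSV inside the archive uses bare carriage returns (`\r`) as line endings. A binary stream splits lines only on `\n`, so the whole file would reach the csv module as a single line with embedded `\r` characters, which triggers this error. The sketch below reproduces that situation with the standard library only (no torchdata or S3); the archive contents and the helper names `make_zip_with_cr_csv`, `parse_naive`, and `parse_universal` are hypothetical.

```python
import csv
import io
import zipfile


def make_zip_with_cr_csv() -> bytes:
    """Build an in-memory zip holding a CSV with bare-CR line endings (assumed cause)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.csv", b"a,b\rc,d\r")
    return buf.getvalue()


def parse_naive(zip_bytes: bytes) -> list:
    """Iterate the binary member line by line and decode each line.

    Binary iteration splits only on b"\\n", so the bare CRs stay embedded
    in one long line and csv.reader raises csv.Error.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open("data.csv") as member:
            lines = (line.decode("utf-8") for line in member)
            return list(csv.reader(lines))


def parse_universal(zip_bytes: bytes) -> list:
    """Wrap the binary member in a text layer with newline="".

    newline="" makes the text layer split lines on \\r, \\n, and \\r\\n
    without translating them, which is what the csv module expects.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open("data.csv") as member:
            with io.TextIOWrapper(member, encoding="utf-8", newline="") as text:
                return list(csv.reader(text))


if __name__ == "__main__":
    archive = make_zip_with_cr_csv()
    try:
        parse_naive(archive)
    except csv.Error as exc:
        print("naive parse failed:", exc)
    print(parse_universal(archive))
```

If the real file does have this property, it would also be consistent with the plain CSV working when it is opened in text mode, since text mode applies universal-newline handling. Whether the same text wrapper can be applied to the streams that load_from_zip yields has not been checked here.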

Versions

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.10 (main, Mar  8 2023, 15:25:33) [Clang 14.0.0 (clang-1400.0.29.202)] (64-bit runtime)
Python platform: macOS-13.3.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchdata==0.6.0
[pip3] torchvision==0.15.1
[conda] blas                      1.0                         mkl  
[conda] mkl                       2021.4.0           hecd8cb5_637  
[conda] mkl-service               2.4.0            py37h9ed2024_0  
[conda] mkl_fft                   1.3.1            py37h4ab4a9b_0  
[conda] mkl_random                1.2.2            py37hb2f4e1b_0  
[conda] numpy                     1.21.2           py37h4b4dc7a_0  
[conda] numpy-base                1.21.2           py37he0bd621_0  
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1  
[conda] pytorch                   1.5.1                   py3.7_0    pytorch
[conda] pytorch-lightning         1.1.7                    pypi_0    pypi
[conda] torchvision               0.6.1                  py37_cpu    pytorch

seunggs, May 24 '23 17:05

Try adding dialect=csv.excel_tab to dp.parse_csv?

ejguan, May 24 '23 17:05
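For anyone evaluating the suggestion above: csv.excel_tab is the standard library's tab-delimited dialect, and explicit format parameters such as delimiter override whatever the dialect sets. The sketch below shows that behavior with plain csv.reader; it does not call torchdata, and `parse_line` is a hypothetical helper used only for illustration.

```python
import csv


def parse_line(line: str, **kwargs) -> list:
    """Parse a single line of text with csv.reader and return its fields."""
    return next(csv.reader([line], **kwargs))


# excel_tab splits on tabs, so a comma-separated line stays one field.
comma_row = parse_line("a,b", dialect=csv.excel_tab)

# A tab-separated line is split as expected.
tab_row = parse_line("a\tb", dialect=csv.excel_tab)

# An explicit delimiter takes precedence over the dialect's delimiter.
override_row = parse_line("a,b", dialect=csv.excel_tab, delimiter=",")

if __name__ == "__main__":
    print(comma_row, tab_row, override_row)
```

The dialect's lineterminator only affects writing; the reader ignores it, so switching dialects alone would not change how a line with an embedded bare `\r` is handled.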