arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-16000: [C++][Python] Dataset: Alternative implementation for adding transcoding function option to CSV scanner

Open joosthooz opened this issue 3 years ago • 3 comments

This is an alternative version of https://github.com/apache/arrow/pull/13709, to compare what the best approach is.

Instead of extending the C++ ReadOptions struct with an encoding field, this implementations adds a python version of the ReadOptions object to both CsvFileFormat and CsvFragmentScanOptions. The reason it is needed in both places, is to prevent these kinds of inconsistencies:

>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro =csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'

joosthooz avatar Aug 09 '22 08:08 joosthooz

https://issues.apache.org/jira/browse/ARROW-16000

github-actions[bot] avatar Aug 09 '22 08:08 github-actions[bot]

Ok, I'll close the other PR then, and let's focus on this. I'm seeing some errors in pytest, but I'm also getting those on the commit that I've branched off of. But more concerning is this error in a Windows build here https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/44413284/job/ic6l7a0ehrnpk21g#L2654 :

FAILED: lib.cp38-win_amd64.pyd 
cmd.exe /C "cd . && C:\Miniconda37-x64\envs\arrow\Library\bin\cmake.exe -E vs_link_dll --intdir=CMakeFiles\lib.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100183~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100183~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo CMakeFiles\lib.dir\lib.cpp.obj  /out:lib.cp38-win_amd64.pyd /implib:lib.lib /pdb:lib.pdb /dll /version:0.0 /machine:x64  /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO  C:\Miniconda37-x64\envs\arrow\libs\python38.lib  C:\Miniconda37-x64\envs\arrow\Library\lib\arrow.lib  C:\Miniconda37-x64\envs\arrow\Library\lib\arrow_python.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib  && cd ."
LINK: command "C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo CMakeFiles\lib.dir\lib.cpp.obj /out:lib.cp38-win_amd64.pyd /implib:lib.lib /pdb:lib.pdb /dll /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO C:\Miniconda37-x64\envs\arrow\libs\python38.lib C:\Miniconda37-x64\envs\arrow\Library\lib\arrow.lib C:\Miniconda37-x64\envs\arrow\Library\lib\arrow_python.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:lib.cp38-win_amd64.pyd.manifest" failed (exit code 1120) with the following output:
   Creating library lib.lib and object lib.exp
lib.cpp.obj : error LNK2019: unresolved external symbol "class std::shared_ptr<class std::function<class arrow::Result<class std::shared_ptr<class arrow::io::InputStream> > __cdecl(class std::shared_ptr<class arrow::io::InputStream>)> > __cdecl arrow::py::MakeStreamTransformFunc(struct arrow::py::TransformInputStreamVTable,struct _object *)" (?MakeStreamTransformFunc@py@arrow@@YA?AV?$shared_ptr@V?$function@$$A6A?AV?$Result@V?$shared_ptr@VInputStream@io@arrow@@@std@@@arrow@@V?$shared_ptr@VInputStream@io@arrow@@@std@@@Z@std@@@std@@UTransformInputStreamVTable@12@PEAU_object@@@Z) referenced in function "class std::shared_ptr<class std::function<class arrow::Result<class std::shared_ptr<class arrow::io::InputStream> > __cdecl(class std::shared_ptr<class arrow::io::InputStream>)> > __cdecl __pyx_f_7pyarrow_3lib_make_streamwrap_func(struct _object *,struct _object *)" (?__pyx_f_7pyarrow_3lib_make_streamwrap_func@@YA?AV?$shared_ptr@V?$function@$$A6A?AV?$Result@V?$shared_ptr@VInputStream@io@arrow@@@std@@@arrow@@V?$shared_ptr@VInputStream@io@arrow@@@std@@@Z@std@@@std@@PEAU_object@@0@Z)
lib.cp38-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals

It seems like the linker is not able to locate the added C++ function arrow::py::MakeStreamTransformFunc when called from the generated cython code. I do notice that the namespace in C++ is just arrow, not arrow::py, but the same goes for MakeTransformInputStream so I don't know why the discrepancy should cause problems for one but not the other.

joosthooz avatar Aug 09 '22 14:08 joosthooz

There is 1 failure in the CI, it seems to have been a strange time-out.

joosthooz avatar Aug 10 '22 10:08 joosthooz

I added a test by copying and modifying the one for the csv reader, but ran into the following problem: The test for latin-1 encoding works fine. However, the UTF16 test fails because there seems to be something wrong with the schema detection. The equality test between the detected pyarrow.schema and the expected schema fails, even if they seem to be identical from a python point of view (the diffs that pytest prints are identical). Adding some printfs in the C++ code shows that the fields of dataset.schema seem to be empty:

Schema Equals():
this fp: 'S{Fn', other fp: 'S{Fna{@N};Fnb{@O};L}'
this field(0): '', other field(0): 'a: string'

But this looks to me like a different problem with detecting the schema of a UTF16 encoded file. Should I try to create a reproducible example and file a new JIRA? Or is this something we should address here? In the meantime, I removed the test in 47a3462

joosthooz avatar Aug 15 '22 11:08 joosthooz

That sounds like dataset inspection is being done without the transcoder actually being set. I think we do need the test to work. I would expect latin-1 happens to work because it happens that the header row has identical encoding between UTF-8 and latin-1.

lidavidm avatar Aug 15 '22 11:08 lidavidm

I added a test with a non-utf8 character in the column name (latin-1). That works. Looks like something a bit more specific to utf16. I'll investigate further.

joosthooz avatar Aug 16 '22 11:08 joosthooz

After having a better look, here's what seems to be happening:

  • The first part of the test checks if parsing the file as binary still works. But that doesn't work for utf16 because the column names are not utf8. So parsing the column names into the schema fails (silently!).
  • The second part tries to read the file, without specifying an encoding. It expects an exception. However, apparently the dataset reader has no problems with the null values every other character; it will just interpret it as a strange utf8 string.

I've removed those 2 additional checks, and just check if the data is transcoded properly. The 2nd check is still present in the new test_column_names_encoding test (that only tests latin-1)

joosthooz avatar Aug 16 '22 13:08 joosthooz

Hi guys, I think the current state should be ok, the failures seem unrelated to me. Is there anything else left for me to do?

joosthooz avatar Aug 19 '22 10:08 joosthooz

Anything I can do to help move this forward? The failure seems unrelated to me (cat: r/check/arrow.Rcheck/00install.out: No such file or directory)

joosthooz avatar Sep 06 '22 18:09 joosthooz

Benchmark runs are scheduled for baseline = a5ecb0ff0774805b0f912e231eaedf42e7194c36 and contender = cbf0ec0d05fe6301988f3b8f02ea39fead788f6c. cbf0ec0d05fe6301988f3b8f02ea39fead788f6c is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.61% :arrow_up:0.41%] test-mac-arm [Failed :arrow_down:2.74% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.39% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] cbf0ec0d ec2-t3-xlarge-us-east-2 [Failed] cbf0ec0d test-mac-arm [Failed] cbf0ec0d ursa-i9-9960x [Finished] cbf0ec0d ursa-thinkcentre-m75q [Finished] a5ecb0ff ec2-t3-xlarge-us-east-2 [Failed] a5ecb0ff test-mac-arm [Failed] a5ecb0ff ursa-i9-9960x [Finished] a5ecb0ff ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Sep 07 '22 05:09 ursabot

['Python', 'R'] benchmarks have high level of regressions. ursa-i9-9960x

ursabot avatar Sep 07 '22 05:09 ursabot