TileDB
TileDB copied to clipboard
Extend Arrow support to cover nullable data
The arrowio
header provides import export support from/to TileDB and Arrow with its interface of two void*
pointers. This PR extends the support to cover 'nullable' aka 'validity map' data.
The PR is in need of some tests but the existing tests is a little involved between Python, pybind11 and C++ so @ihnorton has kindly 'volunteered' to add this.
The PR will remain a draft til we have tests.
TYPE: FEATURE DESC: Nullable support for Arrow
This pull request has been linked to Shortcut Story #27472: Extend arrow i/o operations to nullable columns.
Looks like a few msvc errors, https://github.com/TileDB-Inc/TileDB/actions/runs/4758463351/jobs/8456556083?pr=4049#step:10:593
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(624,15): error C3863: array type 'uint8_t [n]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(650,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(651,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(657,24): error C3863: array type 'int64_t [num_offsets+1]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
I also saw an Ubuntu build fail but figure that would be spurious. My changes so far are quite gingerly and limited. Let's see what happens once new tests get injected.
@eddelbuettel if you see my last commit, we were basically never testing col sizes greater than zero. My change cause the test to fail with the message:
WriterBase: Invalid offsets for attribute utf_string1; the last offset: 0 is
not equal to the size of the data buffer: 71
or something similar.
Do you have any idea? Spent some hours but couldn't find any solution.
Howdy. Love that you are trying to resurrect the PR. Theses days it would probably be better to wrap some of the code in the nanoarrow (header-only) helper functions we already use in tiledb-r and tiledbsoma (mostly the R parts). See https://github.com/apache/arrow-nanoarrow and check out the recently expanded python side (to manipulate objects; nanoarrow is really three packages namely the header-only C/C++ helper, a somewhat mature R package to work on object s and a growing Python package).
I think I wrote this PR after I had similar work in the R package where I since have changed things. (Just took a look.) Hm, that is on the other hand quite motivated by SOMA's libtiledbsoma and its ColumnBuffers etc.
My pedestrian approach has usually been to setup a similar mock function standalone, or as an R extension, and ten abstract out the pure C++ side for Core.
Back to your PR: I honestly do not know what the col size is mad about. Is there something driving this now and/or urgency?