TileDB icon indicating copy to clipboard operation
TileDB copied to clipboard

Extend Arrow support to cover nullable data

Open eddelbuettel opened this issue 1 year ago • 5 comments

The arrowio header provides import export support from/to TileDB and Arrow with its interface of two void* pointers. This PR extends the support to cover 'nullable' aka 'validity map' data.

The PR is in need of some tests but the existing tests is a little involved between Python, pybind11 and C++ so @ihnorton has kindly 'volunteered' to add this.

The PR will remain a draft til we have tests.


TYPE: FEATURE DESC: Nullable support for Arrow

eddelbuettel avatar Apr 20 '23 20:04 eddelbuettel

Looks like a few msvc errors, https://github.com/TileDB-Inc/TileDB/actions/runs/4758463351/jobs/8456556083?pr=4049#step:10:593

 D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(624,15): error C3863: array type 'uint8_t [n]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(650,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(651,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(657,24): error C3863: array type 'int64_t [num_offsets+1]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]

Shelnutt2 avatar Apr 21 '23 02:04 Shelnutt2

I also saw an Ubuntu build fail but figure that would be spurious. My changes so far are quite gingerly and limited. Let's see what happens once new tests get injected.

eddelbuettel avatar Apr 21 '23 02:04 eddelbuettel

@eddelbuettel if you see my last commit, we were basically never testing col sizes greater than zero. My change cause the test to fail with the message:

  WriterBase: Invalid offsets for attribute utf_string1; the last offset: 0 is
  not equal to the size of the data buffer: 71

or something similar.

Do you have any idea? Spent some hours but couldn't find any solution.

kounelisagis avatar Jun 06 '24 18:06 kounelisagis

Howdy. Love that you are trying to resurrect the PR. Theses days it would probably be better to wrap some of the code in the nanoarrow (header-only) helper functions we already use in tiledb-r and tiledbsoma (mostly the R parts). See https://github.com/apache/arrow-nanoarrow and check out the recently expanded python side (to manipulate objects; nanoarrow is really three packages namely the header-only C/C++ helper, a somewhat mature R package to work on object s and a growing Python package).

I think I wrote this PR after I had similar work in the R package where I since have changed things. (Just took a look.) Hm, that is on the other hand quite motivated by SOMA's libtiledbsoma and its ColumnBuffers etc.

My pedestrian approach has usually been to setup a similar mock function standalone, or as an R extension, and ten abstract out the pure C++ side for Core.

Back to your PR: I honestly do not know what the col size is mad about. Is there something driving this now and/or urgency?

eddelbuettel avatar Jun 06 '24 18:06 eddelbuettel