Implement C Data integration

Open quinnj opened this issue 4 years ago • 2 comments

This starts work towards supporting the C data interface for the Arrow format, as documented here.

This PR currently includes struct definitions and basic methods for getting a pointer to an ArrowSchema/ArrowArray C-compatible struct that another implementation can then populate. For example, with this PR, you can do:

using Arrow, PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
sch = Arrow.CData.getschema() do ptr
    rb.schema._export_to_c(Int(ptr))
end
arr = Arrow.CData.getarray() do ptr
    rb._export_to_c(Int(ptr))
end
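
For readers unfamiliar with the C data interface, here is a rough sketch of what C-compatible mirrors of the two structs could look like in Julia. The field order and C types follow the `struct ArrowSchema` and `struct ArrowArray` declarations in the Arrow C data interface spec; the Julia field types chosen here (for example `Ptr{UInt8}` for `metadata`) are assumptions for illustration and may not match the definitions in this PR.

```julia
# Sketch only: Julia mirrors of the C structs from the Arrow C data interface.
# Field order must match the C declarations exactly so the memory layout agrees.
struct CArrowSchema
    format::Cstring                       # type format string, e.g. "i" for int32
    name::Cstring                         # field name, may be NULL
    metadata::Ptr{UInt8}                  # binary-encoded key/value metadata, may be NULL
    flags::Int64
    n_children::Int64
    children::Ptr{Ptr{CArrowSchema}}      # array of pointers to child schemas
    dictionary::Ptr{CArrowSchema}         # dictionary value type, NULL if not dictionary-encoded
    release::Ptr{Cvoid}                   # release callback set by the producer
    private_data::Ptr{Cvoid}              # opaque producer-owned data
end

struct CArrowArray
    length::Int64
    null_count::Int64
    offset::Int64
    n_buffers::Int64
    n_children::Int64
    buffers::Ptr{Ptr{Cvoid}}              # array of pointers to the physical buffers
    children::Ptr{Ptr{CArrowArray}}       # array of pointers to child arrays
    dictionary::Ptr{CArrowArray}          # dictionary values, NULL if not dictionary-encoded
    release::Ptr{Cvoid}                   # release callback set by the producer
    private_data::Ptr{Cvoid}              # opaque producer-owned data
end
```

Because both structs contain only integers and pointers, they are plain bits types, so a pointer to one can be handed to another implementation (as in the `_export_to_c` calls above) and filled in place.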

Currently, these ArrowSchema/ArrowArray structs are pretty bare-bones, but they at least lay some groundwork for integration. Things we still need/want to make all this nicer to use:

  • Type format string parsing/converting: we need to parse the type format strings as outlined here to figure out what type of data we'll get in the arrays. It'd probably be best to add a type field to the ArrowSchema struct that we'd populate when converting from CArrowSchema -> ArrowSchema.
  • Add a method like Arrow.ArrowVector(::ArrowSchema, ::ArrowArray) that produces a concrete ArrowVector subtype, like Arrow.Primitive, Arrow.List, etc. This will be a bit tricky, because we have to follow all the same columnar layout trickery that we currently handle for IPC in the table.jl build methods. Perhaps we can refactor all that so we can re-use some code? Otherwise, we might just need to reimplement a bunch of that logic specifically for converting ArrowArrays.
  • That should give a robust consuming story; for producing, we probably need a definition like Arrow.ArrowSchema(a::Arrow.ArrowVector) that produces a valid ArrowSchema, and then overloads per ArrowVector subtype, like Arrow.ArrowArray(x::Arrow.Primitive), that produce the right ArrowArray for a concrete arrow array.
  • Then the last piece we need is figuring out the right mechanics for providing a pointer to the CArrowSchema/CArrowArray structs once they're populated.
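
To make the first item concrete, here is a minimal sketch of format string parsing. The helper name `parse_format`, the `PRIMITIVE_FORMATS` table, and the tuple/symbol return values are all hypothetical and not part of Arrow.jl; the format characters themselves are taken from the C data interface spec. Mapping the null type `"n"` to `Missing` is an assumption, and temporal, union, and other formats are deliberately left out.

```julia
# Hypothetical helper, not part of Arrow.jl: map a C data interface
# format string to a Julia-side description of the element type.
const PRIMITIVE_FORMATS = Dict{String,DataType}(
    "n" => Missing,                      # null type (assumed mapping)
    "b" => Bool,
    "c" => Int8,    "C" => UInt8,
    "s" => Int16,   "S" => UInt16,
    "i" => Int32,   "I" => UInt32,
    "l" => Int64,   "L" => UInt64,
    "e" => Float16, "f" => Float32, "g" => Float64,
)

function parse_format(fmt::AbstractString)
    # Fixed-width primitives map directly to Julia types.
    haskey(PRIMITIVE_FORMATS, fmt) && return PRIMITIVE_FORMATS[fmt]
    # Variable-length strings and binary (regular and large offsets).
    fmt in ("u", "U") && return String
    fmt in ("z", "Z") && return Vector{UInt8}
    # Nested layouts: child types come from the schema's children, not the string.
    fmt in ("+l", "+L") && return :list
    fmt == "+s" && return :struct
    fmt == "+m" && return :map
    # Parameterized formats carry their parameters after a colon.
    if startswith(fmt, "w:")
        return (:fixed_size_binary, parse(Int, fmt[3:end]))
    elseif startswith(fmt, "+w:")
        return (:fixed_size_list, parse(Int, fmt[4:end]))
    elseif startswith(fmt, "d:")
        # "d:precision,scale" or "d:precision,scale,bitwidth" (default 128 bits)
        parts = parse.(Int, split(fmt[3:end], ','))
        bitwidth = length(parts) == 3 ? parts[3] : 128
        return (:decimal, parts[1], parts[2], bitwidth)
    end
    throw(ArgumentError("unsupported format string: $fmt"))
end
```

A real implementation would likely return proper Arrow.jl type objects rather than symbols and tuples, and would recurse into the schema's children for nested types; this only shows the shape of the dispatch.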

If anyone would like to help out, I'm happy to provide as much guidance as possible so others can get their feet wet in some arrow spec nitty-gritty.

quinnj avatar Apr 16 '21 05:04 quinnj

cc: @sa-, @Moelf

quinnj avatar Apr 16 '21 05:04 quinnj

Codecov Report

Merging #178 (005c946) into main (bdd0e54) will decrease coverage by 2.19%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
- Coverage   81.34%   79.15%   -2.20%     
==========================================
  Files          25       26       +1     
  Lines        3034     3118      +84     
==========================================
  Hits         2468     2468              
- Misses        566      650      +84     

Impacted Files       Coverage Δ
src/Arrow.jl         54.54% <ø> (ø)
src/cinterface.jl    0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update bdd0e54...005c946.

codecov[bot] avatar Apr 16 '21 05:04 codecov[bot]