arrow-julia
Implement C Data integration
This starts work towards supporting the C data interface for the Arrow format, as documented in the Arrow C data interface specification.
Currently, this PR includes struct definitions and basic
methods to allow getting a pointer to an ArrowSchema/ArrowArray
C-compatible struct that can then be populated by another
implementation. For example, with this PR, you can do:
```julia
using Arrow, PyCall

pd = pyimport("pandas")
pa = pyimport("pyarrow")
df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
sch = Arrow.CData.getschema() do ptr
    rb.schema._export_to_c(Int(ptr))
end
arr = Arrow.CData.getarray() do ptr
    rb._export_to_c(Int(ptr))
end
```
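For anyone who hasn't looked at the C data interface before, here is a rough sketch of what a C-compatible schema struct and a `do`-block pointer helper can look like in Julia. The field order follows the `struct ArrowSchema` layout from the spec; the names `CArrowSchemaSketch` and `getschema_sketch` are made up for illustration and are not the definitions in this PR.

```julia
# Sketch only: mirrors the field order of `struct ArrowSchema` in the
# C data interface spec. Hypothetical name, not the type added by this PR.
struct CArrowSchemaSketch
    format::Ptr{UInt8}        # const char* format
    name::Ptr{UInt8}          # const char* name
    metadata::Ptr{UInt8}      # const char* metadata (binary-encoded)
    flags::Int64
    n_children::Int64
    children::Ptr{Ptr{CArrowSchemaSketch}}
    dictionary::Ptr{CArrowSchemaSketch}
    release::Ptr{Cvoid}       # void (*release)(struct ArrowSchema*)
    private_data::Ptr{Cvoid}
end

# Hypothetical helper: allocate storage for one struct, hand a raw pointer
# to the callback `f` so a producer can fill it in, then return the value.
function getschema_sketch(f)
    ref = Ref{CArrowSchemaSketch}()   # uninitialized storage for one struct
    GC.@preserve ref begin
        ptr = Base.unsafe_convert(Ptr{CArrowSchemaSketch}, ref)
        f(ptr)
    end
    return ref[]
end
```

The producer is expected to write every field before the callback returns. A real implementation also has to call the `release` callback once it is done with the data, which this sketch leaves out.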
Currently, these ArrowSchema/ArrowArray structs are pretty bare
bones, but this at least lays some groundwork for integration. Things we
still need/want to make all this nicer to use/work with:
- Type format string parsing/converting: we need to parse the type
  format strings as outlined in the C data interface documentation
  to figure out what type of data we'll get in the arrays. It'd
  probably be best to add a `type` field to the `ArrowSchema` struct
  that we'd populate when converting from `CArrowSchema` -> `ArrowSchema`
- Add a method like `Arrow.ArrowVector(::ArrowSchema, ::ArrowArray)`
  that produces a concrete `ArrowVector` subtype, like `Arrow.Primitive`,
  `Arrow.List`, etc. This will be a bit tricky, because we have to follow
  all the same columnar layout trickery that we currently handle for IPC
  in the `table.jl` `build` methods. Perhaps we can refactor all that so
  we can re-use some code? Otherwise, we might just need to reimplement a
  bunch of that logic specific to converting `ArrowArray`s.
- That should give a robust consuming story; for producing, we
  probably need a definition like `Arrow.ArrowSchema(a::Arrow.ArrowVector)`
  that produces a valid `ArrowSchema`, and then overloads per `ArrowVector`
  subtype like `Arrow.ArrowArray(x::Arrow.Primitive)` that produce the
  right `ArrowArray` for a concrete arrow array
- Then the last piece we need is just figuring out the right mechanics
  for providing a pointer to the `CArrowSchema`, `CArrowArray` structs
  once they're populated
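To make the format string item a bit more concrete, here is a minimal sketch of what the parsing could look like. It only covers a small subset of the format strings listed in the C data interface spec, and `parse_format_sketch` and `PRIMITIVE_FORMATS` are hypothetical names that don't exist in Arrow.jl.

```julia
# Sketch only: fixed-width primitive format characters from the spec,
# mapped to the Julia element type we'd expect in the data buffer.
const PRIMITIVE_FORMATS = Dict{String,DataType}(
    "c" => Int8,    "C" => UInt8,
    "s" => Int16,   "S" => UInt16,
    "i" => Int32,   "I" => UInt32,
    "l" => Int64,   "L" => UInt64,
    "e" => Float16, "f" => Float32, "g" => Float64,
)

# Hypothetical helper: returns a (kind, parameter) tuple for a format string.
function parse_format_sketch(fmt::AbstractString)
    if haskey(PRIMITIVE_FORMATS, fmt)
        return (:primitive, PRIMITIVE_FORMATS[fmt])
    elseif fmt == "n"
        return (:null, Missing)
    elseif fmt == "b"
        return (:bool, Bool)            # bit-packed in Arrow
    elseif fmt == "u"
        return (:utf8, String)
    elseif fmt == "+l"
        return (:list, nothing)         # element type comes from the child schema
    elseif fmt == "+s"
        return (:struct, nothing)       # field types come from the child schemas
    elseif startswith(fmt, "+w:")
        return (:fixed_size_list, parse(Int, fmt[4:end]))
    elseif startswith(fmt, "w:")
        return (:fixed_size_binary, parse(Int, fmt[3:end]))
    else
        throw(ArgumentError("unsupported format string: $fmt"))
    end
end
```

For example, `parse_format_sketch("l")` returns `(:primitive, Int64)`. Nested types only describe their own layout in the format string, so a full implementation would have to recurse into the `children` pointers of the schema; decimals, temporal types, unions, maps, and the large variants are left out here.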
If anyone would like to help out, I'm happy to provide as much guidance as possible so others can get their feet wet in some arrow spec nitty-gritty.
cc: @sa-, @Moelf
Codecov Report
Merging #178 (005c946) into main (bdd0e54) will decrease coverage by
2.19%. The diff coverage is 0.00%.
```diff
@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
- Coverage   81.34%   79.15%   -2.20%
==========================================
  Files          25       26       +1
  Lines        3034     3118      +84
==========================================
  Hits         2468     2468
- Misses        566      650      +84
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Arrow.jl | 54.54% <ø> (ø) | |
| src/cinterface.jl | 0.00% <0.00%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update bdd0e54...005c946.