arrow-julia
arrow-julia copied to clipboard
Re-use PyArrow memory via PyCall
Hi @quinnj , thank you for just willing to consider this wild attempt! The only pkg you need to re-create is PyArrow and awkward-1.0 on the python side and PyCall.jl on Julia side.
Create example arr
:
julia> using PyCall
julia> ak = pyimport("awkward");
julia> arr = ak.Array(py"[[1,2,3], [], [4,5]]")
PyObject <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>
julia> arr.layout
PyObject <ListOffsetArray64>
<offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x0000037ff330"/></offsets>
<content><NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x000003a256a0"/></content>
</ListOffsetArray64>
Then you can get an pyarrow object via:
julia> arr_arrow = ak.to_arrow(arr)
PyObject <pyarrow.lib.ListArray object at 0x7f2144343048>
...
..
julia> @time [Int64[x...] for x in arr]
0.034800 seconds (38.72 k allocations: 2.031 MiB)
3-element Array{Array{Int64,1},1}:
[1, 2, 3]
[]
[4, 5]
Currently the fastest / least copy method of re-using as been:
function view_ak(arr)
c = PyArray(arr.layout."content")
o = PyArray(arr.layout."offsets")
@views [c[o[i]+1:o[i+1]] for i in 1:length(o)-1]
end
julia> @time view_ak(arr)
0.000089 seconds (37 allocations: 1.609 KiB)
3-element Array{SubArray{Int64,1,PyArray{Int64,1},Tuple{UnitRange{Int64}},false},1}:
[1, 2, 3]
0-element view(::PyArray{Int64,1}, 4:3) with eltype Int64
[4, 5]
This looks super interesting. I've been working to find a way to reuse c++ memory in Julia but I've been running into some roadblocks. Once we do this, it should be possible to use all of c++'s arrow functionality including their crazy fast parquet reader and gandiva, and then create a very performant arrow based DataFrame library on top of it.
@Moelf Are you familiar with c++?
Sorry for the slow response here, but here's one way we could convert between the awkward array and a Julia array:
julia> off = Arrow.Offsets(UInt8[], arr.layout.offsets)
3-element Arrow.Offsets{Int64}:
(1, 3)
(4, 3)
(4, 5)
julia> @time list = Arrow.List{Vector{Int64}, Int64, typeof(arr.layout.content)}(UInt8[], Arrow.ValidityBitmap(UInt8[], 1, 0, 0), off, arr.layout.content, arr.__len__(), nothing)
0.000102 seconds (64 allocations: 3.422 KiB)
3-element Arrow.List{Vector{Int64}, Int64, Vector{Int64}}:
[1, 2, 3]
0-element view(::Vector{Int64}, 4:3) with eltype Int64
[4, 5]
I'm by no means a PyCall.jl expert, so it's unclear to me if, when we do arr.layout.offsets
, a copy of the data is made; judging from the allocations in the @time
invocation, I'd guess that a copy is being made. There's perhaps a PyCall.jl incantation to avoid making a copy, which would be ideal. Otherwise, perhaps there's a way to get an actual pointer to the awkward offsets/content arrays which we could unsafe_wrap
to avoid allocating.
In general though, it's going to be, IMO, practically impossible to try and re-use arrow memory at the individual column/array level. There are two many factors that would complicate things. On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will), is a main use-case for the arrow format. So if you had an arrow IPC stream written to a memory buffer in c++ and were able to pass the pointer + len to Julia, we'd be able to do Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len))
and that should work just fine.
Hey @quinnj ,
Thanks for responding to the issue! I've actually looked into this issue a bit more, it looks like implementing the C Data Interface would solve this for us, since that is its main use case: https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/
This link shows and example of reusing arrow memory in r and python.
I'm happy to contribute as well if I can get some pointers on where to start
I've been able to create Julia struct for the C Data Interface using Clang.jl, but I'm a bit confused about accessing the C Data Interface on the python side. @Moelf would you know how to do this?
I posted this snippet in Slack, but I think I'll post it here for posterity too. I've made some progress with the C Data Interface but I'm stuck at converting the C Pointer to a Julia struct. Can't quite figure out what I'm doing wrong, but I would appreciate it if someone had a look.
These are the resources I used to build this snippet:
- https://arrow.apache.org/docs/format/CDataInterface.html
- https://gist.github.com/wesm/d48908018c4b7a0d9789a31d10caf525 (click "Display the rendered blob" on the top right)
#ENV["PYTHON"] = "/usr/local/opt/[email protected]/bin/python3"
#using Pkg
#Pkg.build("PyCall")
using PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
ffi = pyimport("pyarrow.cffi").ffi
##
const ARROW_FLAG_DICTIONARY_ORDERED = 1
const ARROW_FLAG_NULLABLE = 2
const ARROW_FLAG_MAP_KEYS_SORTED = 4
struct ArrowSchema
format::Cstring
name::Cstring
metadata::Cstring
flags::Cint
n_children::Cint
children::Ptr{Ptr{ArrowSchema}}
dictionary::Ptr{ArrowSchema}
release::Ptr{Cvoid}
private_data::Ptr{Cvoid}
end
struct ArrowArray
length::Cint
null_count::Cint
offset::Cint
n_buffers::Cint
n_children::Cint
buffers::Ptr{Ptr{Cvoid}}
children::Ptr{Ptr{ArrowArray}}
dictionary::Ptr{ArrowArray}
release::Ptr{Cvoid}
private_data::Ptr{Cvoid}
end
##
df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
##
c_schema_py = ffi.new("struct ArrowSchema*")
c_schema_ptr = ffi.cast("uintptr_t", c_schema_py).__int__()
c_batch_py = ffi.new("struct ArrowArray*")
c_batch_ptr = ffi.cast("uintptr_t", c_batch_py).__int__()
##
rb.schema._export_to_c(c_schema_ptr)
rb._export_to_c(c_batch_ptr)
##
c_schema_jl = unsafe_load(convert(Ptr{ArrowSchema}, c_schema_ptr))
@assert c_schema_jl.n_children == c_batch_py.n_children
On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will)
that would actually be useful for 99% of the cases. As a concrete example, if we can generally re-use a pyarrow.Table
construct, it would make 99% use case possible because almost everything interfaces to this object in python land.
ugh.... the unwrap_pyarrow_table
is in Cython, so dumb
I have re-kindled hope for this, from https://arrow.apache.org/docs/python/ipc.html
if we replace sink
with a Julia buffer does that work? @quinnj I think this is the IPC stream chunk (of bytes) you mentioned earlier, that supposedly Julia can just
Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len))
around?
https://gist.github.com/Moelf/de9e6be8575ed5a0399e04637fb935cf
with the last gist I posted, it's practically viable, with one open question being how can we minimize the actual bytes movement in this call
julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
writer.write_batch(batch)
end;
essentially, the question is if it's possible to get the ipc bytes for a pyarrow batch:
Python RecordBatch:
pyarrow.RecordBatch
without copying them through a Julia IOBuffer