
Re-use PyArrow memory via PyCall

Moelf opened this issue 3 years ago • 9 comments

Hi @quinnj, thank you for being willing to consider this wild attempt! The only packages you need to reproduce this are pyarrow and awkward-1.0 on the Python side and PyCall.jl on the Julia side.

Create example arr:

julia> using PyCall

julia> ak = pyimport("awkward");

julia> arr = ak.Array(py"[[1,2,3], [], [4,5]]")
PyObject <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

julia> arr.layout
PyObject <ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x0000037ff330"/></offsets>
    <content><NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x000003a256a0"/></content>
</ListOffsetArray64>

Then you can get a pyarrow object via:

julia> arr_arrow = ak.to_arrow(arr)
PyObject <pyarrow.lib.ListArray object at 0x7f2144343048>
...

julia> @time [Int64[x...] for x in arr]
  0.034800 seconds (38.72 k allocations: 2.031 MiB)
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 []
 [4, 5]

Currently the fastest / least-copy method of re-using it has been:

function view_ak(arr)
    # PyArray wraps the underlying numpy buffers without copying
    c = PyArray(arr.layout."content")
    o = PyArray(arr.layout."offsets")
    # offsets are 0-based; each (o[i], o[i+1]) pair delimits one sublist in the content buffer
    @views [c[o[i]+1:o[i+1]] for i in 1:length(o)-1]
end

julia> @time view_ak(arr)
  0.000089 seconds (37 allocations: 1.609 KiB)
3-element Array{SubArray{Int64,1,PyArray{Int64,1},Tuple{UnitRange{Int64}},false},1}:
 [1, 2, 3]
 0-element view(::PyArray{Int64,1}, 4:3) with eltype Int64
 [4, 5]

Moelf • Dec 29 '20 01:12

This looks super interesting. I've been working on finding a way to reuse C++ memory in Julia, but I've been running into some roadblocks. Once we have this, it should be possible to use all of Arrow C++'s functionality, including their crazy fast Parquet reader and Gandiva, and then build a very performant Arrow-based DataFrame library on top of it.

@Moelf Are you familiar with C++?

sa- • Dec 29 '20 15:12

Sorry for the slow response here, but here's one way we could convert between the awkward array and a Julia array:

julia> off = Arrow.Offsets(UInt8[], arr.layout.offsets)
3-element Arrow.Offsets{Int64}:
 (1, 3)
 (4, 3)
 (4, 5)

julia> @time list = Arrow.List{Vector{Int64}, Int64, typeof(arr.layout.content)}(UInt8[], Arrow.ValidityBitmap(UInt8[], 1, 0, 0), off, arr.layout.content, arr.__len__(), nothing)
  0.000102 seconds (64 allocations: 3.422 KiB)
3-element Arrow.List{Vector{Int64}, Int64, Vector{Int64}}:
 [1, 2, 3]
 0-element view(::Vector{Int64}, 4:3) with eltype Int64
 [4, 5]

I'm by no means a PyCall.jl expert, so it's unclear to me if, when we do arr.layout.offsets, a copy of the data is made; judging from the allocations in the @time invocation, I'd guess that a copy is being made. There's perhaps a PyCall.jl incantation to avoid making a copy, which would be ideal. Otherwise, perhaps there's a way to get an actual pointer to the awkward offsets/content arrays which we could unsafe_wrap to avoid allocating.
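For what it's worth, wrapping the attribute with PyArray (as view_ak above does) is the usual PyCall incantation to share the numpy buffer instead of letting PyCall convert it to a fresh Vector{Int64}; and if a raw pointer is wanted for unsafe_wrap, numpy can hand one out. A hedged, untested sketch, assuming awkward's layout nodes can be viewed with np.asarray without copying:

using PyCall
np = pyimport("numpy")

# zero-copy wrappers (the same trick view_ak uses): PyArray shares the buffer with Python
o = PyArray(arr.layout."offsets")
c = PyArray(arr.layout."content")

# raw-pointer route: keep the numpy view as a PyObject so we can reach .ctypes,
# then wrap its memory directly; the result aliases Python-owned memory, so
# `c_np` (and `arr`) must stay alive for as long as `c_vec` is used
c_np = pycall(np.asarray, PyObject, arr.layout."content")
c_ptr = Ptr{Int64}(c_np.ctypes.data)
c_vec = unsafe_wrap(Vector{Int64}, c_ptr, Int(c_np.size))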

quinnj • Apr 14 '21 05:04

In general though, it's going to be, IMO, practically impossible to try to re-use arrow memory at the individual column/array level. There are too many factors that would complicate things. On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will) is a main use-case for the arrow format. So if you had an arrow IPC stream written to a memory buffer in C++ and were able to pass the pointer + length to Julia, we'd be able to do Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len)) and that should work just fine.
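Purely to illustrate the shape of that hand-off, a hedged sketch (the shim library and function name here are hypothetical):

using Arrow

# hypothetical C/C++ shim that serializes a table into an in-memory Arrow IPC
# stream and hands back the buffer's pointer and length
len = Ref{Csize_t}(0)
ptr = ccall((:table_to_ipc_buffer, "libmyshim"), Ptr{UInt8}, (Ref{Csize_t},), len)

# wrap the foreign memory without copying and let Arrow.jl parse the IPC stream;
# the bytes stay owned by the C++ side, so keep that buffer alive while `table` is in use
bytes = unsafe_wrap(Vector{UInt8}, ptr, len[])
table = Arrow.Table(bytes)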

quinnj • Apr 14 '21 05:04

Hey @quinnj ,

Thanks for responding to the issue! I've actually looked into this a bit more, and it looks like implementing the C Data Interface would solve this for us, since that is exactly its main use case: https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/

That post shows an example of reusing Arrow memory between R and Python.

I'm happy to contribute as well if I can get some pointers on where to start

sa- • Apr 15 '21 06:04

I've been able to create Julia struct for the C Data Interface using Clang.jl, but I'm a bit confused about accessing the C Data Interface on the python side. @Moelf would you know how to do this?

sa- • Apr 15 '21 06:04

I posted this snippet in Slack, but I'll post it here for posterity too. I've made some progress with the C Data Interface, but I'm stuck at converting the C pointer to a Julia struct. I can't quite figure out what I'm doing wrong, so I would appreciate it if someone had a look.

These are the resources I used to build this snippet:

  • https://arrow.apache.org/docs/format/CDataInterface.html
  • https://gist.github.com/wesm/d48908018c4b7a0d9789a31d10caf525 (click "Display the rendered blob" on the top right)

#ENV["PYTHON"] = "/usr/local/opt/[email protected]/bin/python3"
#using Pkg
#Pkg.build("PyCall")
using PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
ffi = pyimport("pyarrow.cffi").ffi
##
const ARROW_FLAG_DICTIONARY_ORDERED = 1
const ARROW_FLAG_NULLABLE = 2
const ARROW_FLAG_MAP_KEYS_SORTED = 4

struct ArrowSchema
    format::Cstring
    name::Cstring
    metadata::Cstring              # binary metadata; not necessarily NUL-terminated
    flags::Int64                   # int64_t in the C Data Interface spec
    n_children::Int64              # int64_t
    children::Ptr{Ptr{ArrowSchema}}
    dictionary::Ptr{ArrowSchema}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end

struct ArrowArray
    length::Int64                  # these five counters are all int64_t in the spec
    null_count::Int64
    offset::Int64
    n_buffers::Int64
    n_children::Int64
    buffers::Ptr{Ptr{Cvoid}}
    children::Ptr{Ptr{ArrowArray}}
    dictionary::Ptr{ArrowArray}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end
##

df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
##
c_schema_py = ffi.new("struct ArrowSchema*")
c_schema_ptr = ffi.cast("uintptr_t", c_schema_py).__int__()
c_batch_py = ffi.new("struct ArrowArray*")
c_batch_ptr = ffi.cast("uintptr_t", c_batch_py).__int__()
##
rb.schema._export_to_c(c_schema_ptr)
rb._export_to_c(c_batch_ptr)
##
c_schema_jl = unsafe_load(convert(Ptr{ArrowSchema}, c_schema_ptr))
@assert c_schema_jl.n_children == c_batch_py.n_children
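If the struct layout above matches the C ABI, the exported batch can presumably be read back the same way; this is an untested sketch of pulling out the first column's values (per the C Data Interface docs, buffer 0 of a primitive array is the validity bitmap and buffer 1 the values, i.e. index 2 with Julia's 1-based unsafe_load):

c_batch_jl = unsafe_load(convert(Ptr{ArrowArray}, c_batch_ptr))
# the record batch is exported as a struct array with one child per column
child_a = unsafe_load(unsafe_load(c_batch_jl.children, 1))
# buffers of a primitive int64 child: [validity bitmap, values]
values_ptr = unsafe_load(child_a.buffers, 2)
col_a = unsafe_wrap(Vector{Int64}, convert(Ptr{Int64}, values_ptr), child_a.length)
# col_a aliases pyarrow-owned memory: keep rb / c_batch_py alive, and the consumer
# is responsible for eventually invoking the release callback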

sa- • Apr 15 '21 23:04

On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will)

that would actually cover 99% of the use cases. As a concrete example, if we can generally re-use a pyarrow.Table, that alone would make the 99% use case possible, because almost everything in Python land interfaces with that object.

ugh... unwrap_pyarrow_table is implemented in Cython, so dumb

Moelf • May 23 '21 17:05

I have rekindled hope for this, from https://arrow.apache.org/docs/python/ipc.html

If we replace the sink with a Julia buffer, does that work? @quinnj, I think this is the IPC stream chunk (of bytes) you mentioned earlier, which supposedly Julia can just read with

Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len))
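For reference, a minimal sketch of that idea, assuming PyCall's conversion of Julia IO streams into Python file-like objects is enough for pyarrow's stream writer, and assuming `batch` is some pyarrow.RecordBatch (one copy still happens at the write boundary):

using PyCall, Arrow
pa = pyimport("pyarrow")

io = IOBuffer()
jl_sink = PyObject(io)     # PyCall exposes a Julia IO stream to Python as a file-like object

# `batch` is a pyarrow.RecordBatch, e.g. the `rb` built in the earlier snippet
pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
    writer.write_batch(batch)
end

table = Arrow.Table(take!(io))   # the IPC bytes were copied once, from Python into the IOBuffer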

Moelf • Aug 31 '22 01:08

https://gist.github.com/Moelf/de9e6be8575ed5a0399e04637fb935cf

Moelf • Aug 31 '22 02:08

With the last gist I posted, this is practically viable; the one open question is how we can minimize the actual byte movement in this call:

julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
               writer.write_batch(batch)
           end;

essentially, the question is whether it's possible to get the IPC bytes of a pyarrow.RecordBatch without copying them through a Julia IOBuffer
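One hedged way to avoid routing the bytes through a Julia IOBuffer might be to let pyarrow write into its own in-memory sink and then wrap that memory directly; this untested sketch reuses `pa` and `batch` from above and relies on pyarrow.Buffer exposing its address and size:

using PyCall, Arrow
pa = pyimport("pyarrow")

sink = pa.BufferOutputStream()              # pyarrow-owned, resizable in-memory sink
pywith(pa.ipc.new_stream(sink, batch.schema)) do writer
    writer.write_batch(batch)
end
buf = sink.getvalue()                       # a pyarrow.Buffer holding the IPC stream

# wrap the Python-owned memory without copying; `buf` must stay referenced while `table` is used
bytes = unsafe_wrap(Vector{UInt8}, Ptr{UInt8}(buf.address), buf.size)
table = Arrow.Table(bytes)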

Moelf • Jan 22 '23 21:01