Streaming: Pyarrow is 15 times faster than Arrow.jl

Open bilelomrani1 opened this issue 3 years ago • 0 comments

I have an .arrow file generated with pyarrow whose schema is the following:

input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
  child 0, open: fixed_size_list<item: float>[512]
      child 0, item: float
  child 1, high: fixed_size_list<item: float>[512]
      child 0, item: float
  child 2, low: fixed_size_list<item: float>[512]
      child 0, item: float
  child 3, close: fixed_size_list<item: float>[512]
      child 0, item: float

With pyarrow, I load and iterate over records with the following:

import pyarrow as pa

# Memory-map the file and read every record batch into a single table.
with pa.memory_map('arraydata.arrow', 'r') as source:
    loaded_arrays = pa.ipc.open_file(source).read_all()

a = 0
for batch in loaded_arrays.to_batches():
    for input_candles in batch["input"]:
        a += 1

Iterating over my example file (~10,000 rows) takes 210 ms.

In Julia, I load and iterate over the same file with the following:

using Arrow, BenchmarkTools

stream = Arrow.Stream("./arraydata.arrow")

function bench_iteration(stream)
    a = 0
    for batch in stream
        for sample in batch.input
            a += 1
        end
    end
end

@btime bench_iteration($stream)

3.169 s (25272097 allocations: 1.70 GiB)

Iterating over the records takes about 15 times longer with Arrow.jl (3.169 s versus 210 ms). Am I doing something wrong?

Sep 04 '22 09:09 bilelomrani1