arrow-julia
Streaming: Pyarrow is 15 times faster than Arrow.jl
I have an .arrow file generated with pyarrow whose schema is the following:
input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
  child 0, open: fixed_size_list<item: float>[512]
      child 0, item: float
  child 1, high: fixed_size_list<item: float>[512]
      child 0, item: float
  child 2, low: fixed_size_list<item: float>[512]
      child 0, item: float
  child 3, close: fixed_size_list<item: float>[512]
      child 0, item: float
With pyarrow, I load and iterate over records with the following:
import pyarrow as pa

# Memory-map the file and read all record batches into a table
with pa.memory_map('arraydata.arrow', 'r') as source:
    loaded_arrays = pa.ipc.open_file(source).read_all()
a = 0
for batch in loaded_arrays.to_batches():
    for input_candles in batch["input"]:
        a += 1
Iterating over my example file (~10,000 records) takes 210 ms.
In julia, I load and iterate over the same file with the following:
using Arrow, BenchmarkTools

stream = Arrow.Stream("./arraydata.arrow")

# Iterate over every record of the `input` column in every batch
function bench_iteration(stream)
    a = 0
    for batch in stream
        for sample in batch.input
            a += 1
        end
    end
end

@btime bench_iteration($stream)
  3.169 s (25272097 allocations: 1.70 GiB)
Iterating over the records takes about 15 times longer with Arrow.jl (3.169 s vs. 210 ms). Am I doing something wrong?