Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved

Open complyue opened this issue 2 years ago • 0 comments

https://github.com/apache/arrow-julia/blob/614fce0a5d7db8fee078be32690c5220848538e2/src/table.jl#L276-L293

From the code above, I see that record batches are parsed in parallel when the Julia runtime has multithreading enabled, which is great, since decompression in particular can be a computationally intensive workload.

But with this implementation, the order in which the batches were originally written is not guaranteed to be preserved, which I think is not ideal. I'm not sure what the Arrow spec says about this, but I'm dealing with time series data recorded batch by batch, where the order matters a lot.

I'd like to draft a PR that preserves batch order. As I start tinkering with the codebase, I'm filing this issue to ask for your opinions on it.
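
For illustration, one way to keep the parallel parsing while preserving batch order is to spawn one task per batch and then fetch the tasks in the order they were created. This is only a minimal sketch of the idea, not the package's actual code: `decode_batch` and `decode_in_order` are hypothetical names, and the batches are plain integer vectors standing in for real record batches.

```julia
# Hypothetical stand-in for the per-batch work (e.g. decompression).
# The random sleep makes tasks finish in an unpredictable order.
function decode_batch(raw::Vector{Int})
    sleep(rand() * 0.01)
    return raw .* 2
end

# Spawn one task per batch, then fetch the tasks in creation order,
# so the results line up with the input order no matter which task
# happens to finish first.
function decode_in_order(raw_batches)
    tasks = [Threads.@spawn decode_batch(b) for b in raw_batches]
    return [fetch(t) for t in tasks]
end

batches = [[1, 2], [3, 4], [5, 6]]
decoded = decode_in_order(batches)
```

The work still runs concurrently when Julia is started with more than one thread; only the collection step is sequential, and it just waits on tasks that are already running.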

(Btw, I'm also working on a PR for #293, which is orthogonal in functionality but closely related in implementation details. I think two separate PRs would be clearer for review and release purposes, but if you can accept a single PR addressing both, it would be a lot easier for me, given that I'm not fluent in git rebasing and related skills.)

complyue, Mar 04 '22 08:03