arrow2
Added writing FixedSizeList to parquet
First contribution, let me know if you'd like me to add anything else.
Closes #1144
Codecov Report
Merging #1147 (9e10c2c) into main (e4947cc) will decrease coverage by 0.01%. The diff coverage is 0.00%.
```
@@            Coverage Diff             @@
##             main    #1147      +/-   ##
==========================================
- Coverage   83.62%   83.61%   -0.02%
==========================================
  Files         366      366
  Lines       35906    35910       +4
==========================================
- Hits        30028    30025       -3
- Misses       5878     5885       +7
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/io/parquet/write/pages.rs | 94.92% <0.00%> (-0.93%) | :arrow_down: |
| src/array/utf8/mod.rs | 85.62% <0.00%> (-0.96%) | :arrow_down: |
Continue to review full report at Codecov.
Powered by Codecov. Last update e4947cc...9e10c2c. Read the comment docs.
Hi @thesmartwon, thanks a lot for the PR!
Note that we need to be a bit careful with fixed-size lists when it comes to nulls: given a fixed-size list `[[1, 2], None, [3, 4]]`, arrow's values are `[1, 2, 0, 0, 3, 4]`, while parquet expects the encoded values to be `[1, 2, 3, 4]` - i.e. it is a bit more tricky than a list array, unfortunately.
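The mismatch above can be sketched in plain Python (no arrow2 or pyarrow involved; `size`, `validity`, and `child_values` are hypothetical stand-ins for the array's buffers):

```python
# Sketch of why FixedSizeList child values cannot be written to parquet as-is:
# the child array keeps (arbitrary) slots for null lists, but parquet only
# stores values for valid lists.
size = 2
validity = [True, False, True]     # logical array: [[1, 2], None, [3, 4]]
child_values = [1, 2, 0, 0, 3, 4]  # arrow keeps slots for the null entry too

# "packing": keep only the segments belonging to valid lists
packed = [
    v
    for i, valid in enumerate(validity) if valid
    for v in child_values[i * size:(i + 1) * size]
]
print(packed)  # [1, 2, 3, 4]
```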
Two suggestions:

- in this PR, add `generated_nested` at https://github.com/jorgecarleitao/arrow2/blob/main/arrow-parquet-integration-testing/main.py#L68 and try to run the test to check that pyarrow can read the written file
- as a separate PR, implement `cast(FixedSizeList, List<i32>)` around here. This is imo a good exercise to understand the complexity of writing `FixedSizeList` to parquet, since even `arrow(FixedSizeList) -> arrow(List)` is tricky as it may require some form of "packing".
Hey @jorgecarleitao, curious about the state of this PR and how I can help move it forward.
> as a separate PR, implement `cast(FixedSizeList, List<i32>)`

It looks like that was implemented in https://github.com/jorgecarleitao/arrow2/pull/1281?
Took a quick look at this. First, I was able to confirm that this PR does in fact create invalid files. The integration test fails with:

```
File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Received invalid number of bytes (corrupt data page?)
```
I can conceptually understand how the packing needs to happen, but I'm a little lost on how to do this with a dynamic child type. I was thinking of essentially:

- provisioning an output variable-size list array given the fixed-size list array's dimensions
- iterating over the values of the fixed-size list, filling the variable-size list whenever the values are non-null
- shrinking the new array to whatever length was actually filled

The problem is that to do that I need to create a strongly typed `Mutable*Array`, right? So is there a way to do this apart from a huge `match` statement on the data type?
Now that I think about it, could we re-use the `take` kernel here? Essentially the newly-packed `values` array would be a `take` with all the valid values? Then a first step would be to implement `take` on `FixedSizeList`?
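The `take`-based idea sidesteps the typed-`MutableArray` problem, since only the indices need to be computed generically. A plain-Python sketch of that index computation (the names and the final list comprehension are stand-ins; a real implementation would pass `take_indices` to arrow2's `take` kernel on the child array):

```python
# Compute "take" indices that pack a FixedSizeList's child values,
# independent of the child type.
size = 2
validity = [True, False, True]     # logical array: [[1, 2], None, [3, 4]]
child_values = [1, 2, 0, 0, 3, 4]

# indices into the child array for every slot of every valid list
take_indices = [
    i * size + j
    for i, valid in enumerate(validity) if valid
    for j in range(size)
]
print(take_indices)  # [0, 1, 4, 5]

# stand-in for take(child_values, take_indices)
packed = [child_values[i] for i in take_indices]
print(packed)        # [1, 2, 3, 4]
```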
I was looking at #1281 and noticed the implementation of fixed size list -> `ListArray`. Relevant here:

> Similar to the layout of variable-size binary, a null value may correspond to a non-empty segment in the child array. When this is true, the content of the corresponding segment can be arbitrary.
I'm no longer using Arrow and it looks like this sparked enough discussion to fix the issue. Closing.