arrow2 icon indicating copy to clipboard operation
arrow2 copied to clipboard

Added writing FixedSizeList to parquet

Open thesmartwon opened this issue 2 years ago • 2 comments

First contribution, let me know if you'd like me to add anything else.

Closes #1144

thesmartwon avatar Jul 08 '22 00:07 thesmartwon

Codecov Report

Merging #1147 (9e10c2c) into main (e4947cc) will decrease coverage by 0.01%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main    #1147      +/-   ##
==========================================
- Coverage   83.62%   83.61%   -0.02%     
==========================================
  Files         366      366              
  Lines       35906    35910       +4     
==========================================
- Hits        30028    30025       -3     
- Misses       5878     5885       +7     
Impacted Files Coverage Δ
src/io/parquet/write/pages.rs 94.92% <0.00%> (-0.93%) :arrow_down:
src/array/utf8/mod.rs 85.62% <0.00%> (-0.96%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e4947cc...9e10c2c. Read the comment docs.

codecov[bot] avatar Jul 08 '22 05:07 codecov[bot]

Hi @thesmartwon , thanks a lot for the PR!

Note that we need to be a bit careful with the fixed-sized list when it comes to nulls - given a fixed-size list [[1, 2], None, [3, 4]], arrow's values are [1, 2, 0, 0, 3, 4], while parquet expects the encoded values to be [1, 2, 3, 4] - i.e. it is a bit more tricky than a list array, unfortunately.

Two suggestions:

  1. in this PR, add generated_nested at https://github.com/jorgecarleitao/arrow2/blob/main/arrow-parquet-integration-testing/main.py#L68 and try to run the test to check that pyarrow can read the written file
  2. as a separate PR, implement cast(FixedSizeList, List<i32>) around here. This is imo a good exercise to understand the complexity of writing FixedSizeList to parquet, since even arrow(FixedSizeList) -> arrow(List) is tricky as it may require some form of "packing".

jorgecarleitao avatar Jul 09 '22 16:07 jorgecarleitao

Hey @jorgecarleitao, curious about the state of this PR and how I can help move it forward.

  1. as a separate PR, implement cast(FixedSizeList, List<i32>)

It looks like that was implemented in https://github.com/jorgecarleitao/arrow2/pull/1281?

kylebarron avatar Feb 06 '23 19:02 kylebarron

Took a quick look at this. First, I was able to check that this PR does in fact create invalid files. The integration test fails with

  File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Received invalid number of bytes (corrupt data page?)

I can conceptually understand how the packing needs to happen, but I'm a little lost how to do this with a dynamic child type. I was thinking of essentially

  1. provision output variable size list array given fixed size list's array dimensions
  2. iterate over values of the fixed size list, filling the variable size list whenever the values are non-null
  3. Shrink the new array to whatever length was actually filled

The problem is that to do that I need to create a strongly typed Mutable*Array, right? So is there a way to do this apart from a huge match statement on the data type?


Now that I think about it, could we re-use the take kernel here? Essentially the newly-packed values array would be a take with all the valid values? Then a first step would be to implement take on FixedSizeList?


I was looking at #1281 and noticed that the implementation of fixed size list -> ListArray doesn't actually do any repacking. I thought this meant the array was invalid but on second look, the spec does in fact allow this

Similar to the layout of variable-size binary, a null value may correspond to a non-empty segment in the child array. When this is true, the content of the corresponding segment can be arbitrary.

kylebarron avatar Feb 07 '23 04:02 kylebarron

I'm no longer using Arrow and it looks like this sparked enough discussion to fix the issue. Closing.

thesmartwon avatar Mar 25 '23 22:03 thesmartwon