Potential bug in reading lists from avro?

Open shaeqahmed opened this issue 1 year ago • 0 comments

Line of code: https://github.com/jorgecarleitao/arrow2/blob/c615095dbfb930a247c23423e36118843f71e1d2/src/io/avro/read/deserialize.rs#L136

Feel free to close this issue if my understanding is incorrect. Going by the Java Avro implementation, array values are stored and read as a series of blocks, and blocks are read consecutively until one with an item count of 0 is encountered. (https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/io/Decoder.java#L203)

In the Avro deserialisation logic for reading lists, it looks like we call `try_push_valid` once for each block in a list item, rather than once for the whole item (outside of the loop). This looks incorrect to me: blocks are an Avro encoding detail, whereas our validity bitmap tracks Arrow values (e.g. one list item), so a list item stored as several blocks would presumably end up with several validity entries instead of one.
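To make the expected behavior concrete, here is a minimal, self-contained sketch of block-wise array decoding for a list of Avro `long` values. It is not the arrow2 code: `ListBuilder`, `read_list` and `read_long` are hypothetical names made up for illustration, and the validity bitmap is stood in for by a plain `Vec<bool>`. The point is only that the validity entry is pushed once, after the block loop.

```rust
/// Hypothetical stand-in for a mutable Arrow list array (not the arrow2 API).
#[derive(Default)]
struct ListBuilder {
    values: Vec<i64>,    // flattened child values
    offsets: Vec<usize>, // end offset of each list item
    validity: Vec<bool>, // one entry per list item
}

/// Decode one Avro `long` (zigzag varint) from the front of `buf`, advancing it.
fn read_long(buf: &mut &[u8]) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let (&byte, rest) = buf.split_first()?;
        *buf = rest;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
        if shift > 63 {
            return None; // more than 10 bytes: malformed varint
        }
    }
    // Undo the zigzag encoding.
    Some(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// Read one Avro array of longs, which may span several blocks.
fn read_list(buf: &mut &[u8], out: &mut ListBuilder) -> Option<()> {
    loop {
        let mut count = read_long(buf)?;
        if count == 0 {
            break; // a block with count 0 terminates the array
        }
        if count < 0 {
            // A negative count is followed by the block size in bytes,
            // which this sketch reads and ignores.
            let _block_bytes = read_long(buf)?;
            count = count.checked_neg()?;
        }
        for _ in 0..count {
            out.values.push(read_long(buf)?);
        }
    }
    // One offset and one validity entry per list item, regardless of
    // how many blocks the item was split into.
    out.offsets.push(out.values.len());
    out.validity.push(true);
    Some(())
}
```

A multi-block list can also be hand-encoded for a test without going through an Avro writer: the bytes `04 02 04 02 06 00` encode the list `[1, 2, 3]` as a block of two items, a block of one item, and the terminating zero count.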

I wasn't able to easily create an Avro list that is stored as multiple blocks in order to validate my assumptions, but I wanted to open this issue in case it is an obvious bug. Thank you.

shaeqahmed, Sep 18 '22 18:09