parquet2
Deserialisation Error for Nested Types
Hi, I am hitting an error while deserialising a parquet file with nested types. Is there an implementation for the remaining cases in the following code snippet (taken from the examples section)?
The code below executes when page.descriptor.max_rep_level > 0. Is there a primitive_nested implementation for byte arrays?
    _ => match page.dictionary_page() {
        None => match physical_type {
            PhysicalType::Int64 => Ok(primitive_nested::page_to_array::<i64>(page)?),
            _ => {
                todo!()
            }
        },
        Some(_) => match physical_type {
            PhysicalType::Int64 => Ok(primitive_nested::page_dict_to_array::<i64>(page)?),
            _ => {
                todo!()
            }
        },
    },
Hey!
I know of two implementations: one in arrow2 and one under tests/.
The general idea is:
- split the page buffer into rep, def and values
- attach 3 decoders, one for rep, one for def, one for values: the rep and def should be HybridRleDecoder; the values should be whatever encoding is being used for that (the nested logic is independent of the primitive type). Something like:

      let (rep_levels, def_levels, _) = split_buffer(page);
      let max_rep_level = page.descriptor.max_rep_level;
      let max_def_level = page.descriptor.max_def_level;
      let reps = HybridRleDecoder::new(rep_levels, get_bit_width(max_rep_level), page.num_values());
      let defs = HybridRleDecoder::new(def_levels, get_bit_width(max_def_level), page.num_values());
      let iter = reps.zip(defs);

  (see https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L271)
- advance the iterators and reconstruct the nested type according to the dremel logic. This depends on how the specific format stores nested types (e.g. Vec<Vec<i32>> vs Vec<i32> + offsets). See e.g. https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L391 for how arrow2 does it.
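To make the last step concrete, here is a minimal sketch of the reconstruction for the simplest nested case. It is not code from parquet2 or arrow2: `ListColumn` and `assemble_list` are hypothetical names, and the sketch assumes a required list with optional items (max_rep_level = 1, max_def_level = 2) and levels that have already been decoded into plain integers. The values iterator can be of any type, which is why the same logic would apply to byte arrays as well as i64.

```rust
/// Hypothetical "flat items + offsets" layout for a list column.
#[derive(Debug, PartialEq)]
pub struct ListColumn<T> {
    /// `offsets[i]..offsets[i + 1]` is the range of `items` that belongs to list `i`.
    pub offsets: Vec<usize>,
    /// Flattened items; `None` marks a null item.
    pub items: Vec<Option<T>>,
}

/// Rebuilds a list column from (rep, def) pairs and the decoded values.
///
/// Assumption: max_rep_level = 1 and max_def_level = 2, so that
/// def 0 = empty list, def 1 = null item, def 2 = present item.
/// Returns `None` if the values iterator runs out too early.
pub fn assemble_list<T>(
    levels: impl Iterator<Item = (u32, u32)>,
    mut values: impl Iterator<Item = T>,
) -> Option<ListColumn<T>> {
    let mut offsets = vec![0];
    let mut items = Vec::new();
    let mut started = false;
    for (rep, def) in levels {
        if rep == 0 {
            // rep == 0 starts a new top-level list, so close the previous one.
            if started {
                offsets.push(items.len());
            }
            started = true;
        }
        match def {
            0 => {}                // empty list: no item is appended
            1 => items.push(None), // null item: nothing is read from `values`
            _ => items.push(Some(values.next()?)), // present item: consume one value
        }
    }
    if started {
        // Close the last list.
        offsets.push(items.len());
    }
    Some(ListColumn { offsets, items })
}
```

In a real reader, `levels` would be the `reps.zip(defs)` iterator from above and `values` would be the decoder for the page's value encoding. Deeper nesting or nullable lists need more level cases than this sketch handles.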
One important thing to remember is that the length of the rep and def iterators (page.num_values) is not the number of values in the values iterator. For example:
    # [[0, None], [], [10]]
    reps, defs = list(
        zip(
            *[
                (0, 2),  # 0
                (1, 1),  # None
                (0, 0),  # [] (empty list)
                (0, 2),  # 10
            ]
        )
    )
The values in this case contain 2 entries (0 and 10), while the rep and def levels contain 4 each.
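The two counts can be derived from the levels alone. The sketch below is illustrative only (`num_present_values` and `num_rows` are hypothetical helpers, not parquet2 functions) and assumes the levels are already decoded into slices: a value is stored only where the definition level equals the maximum, and a new top-level row starts wherever the repetition level is 0.

```rust
/// Number of entries expected in the values stream:
/// one per definition level equal to `max_def_level`.
pub fn num_present_values(defs: &[u32], max_def_level: u32) -> usize {
    defs.iter().filter(|&&d| d == max_def_level).count()
}

/// Number of top-level rows: one per repetition level equal to 0.
pub fn num_rows(reps: &[u32]) -> usize {
    reps.iter().filter(|&&r| r == 0).count()
}
```

For the example above, with reps = [0, 1, 0, 0], defs = [2, 1, 0, 2] and a maximum definition level of 2, this gives 2 present values and 3 rows, even though each level stream has 4 entries.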
Hey, thanks for the response. I was referring to the one in the tests: https://github.com/jorgecarleitao/parquet2/blob/fa6fa3ca3848c29d8efa80fbf42ee6a5a58cb077/tests/it/read/mod.rs. Is it possible to complete the todo placeholder you have in the tests, or to point me to any reference code so I can complete the todo part myself?