Deserialisation Error for Nested Types

Open sundeepks opened this issue 2 years ago • 2 comments

Hi, I am getting an error while deserialising a Parquet file with nested types. Is there an implementation for the following code snippet (taken from the examples section)?

The code below executes when page.descriptor.max_rep_level > 0. Is there a primitive_nested implementation for byte arrays?



_ => match page.dictionary_page() {
    None => match physical_type {
        PhysicalType::Int64 => Ok(primitive_nested::page_to_array::<i64>(page)?),
        _ => {
            todo!()
        }
    },
    Some(_) => match physical_type {
        PhysicalType::Int64 => Ok(primitive_nested::page_dict_to_array::<i64>(page)?),
        _ => {
            todo!()
        }
    },
},
 

sundeepks avatar May 01 '22 04:05 sundeepks

Hey!

I know of 2: one in arrow2 and one under tests/.

The general idea is:

  1. split the page buffer into its three parts: repetition levels (rep), definition levels (def), and values

  2. attach three decoders, one each for rep, def, and values. The rep and def decoders should be HybridRleDecoder; the values decoder should match whatever encoding the page uses (the nested logic is independent of the primitive type). Something like:

    let (rep_levels, def_levels, _) = split_buffer(page);
    
    let max_rep_level = page.descriptor.max_rep_level;
    let max_def_level = page.descriptor.max_def_level;
    
    let reps =
        HybridRleDecoder::new(rep_levels, get_bit_width(max_rep_level), page.num_values());
    let defs =
        HybridRleDecoder::new(def_levels, get_bit_width(max_def_level), page.num_values());
    
    let iter = reps.zip(defs);
    

    (see https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L271)

  3. advance the iterators and reconstruct the nested type according to the Dremel logic. How to do this depends on how the target format stores nested types (e.g. Vec<Vec<i32>> vs Vec<i32> + offsets). See e.g. https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L391 for how arrow2 does it.
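The three steps above can be sketched for the simplest nested case. This is a minimal sketch, not the arrow2 or parquet2 implementation: `assemble_lists` is a hypothetical helper, the levels are assumed to be already decoded into `(rep, def)` pairs, and the column is assumed to be a required list of optional `i64` items (max_rep_level = 1, max_def_level = 2).

```rust
/// Hypothetical helper: rebuilds a single-level list column from decoded
/// (rep, def) pairs plus the iterator of non-null values.
///
/// Assumed schema: required list of optional i64 items, so
///   def == 0 => the list is empty (no item, no value)
///   def == 1 => a null item (no value is stored)
///   def == 2 => a non-null item (one value is consumed)
/// and rep == 0 starts a new list while rep == 1 continues the current one.
fn assemble_lists(
    levels: impl Iterator<Item = (u32, u32)>,
    mut values: impl Iterator<Item = i64>,
) -> Vec<Vec<Option<i64>>> {
    let mut rows: Vec<Vec<Option<i64>>> = Vec::new();
    for (rep, def) in levels {
        if rep == 0 {
            // A repetition level of 0 marks the start of a new top-level row.
            rows.push(Vec::new());
        }
        let row = rows.last_mut().expect("the first repetition level must be 0");
        match def {
            0 => {} // empty list: nothing to push
            1 => row.push(None),
            _ => row.push(Some(
                values.next().expect("fewer values than the def levels imply"),
            )),
        }
    }
    rows
}
```

The loop never inspects the value type, so under the same assumptions it should carry over to byte arrays by swapping the values iterator for one that yields byte slices.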

One important thing to remember is that the length of the rep and def iterators (page.num_values) is not the number of values in the values iterator. For example:

# [[0, None], [], [10]]
reps, defs = list(
    zip(
        *[
            (0, 2),  # 0
            (1, 1),  # None
            (0, 0),  # [] (empty list)
            (0, 2),  # 10
        ]
    )
)

The values in this case contain 2 entries (0 and 10), while the rep and def levels contain 4 entries each.
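The mismatch between the two lengths can be checked directly: a value is stored only where the definition level equals the column's maximum definition level. The helper below is a hypothetical illustration of that rule (the name `num_stored_values` is not part of parquet2), assuming the definition levels have already been decoded into a slice.

```rust
/// Hypothetical helper: counts how many entries the values iterator is
/// expected to yield, given the decoded definition levels of a page.
/// Only slots whose def level equals max_def_level carry a stored value.
fn num_stored_values(def_levels: &[u32], max_def_level: u32) -> usize {
    def_levels
        .iter()
        .filter(|&&def| def == max_def_level)
        .count()
}
```

For the example above, the definition levels are 2, 1, 0, 2 with a maximum of 2, so the count is 2 even though there are 4 level entries.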

jorgecarleitao avatar May 01 '22 05:05 jorgecarleitao

Hey, thanks for the response. I was referring to the one in the tests: https://github.com/jorgecarleitao/parquet2/blob/fa6fa3ca3848c29d8efa80fbf42ee6a5a58cb077/tests/it/read/mod.rs. Is it possible to complete the todo placeholder in the tests, or to point me to some reference code, so that I can complete the todo part?

sundeepks avatar May 01 '22 11:05 sundeepks