arrow2 icon indicating copy to clipboard operation
arrow2 copied to clipboard

Column with empty dictionary but contains values causes panic

Open TurnOfACard opened this issue 2 years ago • 2 comments

I found an issue where a small dictionary offset (length of 1) in arrow2::io::parquet::read::deserialize::binary::basic::State::OptionalDictionary causes the parquet reader to panic.

I imagine restricting the slice bound is a fix. In general, there doesn't appear to be a way to propagate errors without some API changes:

  • the function the code is located in (extend_from_state) does not return a Result
  • either does its caller: extend_from_new_page

I assume that we should modify the return values of extend_from_state and extend_from_new_page?

https://github.com/jorgecarleitao/arrow2/blob/9a386637ccf36355e593bb04407cf31382532914/src/io/parquet/read/deserialize/binary/basic.rs#L331-L340

TurnOfACard avatar May 04 '22 23:05 TurnOfACard

Thanks for the report!

The invariant is that dict_offsets.len() equals page.num_values() + 1, from here. So, for offset.len() == 1, the dict page is empty? In that case the values page should also also be empty?

Do you have the parquet file available?

jorgecarleitao avatar May 05 '22 05:05 jorgecarleitao

Ah, it seems that helps explain part of it.

Long story short, I was using a file produced by arrow2. In a single column, the first PageType was PageType::DictionaryPage, and the second was PageType::DataPageV2. For the DictionaryPage, the compressed_page_size was == 9, but uncompressed_page_size was == 0 (which may be another problem). Encoding was RleDictionary.

As a result of this, parquet2::page::page_dict::deserialize was passed an empty buf with num_values == 0 (plausible if the uncompressed_page_size is zero) it will return (vec![], vec![0]).

This is ultimately passed to the code below, which will always try to slice dict_values which potentially may be zero-length.

https://github.com/jorgecarleitao/arrow2/blob/9a386637ccf36355e593bb04407cf31382532914/src/io/parquet/read/deserialize/binary/basic.rs#L331-L340

I am not sure why parquet2 exported a page header with a compressed_page_size of 9, and an uncompressed_page_size of 0, I will take a look at the code that produced it.

If I am interpreting this correctly, there is no guard for a malformed parquet file with an empty dictionary but provided values.

TurnOfACard avatar May 05 '22 12:05 TurnOfACard