velox icon indicating copy to clipboard operation
velox copied to clipboard

Better unit test for parquet in velox

Open qqibrow opened this issue 1 year ago • 5 comments

Description

Hi, we recently started evaluating velox and notice there are issues in parquet reader. We generate some parquet test files used in presto unit test and try read them using native parquet reader in velox. there are lots of test failures:

 parquet_test/testNestedMaps/test_12.parquet
 parquet_test/testNestedMaps/test_10.parquet
 parquet_test/testNestedMaps/test_18.parquet
 parquet_test/testNestedMaps/test_14.parquet
 parquet_test/testNestedMaps/test_16.parquet
 parquet_test/testArrayOfMapOfStruct/test_12.parquet
 parquet_test/testArrayOfMapOfStruct/test_10.parquet
 parquet_test/testArrayOfMapOfStruct/test_18.parquet
 parquet_test/testArrayOfMapOfStruct/test_14.parquet
 parquet_test/testArrayOfMapOfStruct/test_16.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_6.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_4.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_8.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_0.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_2.parquet
 parquet_test/testMapSchemas/test_46.parquet
 parquet_test/testMapSchemas/test_34.parquet
 parquet_test/testMapSchemas/test_12.parquet
 parquet_test/testMapSchemas/test_60.parquet
 parquet_test/testMapSchemas/test_36.parquet
 parquet_test/testMapSchemas/test_62.parquet
 parquet_test/testMapSchemas/test_10.parquet
 parquet_test/testMapSchemas/test_18.parquet
 parquet_test/testMapSchemas/test_32.parquet
 parquet_test/testMapSchemas/test_66.parquet
 parquet_test/testMapSchemas/test_14.parquet
 parquet_test/testMapSchemas/test_58.parquet
 parquet_test/testMapSchemas/test_38.parquet
 parquet_test/testMapSchemas/test_42.parquet
 parquet_test/testMapSchemas/test_30.parquet
 parquet_test/testMapSchemas/test_16.parquet
 parquet_test/testMapSchemas/test_64.parquet
 parquet_test/testMapOfSingleLevelArray/test_12.parquet
 parquet_test/testMapOfSingleLevelArray/test_10.parquet
 parquet_test/testMapOfSingleLevelArray/test_18.parquet
 parquet_test/testMapOfSingleLevelArray/test_14.parquet
 parquet_test/testMapOfSingleLevelArray/test_16.parquet
 parquet_test/testArrayOfMapOfArray/test_12.parquet
 parquet_test/testArrayOfMapOfArray/test_10.parquet
 parquet_test/testArrayOfMapOfArray/test_18.parquet
 parquet_test/testArrayOfMapOfArray/test_14.parquet
 parquet_test/testArrayOfMapOfArray/test_16.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_12.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_10.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_18.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_14.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_16.parquet
 parquet_test/testStructOfMaps/test_12.parquet
 parquet_test/testStructOfMaps/test_10.parquet
 parquet_test/testStructOfMaps/test_18.parquet
 parquet_test/testStructOfMaps/test_14.parquet
 parquet_test/testStructOfMaps/test_16.parquet
 parquet_test/testMapOfArrayKeys/test_12.parquet
 parquet_test/testMapOfArrayKeys/test_10.parquet
 parquet_test/testMapOfArrayKeys/test_18.parquet
 parquet_test/testMapOfArrayKeys/test_14.parquet
 parquet_test/testMapOfArrayKeys/test_16.parquet
 parquet_test/testMapOfStruct/test_12.parquet
 parquet_test/testMapOfStruct/test_10.parquet
 parquet_test/testMapOfStruct/test_18.parquet
 parquet_test/testMapOfStruct/test_14.parquet
 parquet_test/testMapOfStruct/test_16.parquet
 parquet_test/testMap/test_12.parquet
 parquet_test/testMap/test_10.parquet
 parquet_test/testMap/test_18.parquet
 parquet_test/testMap/test_14.parquet
 parquet_test/testMap/test_16.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_12.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_10.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_18.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_14.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_16.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_12.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_10.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_18.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_14.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_16.parquet
 parquet_test/testMapWithNullValues/test_12.parquet
 parquet_test/testMapWithNullValues/test_10.parquet
 parquet_test/testMapWithNullValues/test_18.parquet
 parquet_test/testMapWithNullValues/test_14.parquet
 parquet_test/testMapWithNullValues/test_16.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_12.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_10.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_18.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_14.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_16.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_6.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_4.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_8.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_0.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_2.parquet
 parquet_test/testMapOfArrayValues/test_12.parquet
 parquet_test/testMapOfArrayValues/test_10.parquet
 parquet_test/testMapOfArrayValues/test_18.parquet
 parquet_test/testMapOfArrayValues/test_14.parquet
 parquet_test/testMapOfArrayValues/test_16.parquet
 parquet_test/testArrayOfMaps/test_12.parquet
 parquet_test/testArrayOfMaps/test_10.parquet
 parquet_test/testArrayOfMaps/test_18.parquet
 parquet_test/testArrayOfMaps/test_14.parquet
 parquet_test/testArrayOfMaps/test_16.parquet 

( name testArrayOfMaps means the test files are generated from testArrayOfMaps() in AbstractTestParquetReader )

There are two errors: one is #7002 and the other one is:

E1107 05:44:17.083400 1412506 Exceptions.h:69] Line: ../../velox/dwio/parquet/reader/ParquetReader.cpp:282, Function:getParquetColumnInfo, Expression: schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED (OPTIONAL vs. REPEATED), Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (OPTIONAL vs. REPEATED)
Retriable: False
Expression: schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED
Function: getParquetColumnInfo
File: ../../velox/dwio/parquet/reader/ParquetReader.cpp
Line: 282

for the time being we only tested meta data part not touch data correctness yet.

Can we think of building a better unit test infra for parquet? Is there some unit test best practice we can learn from orc support in velox? e.g, leveraging existing parquet testing in presto project is a good direction. To better test complex types, there is even a customized hive parquet writer (e.g, SingleLevelArraySchemaConverter) helps surfacing issues (this is the reason why #7002 is blocked at first because it's hard to reproduce)

qqibrow avatar Nov 08 '23 23:11 qqibrow

We've implemented testing velox parquet using presto unit test. Here is the mini design doc: https://gist.github.com/qqibrow/689ed97b91cc0b58337be96a86291301

qqibrow avatar Nov 28 '23 23:11 qqibrow

Here are some of the issues we discovered:

add example file to reproduce: https://github.com/facebookincubator/velox/issues/7002 https://github.com/facebookincubator/velox/issues/7617 https://github.com/facebookincubator/velox/issues/7776 https://github.com/facebookincubator/velox/issues/7777 https://github.com/facebookincubator/velox/issues/7778 https://github.com/facebookincubator/velox/issues/7779

qqibrow avatar Nov 28 '23 23:11 qqibrow

Hi @qqibrow , we also encountered data incorrectness when reading Map types, wonder if you also met such bugs, thanks!

waitinfuture avatar Dec 19 '23 09:12 waitinfuture

@qqibrow Let's use this issue just for adding the new Velox unit tests, and use another separate issue for Presto Parquet reader tests for Velox. I created https://github.com/facebookincubator/velox/issues/8102 for that.

yingsu00 avatar Dec 19 '23 10:12 yingsu00

@qqibrow This is a very informative post, thanks! What's the status of Parquet support in Velox from your perspective?

xumingming avatar Mar 28 '24 06:03 xumingming