velox
velox copied to clipboard
Parquet reader should use parent type for nested types
When traversing the Parquet metadat and evaluating a placeholder node (node created to represent "array" and "map") currently, it decides whether it is array(list) or map based on the number of children. But seems for certain Parquet files, this is not necessarily accurate. We think the better check is to check on the parent of the placeholder node, and see if what is the type of the parent.
Deploy Preview for meta-velox canceled.
Name | Link |
---|---|
Latest commit | 6f8c8c597c5e987b84f4bc73db388026e7ddedf8 |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/661eb8612509bd0008aaf45e |
Hi @chliang71!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed
. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
As discussed offline, I created https://github.com/facebookincubator/velox/issues/7929. @chliang71 will just add necessary tests to this PR, and will work on https://github.com/facebookincubator/velox/issues/7929 in a separate PR.
As discussed, will file followup PRs on the additonal changes. For this PR, added a unit test and addressed comments.
The CI log shows:
The following tests FAILED: 209 - velox_expression_test (Failed) 296 - velox_exec_test (Timeout)
However in my local test,
- velox_expression_test was running fine and successfully finished
- velox_exec_test finished in ~1 hour but got one test fail with this error:
Expression: fd_ >= 0 (-1 vs. 0) open failure in LocalReadFile constructor, -1 /path/to/nowhere.orc No such file or directory.
I have tried multiple times both locally and re-trigger test on this PR but the outcome has been the same. Neither issue seems to related to my change in this PR though. I'm inclined to there is some issue with the tests themselves @yingsu00 any insights on how should we proceed?
@chliang71 Can you please rebase? The Minio server installation was already fixed in trunk.
I had to remove the "_tuple" name check after which the tests seemed to pass. I guess the column name based check still needs to be revisited.
@yingsu00 mind to review the diff? Thanks
friendly ping @yingsu00 @majetideepak mind to review? Thanks
@chliang71 There is a red CI, can you please rebase and push?
I rebased and pushed again. @yingsu00 @xiaoxmeng @majetideepak
@chliang71 Will this fix the issue raised here https://github.com/facebookincubator/velox/issues/9242#issuecomment-2045352283? Thanks!
@chliang71 Can such complex situations be handled?
message hive_schema {
optional group test (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated group array (LIST) {
repeated int32 array;
}
}
}
}
}
}
}
}
}
}
}
}
}
Use the following method https://github.com/facebookincubator/velox/issues/7478 to test multi-level nested array types. This fix cannot recognize multi-level nested array types.
test case:testSingleLevelSchemaNestedArrays
presto type:array(array(array(integer)))
velox type:ARRAY<INTEGER>
Thanks for taking a look @majetideepak , @8dukongjian . Just to clarify, this change is not to address all the backward compatibility cases. This change is to address a bug in current impl, when inferring child type. We are approaching all the backward compatibility cases one by one.
The case that both of you raised, about having a nested lists, and more importantly, the leaf level raw type is marked repeated. Is not going to be covered by this change. It is another backward compatibility case that we are exactly looking into. I was discussing with @hitarth on this case and @hitarth is actively working on it. There will likely be a separate PR on this case soon.
@chliang71 Do we gracefully handle (error) other cases? I feel we might segfault with the change in this PR on a 2-level encoded list. See my comment here. https://github.com/facebookincubator/velox/pull/7774#discussion_r1558322797
@majetideepak Thanks. We have merged this PR internally for a while, and from our observations, the other cases are still as is, unchanged. Specifically, the case brought up here (2-level, leaf level is marked repeated) is crashing due to segfault either with or without this PR. This case does not even touch any of the code changes in this PR I think, so is completely separate case. We are working on a separate fix for this case. (Also replied on the comment)
@Yuhta can you help review/land this PR? Thanks!
Thanks for taking a look @Yuhta .
I updated with VELOX_CHECK_GE(children.size(), 1)
for list and VELOX_CHECK_EQ(children.size(), 2)
for map. For list, there can be more than one child here actually, because this is a middle layer node (the 2nd case of backward compatibility case as in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists). Which is one reason we started with making this change.
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@Yuhta merged this pull request in facebookincubator/velox@720787a2da1bf0ed6ea647c8f589e4fc0d99c459.
Conbench analyzed the 1 benchmark run on commit 720787a2
.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.