velox icon indicating copy to clipboard operation
velox copied to clipboard

Parquet reader should use parent type for nested types

Open chliang71 opened this issue 1 year ago • 9 comments

When traversing the Parquet metadat and evaluating a placeholder node (node created to represent "array" and "map") currently, it decides whether it is array(list) or map based on the number of children. But seems for certain Parquet files, this is not necessarily accurate. We think the better check is to check on the parent of the placeholder node, and see if what is the type of the parent.

chliang71 avatar Nov 28 '23 22:11 chliang71

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 6f8c8c597c5e987b84f4bc73db388026e7ddedf8
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/661eb8612509bd0008aaf45e

netlify[bot] avatar Nov 28 '23 22:11 netlify[bot]

Hi @chliang71!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Nov 28 '23 22:11 facebook-github-bot

As discussed offline, I created https://github.com/facebookincubator/velox/issues/7929. @chliang71 will just add necessary tests to this PR, and will work on https://github.com/facebookincubator/velox/issues/7929 in a separate PR.

yingsu00 avatar Dec 08 '23 07:12 yingsu00

As discussed, will file followup PRs on the additonal changes. For this PR, added a unit test and addressed comments.

chliang71 avatar Dec 11 '23 22:12 chliang71

The CI log shows:

The following tests FAILED: 209 - velox_expression_test (Failed) 296 - velox_exec_test (Timeout)

However in my local test,

  • velox_expression_test was running fine and successfully finished
  • velox_exec_test finished in ~1 hour but got one test fail with this error: Expression: fd_ >= 0 (-1 vs. 0) open failure in LocalReadFile constructor, -1 /path/to/nowhere.orc No such file or directory. I have tried multiple times both locally and re-trigger test on this PR but the outcome has been the same. Neither issue seems to related to my change in this PR though. I'm inclined to there is some issue with the tests themselves @yingsu00 any insights on how should we proceed?

chliang71 avatar Jan 09 '24 18:01 chliang71

@chliang71 Can you please rebase? The Minio server installation was already fixed in trunk.

yingsu00 avatar Feb 13 '24 14:02 yingsu00

I had to remove the "_tuple" name check after which the tests seemed to pass. I guess the column name based check still needs to be revisited.

chliang71 avatar Feb 28 '24 00:02 chliang71

@yingsu00 mind to review the diff? Thanks

chliang71 avatar Mar 26 '24 17:03 chliang71

friendly ping @yingsu00 @majetideepak mind to review? Thanks

chliang71 avatar Apr 03 '24 22:04 chliang71

@chliang71 There is a red CI, can you please rebase and push?

yingsu00 avatar Apr 04 '24 11:04 yingsu00

I rebased and pushed again. @yingsu00 @xiaoxmeng @majetideepak

chliang71 avatar Apr 09 '24 21:04 chliang71

@chliang71 Will this fix the issue raised here https://github.com/facebookincubator/velox/issues/9242#issuecomment-2045352283? Thanks!

majetideepak avatar Apr 10 '24 01:04 majetideepak

@chliang71 Can such complex situations be handled?

message hive_schema {
  optional group test (LIST) {
    repeated group array (LIST) {
      repeated group array (LIST) {
        repeated group array (LIST) {
          repeated group array (LIST) {
            repeated group array (LIST) {
              repeated group array (LIST) {
                repeated group array (LIST) {
                  repeated group array (LIST) {
                    repeated group array (LIST) {
                      repeated group array (LIST) {
                        repeated group array (LIST) {
                          repeated int32 array;
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

8dukongjian avatar Apr 10 '24 05:04 8dukongjian

Use the following method https://github.com/facebookincubator/velox/issues/7478 to test multi-level nested array types. This fix cannot recognize multi-level nested array types.

test case:testSingleLevelSchemaNestedArrays presto type:array(array(array(integer))) velox type:ARRAY<INTEGER>

8dukongjian avatar Apr 10 '24 06:04 8dukongjian

Thanks for taking a look @majetideepak , @8dukongjian . Just to clarify, this change is not to address all the backward compatibility cases. This change is to address a bug in current impl, when inferring child type. We are approaching all the backward compatibility cases one by one.

The case that both of you raised, about having a nested lists, and more importantly, the leaf level raw type is marked repeated. Is not going to be covered by this change. It is another backward compatibility case that we are exactly looking into. I was discussing with @hitarth on this case and @hitarth is actively working on it. There will likely be a separate PR on this case soon.

chliang71 avatar Apr 10 '24 16:04 chliang71

@chliang71 Do we gracefully handle (error) other cases? I feel we might segfault with the change in this PR on a 2-level encoded list. See my comment here. https://github.com/facebookincubator/velox/pull/7774#discussion_r1558322797

majetideepak avatar Apr 10 '24 20:04 majetideepak

@majetideepak Thanks. We have merged this PR internally for a while, and from our observations, the other cases are still as is, unchanged. Specifically, the case brought up here (2-level, leaf level is marked repeated) is crashing due to segfault either with or without this PR. This case does not even touch any of the code changes in this PR I think, so is completely separate case. We are working on a separate fix for this case. (Also replied on the comment)

chliang71 avatar Apr 10 '24 23:04 chliang71

@Yuhta can you help review/land this PR? Thanks!

majetideepak avatar Apr 11 '24 01:04 majetideepak

Thanks for taking a look @Yuhta .

I updated with VELOX_CHECK_GE(children.size(), 1) for list and VELOX_CHECK_EQ(children.size(), 2) for map. For list, there can be more than one child here actually, because this is a middle layer node (the 2nd case of backward compatibility case as in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists). Which is one reason we started with making this change.

chliang71 avatar Apr 16 '24 18:04 chliang71

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 17 '24 14:04 facebook-github-bot

@Yuhta merged this pull request in facebookincubator/velox@720787a2da1bf0ed6ea647c8f589e4fc0d99c459.

facebook-github-bot avatar Apr 17 '24 20:04 facebook-github-bot

Conbench analyzed the 1 benchmark run on commit 720787a2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

conbench-facebook[bot] avatar Apr 17 '24 21:04 conbench-facebook[bot]