[HUDI-7267]fix dataSkipping with null column stats

Open KnightChess opened this issue 1 year ago • 2 comments

As the picture shows, CSI (the column stats index) uses the Parquet column chunk metadata to calculate min/max values and saves them to the MDT (metadata table) column stats. For complex columns, such as info array<struct<name: string, age: int>>, the Parquet metadata contains only info.array.name and info.array.age, but Hudi only computes stats for the info column, so this metadata in the MDT will be null.

If the SQL expression contains IsNotNull(info), all files will be skipped.

Also consider ordinary columns that are added in the future: old files will not contain such a column, which may cause other problems. So, to keep the code logic clean, check for null before evaluating the values: min/max/nullValue. image
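
A minimal sketch of the null check described above, written in Python for brevity rather than in Hudi's own code. The names `ColStats`, `may_contain_not_null`, `may_contain_greater_than`, and `prune_files` are hypothetical and are not Hudi APIs. The assumption is that a missing or null stat means "unknown", so the file must be kept instead of skipped.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ColStats:
    """Hypothetical per-file column stats record; any field may be None."""
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    null_count: Optional[int] = None
    value_count: Optional[int] = None


def may_contain_not_null(stats: Optional[ColStats]) -> bool:
    """Return True if the file has to be kept for `col IS NOT NULL`."""
    # Missing stats mean "unknown": the file cannot be pruned safely.
    if stats is None or stats.null_count is None or stats.value_count is None:
        return True
    return stats.null_count < stats.value_count


def may_contain_greater_than(stats: Optional[ColStats], literal: Any) -> bool:
    """Return True if the file has to be kept for `col > literal`."""
    # Same rule: a null max value must not be compared, keep the file.
    if stats is None or stats.max_value is None:
        return True
    return stats.max_value > literal


def prune_files(index: Dict[str, Dict[str, ColStats]], column: str) -> List[str]:
    """Keep every file that may hold non-null values of `column`."""
    return [f for f, cols in index.items() if may_contain_not_null(cols.get(column))]


# Illustrative data: f1 predates the `info` column, f2 has an all-null stats row.
index = {
    "f1.parquet": {"id": ColStats(1, 10, 0, 10)},
    "f2.parquet": {"id": ColStats(11, 20, 0, 10), "info": ColStats()},
}
kept = prune_files(index, "info")
```

With this rule, a file is pruned only when its stats positively prove that no row can match. Both a file written before the column existed and a file whose stats row is all null stay in the candidate set.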

Change Logs

  • Check for null before evaluating the values: min/max/nullValue

Impact

None

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

KnightChess, Dec 28 '23 05:12

I see related changes: https://github.com/apache/hudi/pull/10389

danny0405, Dec 28 '23 07:12

CI report:

  • d07bc703721ad554a2ada4c0da1697eb7bd1a996 Azure: CANCELED
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot, Dec 28 '23 12:12

I see related changes: #10389

Looks like it hits the same problem, so I am closing this issue. Thanks, @danny0405.

KnightChess, Dec 29 '23 02:12