hudi icon indicating copy to clipboard operation
hudi copied to clipboard

[HUDI-1887] Setting default value to false for enabling schema post processor

Open nsivabalan opened this issue 4 years ago • 10 comments

What is the purpose of the pull request

https://issues.apache.org/jira/browse/HUDI-1343 was added so that default and nulls are handled properly. This fix is applicable only to deltastreamer flow. But later, we had a fix in spark ds layer which fixed the defaults and null values. So, the schema post processor is not really required anymore. Hence making the default value to disabled.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • [ ] Has a corresponding JIRA in PR title & commit

  • [ ] Commit message is descriptive of the change

  • [ ] CI is green

  • [ ] Necessary doc changes done or have another open PR

  • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

nsivabalan avatar May 02 '21 03:05 nsivabalan

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 52.90%. Comparing base (c5220b9) to head (701640c). Report is 4136 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2911   +/-   ##
=========================================
  Coverage     52.90%   52.90%           
+ Complexity     3748     3746    -2     
=========================================
  Files           488      488           
  Lines         23574    23574           
  Branches       2510     2510           
=========================================
  Hits          12472    12472           
- Misses         9998     9999    +1     
+ Partials       1104     1103    -1     
Flag Coverage Δ
hudicli 39.53% <ø> (ø)
hudiclient ∅ <ø> (∅)
hudicommon 50.38% <ø> (+0.01%) :arrow_up:
hudiflink 58.98% <ø> (ø)
hudihadoopmr 33.33% <ø> (ø)
hudisparkdatasource 73.33% <ø> (ø)
hudisync 46.73% <ø> (ø)
huditimelineservice 64.36% <ø> (ø)
hudiutilities 69.43% <100.00%> (-0.10%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...in/java/org/apache/hudi/utilities/UtilHelpers.java 63.37% <100.00%> (-1.17%) :arrow_down:

... and 1 file with indirect coverage changes

codecov-commenter avatar May 02 '21 03:05 codecov-commenter

@nsivabalan are n't there any tests affected by this change? Also do we even need this post processor feature anymore? Should can the entire feature instead of just making it not enabled by default?

vinothchandar avatar May 11 '21 01:05 vinothchandar

yeah, I am trying to understand the implications of removing the post processor. we may not have namespace in delta streamer's target schema if we remove the post processor. so what we need to understand is, what incase schema has namespace in hudi data files, and then delta streamer produces records w/ schema not having namespace. marking as WIP for now.

nsivabalan avatar May 14 '21 04:05 nsivabalan

yes, Only reason we still wanted to have is, just incase some user wishes to migrate from a path of having namespace to a path of not having namespace. but if we can totally remove it, I am all for it.

nsivabalan avatar Jun 11 '21 04:06 nsivabalan

@n3nash : could you think of any reason why we need to have this instead of removing it altogether.

nsivabalan avatar Jun 11 '21 04:06 nsivabalan

@nsivabalan It's the same reason you mentioned. The post processor adds the namespace. So if users have written log files with AVRO name-spacing and we disable post process for them, it will break. But this only happens in a specific path of DeltaSync - can you point out which one ? From my interactions in the community, I can suggest whether that path is a frequently used path or not - if it's not, let's remove this complexity.

n3nash avatar Jul 02 '21 06:07 n3nash

@n3nash : when there is no target schema set and so hudi falls back to RowBasedSchemaProvider for target shema.

nsivabalan avatar Jul 02 '21 20:07 nsivabalan

I am marking this a non-blocker.

vinothchandar avatar Aug 04 '21 19:08 vinothchandar

@vinothchandar : If you are good with the patch, I can rebase and land it.

nsivabalan avatar Aug 05 '21 02:08 nsivabalan

CI report:

  • 701640caeb25bea56d6a8c5847794ca56a2535c3 Azure: FAILURE
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Nov 05 '21 02:11 hudi-bot

We have added more SchemaPostProcessor implementation since this PR is posted, including ChainedSchemaPostProcessor, KafkaOffsetPostProcessor, DropColumnSchemaPostProcessor, and DeleteSupportSchemaPostProcessor, we cannot disable it by default any more. Closing this PR.

yihua avatar Sep 10 '24 15:09 yihua