hudi
hudi copied to clipboard
[HUDI-1887] Setting default value to false for enabling schema post processor
What is the purpose of the pull request
https://issues.apache.org/jira/browse/HUDI-1343 was added so that default and nulls are handled properly. This fix is applicable only to deltastreamer flow. But later, we had a fix in spark ds layer which fixed the defaults and null values. So, the schema post processor is not really required anymore. Hence making the default value to disabled.
Brief change log
(for example:)
- Modify AnnotationLocation checkstyle rule in checkstyle.xml
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
- Added integration tests for end-to-end.
- Added HoodieClientWriteTest to verify the change.
- Manually verified the change by running a job locally.
Committer checklist
-
[ ] Has a corresponding JIRA in PR title & commit
-
[ ] Commit message is descriptive of the change
-
[ ] CI is green
-
[ ] Necessary doc changes done or have another open PR
-
[ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 52.90%. Comparing base (
c5220b9) to head (701640c). Report is 4136 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #2911 +/- ##
=========================================
Coverage 52.90% 52.90%
+ Complexity 3748 3746 -2
=========================================
Files 488 488
Lines 23574 23574
Branches 2510 2510
=========================================
Hits 12472 12472
- Misses 9998 9999 +1
+ Partials 1104 1103 -1
| Flag | Coverage Δ | |
|---|---|---|
| hudicli | 39.53% <ø> (ø) |
|
| hudiclient | ∅ <ø> (∅) |
|
| hudicommon | 50.38% <ø> (+0.01%) |
:arrow_up: |
| hudiflink | 58.98% <ø> (ø) |
|
| hudihadoopmr | 33.33% <ø> (ø) |
|
| hudisparkdatasource | 73.33% <ø> (ø) |
|
| hudisync | 46.73% <ø> (ø) |
|
| huditimelineservice | 64.36% <ø> (ø) |
|
| hudiutilities | 69.43% <100.00%> (-0.10%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...in/java/org/apache/hudi/utilities/UtilHelpers.java | 63.37% <100.00%> (-1.17%) |
:arrow_down: |
@nsivabalan are n't there any tests affected by this change? Also do we even need this post processor feature anymore? Should can the entire feature instead of just making it not enabled by default?
yeah, I am trying to understand the implications of removing the post processor. we may not have namespace in delta streamer's target schema if we remove the post processor. so what we need to understand is, what incase schema has namespace in hudi data files, and then delta streamer produces records w/ schema not having namespace. marking as WIP for now.
yes, Only reason we still wanted to have is, just incase some user wishes to migrate from a path of having namespace to a path of not having namespace. but if we can totally remove it, I am all for it.
@n3nash : could you think of any reason why we need to have this instead of removing it altogether.
@nsivabalan It's the same reason you mentioned. The post processor adds the namespace. So if users have written log files with AVRO name-spacing and we disable post process for them, it will break. But this only happens in a specific path of DeltaSync - can you point out which one ? From my interactions in the community, I can suggest whether that path is a frequently used path or not - if it's not, let's remove this complexity.
@n3nash : when there is no target schema set and so hudi falls back to RowBasedSchemaProvider for target shema.
I am marking this a non-blocker.
@vinothchandar : If you are good with the patch, I can rebase and land it.
CI report:
- 701640caeb25bea56d6a8c5847794ca56a2535c3 Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:@hudi-bot run azurere-run the last Azure build
We have added more SchemaPostProcessor implementation since this PR is posted, including ChainedSchemaPostProcessor, KafkaOffsetPostProcessor, DropColumnSchemaPostProcessor, and DeleteSupportSchemaPostProcessor, we cannot disable it by default any more. Closing this PR.