hudi
[HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module
Change Logs
This PR replaces most FileSystem, Path, and FileStatus usage in the hudi-common module with HoodieStorage, HoodieLocation, and HoodieFileStatus (introduced in #10567). The goal is to remove the dependency on the Hadoop FS abstraction, which is not essential to most of Hudi's core read and write logic.
This PR still keeps using the Hadoop FileSystem-based implementation under the hood. A follow-up PR will make HoodieStorage and I/O implementation pluggable.
The focus of this PR is to reduce the usage of Hadoop's FileSystem, Path, and FileStatus in the hudi-common module, mainly to decouple the read path from these classes. Other modules may therefore still contain leftovers (e.g., the write path and Spark code, which have to rely on these classes), bridging code (e.g., new Path(hoodieLocation.toUri())), and type-casting code (e.g., (FileSystem) storage.getFileSystem()). These are expected. The clean-up and code improvements in other modules are deferred to HUDI-7363.
Here are details to attend to (especially for reviewers):
- `HoodieStorage` replaces `FileSystem` wherever only the storage abstraction is needed.
  - `HoodieTableMetaClient` provides `HoodieStorage` with `getHoodieStorage()`, replacing `getFs()`.
  - `HoodieStorage` calls replace `FileSystem` calls. A few differences to note:
    - `storage.createDirectory(location)` replaces `fs.mkdirs(path)`.
    - `storage.deleteFile(location)` replaces `fs.delete(path, false)`; `storage.deleteDirectory(location)` replaces `fs.delete(path, true)`.
    - `HoodieStorage` list and glob calls return a Java `List` instead of an array.
- `HoodieLocation` replaces `Path` and `CachingPath` wherever a `Path` instance is not specifically needed (e.g., serving the file system view).
  - `HoodieLocation` holds a URI instance internally, which is supposed to be compatible with the `Path` URI. Conversion between the two is done as follows:
    - `Path` to `HoodieLocation`: `new HoodieLocation(path.toUri())`
    - `HoodieLocation` to `Path`: `new Path(hoodieLocation.toUri())`
- `HoodieLocationFilter` replaces `PathFilter`.
- The `HoodieFileStatus` POJO replaces `FileStatus`, storing only the necessary fields.
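For reviewers who want a feel for the shape of these changes, here is a minimal, dependency-free sketch of the points above: separate file and directory deletes, list calls that return a `List`, a trimmed-down status POJO, and bridging through a URI. Everything in it is a hypothetical stand-in (`StorageSketch`, `Location`, `FileInfo`, `LocalStorage`, `listDirectEntries`), and `java.nio.file.Path` plays the role that Hadoop's `Path` plays in the real code. It is not the actual `HoodieStorage` or `HoodieLocation` API from this PR.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-ins for illustration only; these are NOT Hudi's classes.
class StorageSketch {

  /** Stand-in for a location type that only wraps a URI. */
  static final class Location {
    private final URI uri;
    Location(URI uri) { this.uri = uri; }
    URI toUri() { return uri; }
  }

  /** Stand-in for a status POJO that keeps only a few fields. */
  static final class FileInfo {
    final Location location;
    final long length;
    final boolean isDirectory;
    FileInfo(Location location, long length, boolean isDirectory) {
      this.location = location;
      this.length = length;
      this.isDirectory = isDirectory;
    }
  }

  /** Stand-in storage backed by java.nio on the local file system. */
  static final class LocalStorage {
    boolean createDirectory(Location dir) throws IOException {
      Files.createDirectories(Paths.get(dir.toUri()));
      return true;
    }

    // Non-recursive delete, analogous to fs.delete(path, false).
    boolean deleteFile(Location file) throws IOException {
      return Files.deleteIfExists(Paths.get(file.toUri()));
    }

    // Recursive delete, analogous to fs.delete(path, true).
    boolean deleteDirectory(Location dir) throws IOException {
      Path root = Paths.get(dir.toUri());
      if (!Files.exists(root)) {
        return false;
      }
      try (Stream<Path> walk = Files.walk(root)) {
        // Delete children before their parents.
        List<Path> deepestFirst =
            walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        for (Path p : deepestFirst) {
          Files.delete(p);
        }
      }
      return true;
    }

    // Returns a List rather than an array.
    List<FileInfo> listDirectEntries(Location dir) throws IOException {
      List<FileInfo> result = new ArrayList<>();
      try (Stream<Path> entries = Files.list(Paths.get(dir.toUri()))) {
        for (Path p : entries.sorted().collect(Collectors.toList())) {
          boolean isDir = Files.isDirectory(p);
          result.add(new FileInfo(new Location(p.toUri()), isDir ? 0L : Files.size(p), isDir));
        }
      }
      return result;
    }
  }

  public static void main(String[] args) throws IOException {
    LocalStorage storage = new LocalStorage();
    Path tmp = Files.createTempDirectory("storage-sketch");
    Location table = new Location(tmp.resolve("table").toUri());
    storage.createDirectory(table);

    // Bridging in both directions goes through the URI.
    Path asPath = Paths.get(table.toUri());
    Location back = new Location(asPath.toUri());
    Files.write(asPath.resolve("data.txt"), "abc".getBytes(StandardCharsets.UTF_8));

    for (FileInfo info : storage.listDirectEntries(table)) {
      System.out.println(info.location.toUri() + " length=" + info.length);
    }
    storage.deleteFile(new Location(asPath.resolve("data.txt").toUri()));
    storage.deleteDirectory(back);
    Files.delete(tmp);
  }
}
```

Splitting delete into two methods makes the recursive case explicit at the call site instead of hiding it behind a boolean flag, which is the main reason the mapping from `fs.delete(path, ...)` deserves attention during review.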
This is part of the effort to provide a Hudi storage abstraction and decouple hudi-common from Hadoop dependencies. For reference, the single big-change PR can be found here: #10360.
Impact
One step further to decouple hudi-common module from Hadoop dependencies.
Risk level
low
Documentation Update
N/A
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Note to reviewer: commit 44e3347 is frozen now, and I'll only add new commits for new changes and fixes to make review easier. I'll also defer rebasing and force-pushing until CI passes and the PR is approved.
I need to check a few more things before landing this PR.
The PR is rebased on the latest master and ready to land once CI passes.
CI report:
- 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
- 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN
- e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN
- 1fcbd76df84231fad6ab70db2378b169f071ee61 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build