[HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module

Open yihua opened this issue 1 year ago • 2 comments

Change Logs

This PR replaces most FileSystem, Path, and FileStatus usage in the hudi-common module with HoodieStorage, HoodieLocation, and HoodieFileStatus (introduced in #10567), to remove the dependency on the Hadoop FS abstraction, which is not essential to most Hudi core read and write logic.

This PR still keeps using the Hadoop FileSystem-based implementation under the hood. A follow-up PR will make HoodieStorage and I/O implementation pluggable.

The focus of this PR is to reduce the use of Hadoop's FileSystem, Path, and FileStatus in the hudi-common module, mainly to decouple the read path from these classes. Other modules may therefore still contain leftovers (e.g., on the write path and in Spark, which have to rely on these classes), bridging code (e.g., new Path(hoodieLocation.toUri())), and type-casting code (e.g., (FileSystem) storage.getFileSystem()); these are expected. The clean-up and code improvements of other modules are deferred to HUDI-7363.

Here are details to attend to (especially for reviewers):

  • HoodieStorage replaces FileSystem wherever only the storage abstraction is needed. HoodieTableMetaClient provides HoodieStorage through getHoodieStorage(), which replaces getFs().
  • HoodieStorage calls replace FileSystem calls. A few differences to keep in mind:
    • storage.createDirectory(location) replaces fs.mkdirs(path)
    • storage.deleteFile(location) replaces fs.delete(path, false); storage.deleteDirectory(location) replaces fs.delete(path, true).
    • HoodieStorage list and glob calls return a Java List instead of an array.
  • HoodieLocation replaces Path and CachingPath wherever a Path instance is not specifically needed (e.g., serving the file system view). HoodieLocation holds a URI instance internally, which is supposed to be compatible with the Path URI. The conversion between the two is done as follows: (1) Path to HoodieLocation: new HoodieLocation(path.toUri()); (2) HoodieLocation to Path: new Path(hoodieLocation.toUri()).
  • HoodieLocationFilter replaces PathFilter.
  • The HoodieFileStatus POJO replaces FileStatus, storing only the necessary fields.
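
As a rough illustration of the call mapping listed above, the following self-contained Java sketch uses hypothetical stand-in types (Location, Storage, LocalStorage) backed by java.nio.file; it does not use the actual Hudi or Hadoop classes. Only the method names createDirectory, deleteFile, and deleteDirectory come from this description. The listEntries name, all signatures, and the error behavior are assumptions made for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StorageSketch {

    /** Hypothetical stand-in for a location type: it only wraps a URI. */
    static final class Location {
        private final URI uri;

        Location(URI uri) {
            this.uri = uri;
        }

        URI toUri() {
            return uri;
        }
    }

    /** Hypothetical stand-in for a storage abstraction (not the Hudi interface). */
    interface Storage {
        boolean createDirectory(Location location) throws IOException;

        // Separate calls replace a single delete(path, recursive) style method.
        boolean deleteFile(Location location) throws IOException;

        boolean deleteDirectory(Location location) throws IOException;

        // Returns a List rather than an array.
        List<Location> listEntries(Location location) throws IOException;
    }

    /** Local file system implementation, used only to make the sketch runnable. */
    static final class LocalStorage implements Storage {

        // Bridging goes through the URI, in the same spirit as the
        // Path <-> location conversion described above.
        private static Path toPath(Location location) {
            return Paths.get(location.toUri());
        }

        @Override
        public boolean createDirectory(Location location) throws IOException {
            Files.createDirectories(toPath(location));
            return true;
        }

        @Override
        public boolean deleteFile(Location location) throws IOException {
            Path path = toPath(location);
            if (Files.isDirectory(path)) {
                // Assumption of this sketch: deleting a directory here is an error.
                throw new IOException("Not a file: " + path);
            }
            return Files.deleteIfExists(path);
        }

        @Override
        public boolean deleteDirectory(Location location) throws IOException {
            Path root = toPath(location);
            if (!Files.exists(root)) {
                return false;
            }
            try (Stream<Path> walk = Files.walk(root)) {
                // Reverse order so that children are deleted before their parents.
                List<Path> paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
                for (Path path : paths) {
                    Files.delete(path);
                }
            }
            return true;
        }

        @Override
        public List<Location> listEntries(Location location) throws IOException {
            try (Stream<Path> entries = Files.list(toPath(location))) {
                return entries.map(p -> new Location(p.toUri()))
                        .collect(Collectors.toList());
            }
        }
    }
}
```

Callers in this sketch depend only on Storage and Location, so the backing implementation could be swapped without touching them, which is the kind of decoupling this PR is aiming for.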

This is part of the effort to provide Hudi storage abstraction and decouple hudi-common from hadoop dependencies. For reference, the single big-change PR can be found here: #10360.

Impact

One step further to decouple hudi-common module from Hadoop dependencies.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

yihua avatar Jan 31 '24 02:01 yihua

Note to reviewer: commit 44e3347 is now frozen, and I'll only add new commits for new changes and fixes to make review easier. I'll also defer rebasing and force-pushing until CI passes and the PR is approved.

yihua avatar Feb 01 '24 02:02 yihua

I need to check a few more things before landing this PR.

yihua avatar Apr 18 '24 02:04 yihua

The PR is rebased on the latest master and ready to land once CI passes.

yihua avatar Apr 18 '24 16:04 yihua

CI report:

  • 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
  • 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN
  • e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN
  • 1fcbd76df84231fad6ab70db2378b169f071ee61 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Apr 18 '24 17:04 hudi-bot