[GOBBLIN-2167] Allow filtering of Hive datasets by underlying HDFS folder location

Open Will-Lo opened this issue 1 year ago • 1 comments

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
    • https://issues.apache.org/jira/browse/GOBBLIN-2167

Description

  • [ ] Here are some details about my PR, including screenshots (if applicable): Hive tables can be located in different HDFS folders even if they belong to the same database. This is tricky to manage within a single Gobblin job, especially when permissions and handling differ based on the underlying files for viewFS.

This PR adds a configuration property that takes a regex and filters tables by their table location:

hive.dataset.tableFolderAllowlistFilter=<regex>

where tables whose paths match this filter are selected and all other tables are ignored.
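
The filtering behavior described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `TableFolderAllowlistSketch` and its methods are hypothetical names, table locations are modeled as plain strings, and it assumes the regex must match the whole location (`Matcher.matches()`), which the actual implementation may or may not do. It also assumes that an unset filter lets every table through.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of an allowlist filter on Hive table locations.
// Names and matching semantics are assumptions, not Gobblin's actual API.
final class TableFolderAllowlistSketch {
    // Empty when no filter is configured.
    private final Optional<Pattern> allowlist;

    // regex stands in for the value of hive.dataset.tableFolderAllowlistFilter.
    TableFolderAllowlistSketch(String regex) {
        this.allowlist = (regex == null || regex.isEmpty())
            ? Optional.empty()
            : Optional.of(Pattern.compile(regex));
    }

    // Returns true when the table should be selected.
    boolean isAllowed(String tableLocation) {
        if (!allowlist.isPresent()) {
            return true; // assumption: no filter means no restriction
        }
        // Assumption: the regex must match the entire location string.
        return tableLocation != null
            && allowlist.get().matcher(tableLocation).matches();
    }

    // Keeps only the locations accepted by the allowlist, preserving order.
    List<String> filter(List<String> tableLocations) {
        return tableLocations.stream()
            .filter(this::isAllowed)
            .collect(Collectors.toList());
    }
}
```

Under these assumptions, a filter of `/data/tracking/.*` would select a table located at `/data/tracking/db1/tableA` and ignore one located at `/jobs/other/db1/tableB`, even though both belong to the same database.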

Tests

  • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason: Unit tests

Commits

  • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Will-Lo, Oct 19 '24 03:10

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 45.36%. Comparing base (45ad13e) to head (d4e60be). Report is 15 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4069      +/-   ##
============================================
+ Coverage     45.12%   45.36%   +0.24%     
+ Complexity     3199     3181      -18     
============================================
  Files           705      695      -10     
  Lines         26949    26587     -362     
  Branches       2680     2655      -25     
============================================
- Hits          12160    12061      -99     
+ Misses        13781    13523     -258     
+ Partials       1008     1003       -5     


codecov-commenter, Oct 19 '24 04:10

LGTM!

khandelwal-prateek, Oct 22 '24 03:10