gobblin
[GOBBLIN-2167] Allow filtering of Hive datasets by underlying HDFS folder location
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
- [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-2167
Description
- [ ] Here are some details about my PR, including screenshots (if applicable): Hive tables that belong to the same database can still be located in different HDFS folders. This is tricky to manage within a single Gobblin job, especially when permissions and handling differ based on the underlying files for viewFS.
This PR adds a configuration property that takes a regex and filters tables by their table location:
hive.dataset.tableFolderAllowlistFilter=<regex>
Tables whose location matches this regex are selected; all other tables are ignored.
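To illustrate the intended behavior, here is a minimal standalone sketch of location-based allowlisting. It is not the code from this PR: the class name, the `selectTables` helper, and the sample paths are hypothetical, and it assumes the regex must match the whole location string and that an unset filter selects every table. The actual implementation may differ on both points.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of filtering tables by their HDFS location; not Gobblin's actual code.
class TableFolderAllowlistSketch {

  // Returns the names of tables whose location matches the allowlist regex.
  // Assumption: an absent filter means "select everything".
  static List<String> selectTables(Map<String, String> tableLocations,
                                   Optional<String> allowlistRegex) {
    Optional<Pattern> pattern = allowlistRegex.map(Pattern::compile);
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, String> entry : tableLocations.entrySet()) {
      // Assumption: the regex must match the entire location (matches(), not find()).
      if (!pattern.isPresent() || pattern.get().matcher(entry.getValue()).matches()) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // Two tables in the same database, stored under different folders (made-up paths).
    Map<String, String> locations = new LinkedHashMap<>();
    locations.put("db.events", "/data/tracking/events");
    locations.put("db.users", "/data/restricted/users");

    // Equivalent in spirit to hive.dataset.tableFolderAllowlistFilter=/data/tracking/.*
    System.out.println(selectTables(locations, Optional.of("/data/tracking/.*")));
    // prints [db.events]
  }
}
```

With this shape, a single job can target only the tables under one folder tree, even when the database also holds tables stored elsewhere.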
Tests
- [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason: Unit tests
Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
- Subject is separated from body by a blank line
- Subject is limited to 50 characters
- Subject does not end with a period
- Subject uses the imperative mood ("add", not "adding")
- Body wraps at 72 characters
- Body explains "what" and "why", not "how"
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 45.36%. Comparing base (45ad13e) to head (d4e60be). Report is 15 commits behind head on master.
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4069      +/-   ##
============================================
+ Coverage     45.12%   45.36%   +0.24%
+ Complexity     3199     3181      -18
============================================
  Files           705      695      -10
  Lines         26949    26587     -362
  Branches       2680     2655      -25
============================================
- Hits          12160    12061      -99
+ Misses        13781    13523     -258
+ Partials       1008     1003       -5
LGTM!