gobblin [GOBBLIN-386] Optimize the memory usage of schema fetch in Hive Registration

Open autumnust opened this issue 7 years ago • 2 comments

This PR addresses a problem where Hive registration blows up memory, because the old schema-fetching process requires (1) getting all .avro files within a dataset folder and (2) sorting all of these files by their modification time.

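The solution is to perform a depth-first search (DFS) of the dataset folder while comparing modification timestamps on the fly, so the full list of files never has to be collected and sorted.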
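
To illustrate the idea, here is a minimal sketch of a depth-first walk that keeps only the most recently modified .avro file seen so far, rather than listing and sorting every file. It is not the code from this PR: the class name `LatestAvroFileFinder` and its methods are hypothetical, and it uses `java.nio.file` on a local filesystem for self-containment, whereas the actual change presumably works against the Hadoop `FileSystem` API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Gobblin.
final class LatestAvroFileFinder {

    // The only state kept during the walk: the newest .avro file seen so far.
    private static final class Latest {
        Path path;
        FileTime modified;
    }

    // Returns the most recently modified .avro file under datasetRoot, if any.
    static Optional<Path> findLatestAvroFile(Path datasetRoot) throws IOException {
        Latest latest = new Latest();
        visit(datasetRoot, latest);
        return Optional.ofNullable(latest.path);
    }

    // Depth-first traversal; timestamps are compared as each file is visited.
    private static void visit(Path dir, Latest latest) throws IOException {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    visit(child, latest);
                } else if (child.getFileName().toString().endsWith(".avro")) {
                    FileTime modified = Files.getLastModifiedTime(child);
                    if (latest.modified == null || modified.compareTo(latest.modified) > 0) {
                        latest.path = child;
                        latest.modified = modified;
                    }
                }
            }
        }
    }
}
```

With this shape, the extra memory held is one candidate path plus the recursion stack (one open directory stream per level), instead of a list of every .avro file in the dataset. Note that `Files.isDirectory` follows symbolic links, so a real implementation would need to guard against link cycles.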

JIRA

[x] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-386] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-386

Description

[x] Here are some details about my PR, including screenshots (if applicable):

Tests

[x] My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

[x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Jan 23 '18 23:01 autumnust

@ibuenros @erwa Can you help review? Thanks.

Jan 23 '18 23:01 autumnust

Looks good to me.

Jan 25 '18 01:01 erwa
