[GOBBLIN-803] HivePartition record count

Open autumnust opened this issue 6 years ago • 1 comments

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • [x] My PR addresses the following [Gobblin JIRA]
    • https://issues.apache.org/jira/browse/GOBBLIN-803

Description

  • [x] Here are some details about my PR, including screenshots (if applicable):

  • Collection of record count metadata: each WorkUnit carries the low watermark and high watermark for a Kafka partition, and the difference between them is the number of events handled by that WorkUnit. Since each WorkUnit outputs only a single HDFS file under a path to be registered, we maintain a map (pathToRecordCount) from each to-register path to the number of records it contains.

  • Whenever a HiveSpec logically represents a partition (spec.getPartition.isPresent is true), it is updated with the record count collected in pathToRecordCount before being handed over to hiveRegister.

  • Added a unit test that checks the correctness of the record counting.
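
The flow described in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: WorkUnitInfo, SpecInfo, and the "recordCount" parameter key are hypothetical stand-ins (they are not Gobblin classes or property names), and the sketch assumes the watermarks form a half-open offset range, so the count is high minus low.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class RecordCountSketch {

  /** Hypothetical stand-in for a WorkUnit: one Kafka offset range, one output path. */
  static final class WorkUnitInfo {
    final String outputPath;
    final long lowWatermark;
    final long highWatermark;

    WorkUnitInfo(String outputPath, long lowWatermark, long highWatermark) {
      this.outputPath = outputPath;
      this.lowWatermark = lowWatermark;
      this.highWatermark = highWatermark;
    }
  }

  /** Hypothetical stand-in for a HiveSpec: a path, an optional partition, and parameters. */
  static final class SpecInfo {
    final String path;
    final Optional<String> partition;
    final Map<String, String> parameters = new HashMap<>();

    SpecInfo(String path, Optional<String> partition) {
      this.path = path;
      this.partition = partition;
    }
  }

  /** Assumption: offsets in [low, high) were consumed, so the count is high - low. */
  static long recordCount(long lowWatermark, long highWatermark) {
    return Math.max(0L, highWatermark - lowWatermark);
  }

  /** Builds the path -> record count map, one entry per to-register path. */
  static Map<String, Long> buildPathToRecordCount(List<WorkUnitInfo> workUnits) {
    Map<String, Long> pathToRecordCount = new HashMap<>();
    for (WorkUnitInfo wu : workUnits) {
      // merge() sums counts if two work units happen to share a path.
      pathToRecordCount.merge(
          wu.outputPath, recordCount(wu.lowWatermark, wu.highWatermark), Long::sum);
    }
    return pathToRecordCount;
  }

  /** Attaches the count only when the spec represents a partition and a count is known. */
  static void annotate(SpecInfo spec, Map<String, Long> pathToRecordCount) {
    Long count = pathToRecordCount.get(spec.path);
    if (spec.partition.isPresent() && count != null) {
      // "recordCount" is a hypothetical parameter key used for illustration only.
      spec.parameters.put("recordCount", Long.toString(count));
    }
  }

  public static void main(String[] args) {
    Map<String, Long> counts = buildPathToRecordCount(Arrays.asList(
        new WorkUnitInfo("/data/topic/hourly/2019/06/11/21", 100L, 250L),
        new WorkUnitInfo("/data/topic/hourly/2019/06/11/22", 250L, 300L)));

    SpecInfo partitioned =
        new SpecInfo("/data/topic/hourly/2019/06/11/21", Optional.of("2019-06-11-21"));
    annotate(partitioned, counts);
    System.out.println(partitioned.parameters);
  }
}
```

A spec without a partition is left untouched by annotate, which mirrors the condition in the description that only partition-level specs receive the count before registration.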

Tests

  • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

autumnust, Jun 11 '19 21:06

@ibuenros Can you help review? Thanks

autumnust, Jun 11 '19 21:06