[GOBBLIN-568] FsDataWriter lacks support for multiple files per partition

Open • tilakpatidar opened this issue 7 years ago • 1 comment

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • [x] GOBBLIN-568 FsDataWriter lacks support for multiple files per partition using record count or block size

Description

  • [x] Currently, FsDataWriter cannot write multiple files per partition, so each partition produces a single, potentially very large file. This is inadvisable for file formats such as Parquet.
  • [x] org.apache.gobblin.writer.MultipleFilesFsDataWriter extends org.apache.gobblin.writer.FsDataWriter and acts as a wrapper around it, so that a new writer is instantiated once a per-file record threshold is reached. The threshold is configured with ConfigurationKeys.WRITER_RECORDS_PER_FILE_THRESHOLD.
  • [x] org.apache.gobblin.writer.MultipleFilesFsDataWriterBuilder provides a way to build such a writer. The user must implement MultipleFilesFsDataWriterBuilder::getNewWriter in the builder to define how each new writer is instantiated.
  • [x] As an example of how to build such writers in the future, refer to org.apache.gobblin.writer.ParquetMultipleFilesHdfsDataWriter and org.apache.gobblin.writer.ParquetMultipleFilesDataWriterBuilder. They demonstrate a writer that opens a new file whenever the record count threshold is reached.
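
The rollover idea described above can be sketched in plain Java, independent of Gobblin. This is only an illustration of the pattern, not the code in this PR: SimpleWriter, RollingWriter, and ListWriter are hypothetical names, the Supplier stands in for the role that getNewWriter plays in the builder, and recordsPerFile stands in for the value read from ConfigurationKeys.WRITER_RECORDS_PER_FILE_THRESHOLD.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class RollingWriterSketch {

    /** Hypothetical minimal per-file writer; NOT Gobblin's DataWriter interface. */
    interface SimpleWriter<D> {
        void write(D record);
        void close();
    }

    /** Delegates to a current writer and rolls over to a new one every recordsPerFile records. */
    static class RollingWriter<D> implements SimpleWriter<D> {
        private final long recordsPerFile;                  // assumed to come from configuration
        private final Supplier<SimpleWriter<D>> newWriter;  // plays the role of getNewWriter
        private SimpleWriter<D> current;
        private long recordsInCurrent;
        private int filesOpened;

        RollingWriter(long recordsPerFile, Supplier<SimpleWriter<D>> newWriter) {
            if (recordsPerFile <= 0) {
                throw new IllegalArgumentException("recordsPerFile must be positive");
            }
            this.recordsPerFile = recordsPerFile;
            this.newWriter = newWriter;
        }

        @Override
        public void write(D record) {
            // Close the current file once it has reached the threshold.
            if (current != null && recordsInCurrent >= recordsPerFile) {
                current.close();
                current = null;
            }
            // Open files lazily so that no empty trailing file is created.
            if (current == null) {
                current = newWriter.get();
                filesOpened++;
                recordsInCurrent = 0;
            }
            current.write(record);
            recordsInCurrent++;
        }

        @Override
        public void close() {
            if (current != null) {
                current.close();
                current = null;
            }
        }

        int filesOpened() {
            return filesOpened;
        }
    }

    /** In-memory stand-in for one output file, used only for demonstration. */
    static class ListWriter implements SimpleWriter<String> {
        final List<String> records = new ArrayList<>();
        boolean closed;

        @Override
        public void write(String record) {
            records.add(record);
        }

        @Override
        public void close() {
            closed = true;
        }
    }
}
```

With a threshold of 2, writing five records through RollingWriter would open three underlying writers holding 2, 2, and 1 records. A block-size threshold, as mentioned in the JIRA title, would follow the same shape with a byte counter in place of the record counter.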

Tests

  • [x] org.apache.gobblin.writer.MultipleFilesFsDataWriterTest

Commits

  • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

tilakpatidar, Aug 26 '18 14:08

Any updates on this PR?

tilakpatidar, Dec 04 '18 12:12