Generalize file based source output capabilities

Open prodriguezdefino opened this issue 1 year ago • 21 comments

Currently, when using a file-based source implementation to read data from files, we have two output options:

  • read only the content of each line of each file into a PCollection
  • read the content of a file, key each line with the name of the file it came from, and put the result into a PCollection (through direct usage of the ReadAllViaFileBasedSourceWithFilename PTransform)

While these options cover the bulk of the desired read operations, we can generalize the current implementation to open up a much larger space of possibilities for users.

By introducing a serializable lambda as a parameter to the ReadAllViaFileBasedSource class, together with a data carrier adapter class that aligns the makeOutput method's arguments with those of the provided lambda, we can let users decide how to emit the data read from each file, because more information about the file the data was extracted from becomes accessible to them.
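
To make the idea concrete, here is a minimal plain-Java sketch of the pattern, written without any Beam dependency so it can run on its own. Every name in it (`FileOutputSketch`, `OutputFn`, `ReadRecord`, `readLines`, and the `fileName` / `recordIndex` fields) is hypothetical and chosen for illustration only; it is not the API added by this PR, and the carrier's fields are an assumption about what file information might be exposed. The two existing output options show up as two predefined functions, and a third caller-supplied lambda shows the kind of output the generalization could allow.

```java
import java.io.Serializable;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these names come from the Beam SDK. */
final class FileOutputSketch {

  /** Output function; Serializable so it could be shipped to workers. */
  interface OutputFn<T, OutT> extends Serializable {
    OutT apply(ReadRecord<T> record);
  }

  /** Data carrier: one record plus assumed details of its source file. */
  static final class ReadRecord<T> implements Serializable {
    final String fileName;   // assumed field: file the record was read from
    final long recordIndex;  // assumed field: zero-based position in that file
    final T value;           // the record itself

    ReadRecord(String fileName, long recordIndex, T value) {
      this.fileName = fileName;
      this.recordIndex = recordIndex;
      this.value = value;
    }
  }

  /** Existing option 1: emit only the record content. */
  static <T> OutputFn<T, T> valueOnly() {
    return r -> r.value;
  }

  /** Existing option 2: key each record with its file name. */
  static <T> OutputFn<T, Map.Entry<String, T>> keyedByFileName() {
    return r -> new SimpleImmutableEntry<>(r.fileName, r.value);
  }

  /** Stand-in for the reader: wraps each line in a carrier and applies outputFn. */
  static <OutT> List<OutT> readLines(
      String fileName, List<String> lines, OutputFn<String, OutT> outputFn) {
    List<OutT> out = new ArrayList<>();
    long index = 0;
    for (String line : lines) {
      out.add(outputFn.apply(new ReadRecord<>(fileName, index++, line)));
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("alpha", "beta");

    // The two outputs that exist today, expressed as output functions.
    System.out.println(readLines("logs/a.txt", lines, FileOutputSketch.<String>valueOnly()));
    System.out.println(readLines("logs/a.txt", lines, FileOutputSketch.<String>keyedByFileName()));

    // A caller-defined output that also uses the record's position in the file.
    System.out.println(
        readLines("logs/a.txt", lines, r -> r.fileName + ":" + r.recordIndex + ":" + r.value));
  }
}
```

In this sketch the reader never decides the output shape; it only builds the carrier and delegates to the function, which is the separation the paragraph above describes.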

This change adds this generalization to the ReadAllViaFileBasedSource class, adds a new factory method to the TextIO PTransform (alongside documentation on how to use it), and adds tests that cover the changes.

Subsequent PRs can propagate the functionality to other file-based sources (Avro, Parquet, etc.).


prodriguezdefino avatar Nov 29 '23 23:11 prodriguezdefino

Run Java PreCommit

prodriguezdefino avatar Nov 30 '23 06:11 prodriguezdefino

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions[bot] avatar Nov 30 '23 07:11 github-actions[bot]

assign set of reviewers

prodriguezdefino avatar Nov 30 '23 07:11 prodriguezdefino

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @bvolpato for label java. R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Nov 30 '23 07:11 github-actions[bot]

Run Java_IOs_Direct PreCommit

prodriguezdefino avatar Dec 05 '23 20:12 prodriguezdefino

fixes #29627

prodriguezdefino avatar Dec 06 '23 02:12 prodriguezdefino

Run Java PreCommit

prodriguezdefino avatar Dec 06 '23 04:12 prodriguezdefino

This is a really interesting enhancement, but it goes to the core of a Beam transform. Do you have a design document for this? I want to make sure the community sees the trade-offs here before we go too far into implementation.

johnjcasey avatar Dec 06 '23 16:12 johnjcasey

Reminder, please take a look at this pr: @bvolpato @ahmedabu98

github-actions[bot] avatar Dec 14 '23 12:12 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java. R: @johnjcasey for label io.

github-actions[bot] avatar Dec 18 '23 12:12 github-actions[bot]

Reminder, please take a look at this pr: @kennknowles @johnjcasey

github-actions[bot] avatar Dec 26 '23 12:12 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java. R: @damondouglas for label io.

github-actions[bot] avatar Dec 28 '23 12:12 github-actions[bot]

Reminder, please take a look at this pr: @damondouglas @damondouglas

github-actions[bot] avatar Jan 05 '24 12:01 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @johnjcasey for label io.

github-actions[bot] avatar Jan 09 '24 12:01 github-actions[bot]

Reminder, please take a look at this pr: @robertwb @johnjcasey

github-actions[bot] avatar Jan 16 '24 12:01 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java. R: @ahmedabu98 for label io.

github-actions[bot] avatar Jan 19 '24 12:01 github-actions[bot]

Reminder, please take a look at this pr: @Abacn @ahmedabu98

github-actions[bot] avatar Jan 27 '24 12:01 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java. R: @damondouglas for label io.

github-actions[bot] avatar Jan 31 '24 12:01 github-actions[bot]

Good day, @prodriguezdefino, and thank you for contributing! I will be taking a look at this and have placed it in my review queue.

damondouglas avatar Jan 31 '24 19:01 damondouglas

Reminder, please take a look at this pr: @damondouglas @damondouglas

github-actions[bot] avatar Feb 13 '24 12:02 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @ahmedabu98 for label io.

github-actions[bot] avatar Feb 16 '24 12:02 github-actions[bot]

Reminder, please take a look at this pr: @robertwb @ahmedabu98

github-actions[bot] avatar Feb 24 '24 12:02 github-actions[bot]

This sounds like a good change.

Formally adding R: @damondouglas for follow-up.

robertwb avatar Feb 26 '24 16:02 robertwb

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

github-actions[bot] avatar Feb 26 '24 16:02 github-actions[bot]

Good day, @prodriguezdefino, is this PR still active? Would you be able to investigate the two failing checks? Please let me know if you need anything, and thank you again.

damondouglas avatar Feb 27 '24 17:02 damondouglas

Sorry for the delay, checking on this now.

It seems that "Run Java PreCommit" was cancelled after 3 hours of execution, and "Run Java_GCP_IO_Direct PreCommit" failed on a Spanner-related test.

I'm triggering both again.

prodriguezdefino avatar Feb 27 '24 18:02 prodriguezdefino

Run Java PreCommit

prodriguezdefino avatar Feb 27 '24 18:02 prodriguezdefino

Run Java_GCP_IO_Direct PreCommit

prodriguezdefino avatar Feb 27 '24 18:02 prodriguezdefino

Run Java_IO_Direct PreCommit

prodriguezdefino avatar Feb 27 '24 20:02 prodriguezdefino

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

github-actions[bot] avatar Jun 15 '24 12:06 github-actions[bot]