beam
Generalize file-based source output capabilities
Currently, when using a file-based source implementation to read data from files, we have two output options:
- read only the content of each line of each file into a PCollection
- read the content of a file, key each line with the file name, and put it into a PCollection (through direct usage of the ReadAllViaFileBasedSourceWithFilename PTransform)
While this covers the bulk of the desired read operations, we can generalize the current implementation to give users a much larger space of possibilities.
By introducing a serializable lambda as a parameter of the ReadAllViaFileBasedSource class, together with a data carrier adapter class that simplifies the makeOutput method arguments and aligns them with those of the provided lambda, we can let users decide how to emit the data read from a particular file, because more information about the file the data was extracted from becomes accessible.
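As a rough illustration of the idea, the sketch below models it in plain Java without any Beam dependency. Everything in it is an assumption made for illustration: `SerializableOutputFn`, `ReadRecord`, `GeneralizedReadSketch`, and the line-number field are hypothetical names, not the classes or signatures introduced by this PR. It only shows how a caller-supplied serializable function over a small data carrier can express both existing output shapes and new ones.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the serializable lambda parameter described above.
interface SerializableOutputFn<T> extends Serializable {
  T apply(ReadRecord record);
}

// Hypothetical data carrier: bundles what is known about a line when it is read.
// The line number is an assumed example of "more information" about the file.
final class ReadRecord implements Serializable {
  private final String fileName;
  private final long lineNumber;
  private final String line;

  ReadRecord(String fileName, long lineNumber, String line) {
    this.fileName = fileName;
    this.lineNumber = lineNumber;
    this.line = line;
  }

  String getFileName() { return fileName; }
  long getLineNumber() { return lineNumber; }
  String getLine() { return line; }
}

final class GeneralizedReadSketch {
  // Walks in-memory "files" (name -> lines) and lets the caller decide what to emit.
  static <T> List<T> readAll(
      Map<String, List<String>> files, SerializableOutputFn<T> outputFn) {
    List<T> out = new ArrayList<>();
    for (Map.Entry<String, List<String>> file : files.entrySet()) {
      long lineNumber = 0;
      for (String line : file.getValue()) {
        out.add(outputFn.apply(new ReadRecord(file.getKey(), lineNumber++, line)));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, List<String>> files = new LinkedHashMap<>();
    files.put("a.txt", Arrays.asList("x", "y"));
    files.put("b.txt", Arrays.asList("z"));

    // Existing option 1: emit only the line content.
    System.out.println(readAll(files, r -> r.getLine()));
    // Existing option 2: emit each line keyed by its file name.
    System.out.println(readAll(files, r -> r.getFileName() + "=" + r.getLine()));
    // A new shape enabled by the carrier: file name, line number, and content.
    System.out.println(
        readAll(files, r -> r.getFileName() + ":" + r.getLineNumber() + ":" + r.getLine()));
  }
}
```

In this sketch the two current behaviors become two particular lambdas, which is the sense in which the change generalizes the output; the real transform would additionally have to deal with coders and distributed execution, which are left out here.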
This change adds that generalization to the ReadAllViaFileBasedSource class, adds a new factory method to the TextIO PTransform (alongside documentation on how to use it), and adds tests that exercise the changes.
Subsequent PRs can propagate the functionality to other file-based sources (Avro, Parquet, etc.).
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- [ ] Update CHANGES.md with noteworthy changes.
- [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.
Run Java PreCommit
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers
assign set of reviewers
Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:
R: @bvolpato for label java.
R: @ahmedabu98 for label io.
Available commands:
- stop reviewer notifications - opt out of the automated review tooling
- remind me after tests pass - tag the comment author after tests pass
- waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
The PR bot will only process comments in the main thread (not review comments).
Run Java_IOs_Direct PreCommit
fixes #29627
Run Java PreCommit
This is a really interesting enhancement, but it goes to the core of a Beam transform. Do you have a design document for this? I want to make sure the community sees the trade-offs here before we go too far into implementation.
Reminder, please take a look at this pr: @bvolpato @ahmedabu98
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @kennknowles for label java.
R: @johnjcasey for label io.
Reminder, please take a look at this pr: @kennknowles @johnjcasey
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @damondouglas for label java.
R: @damondouglas for label io.
Reminder, please take a look at this pr: @damondouglas @damondouglas
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @robertwb for label java.
R: @johnjcasey for label io.
Reminder, please take a look at this pr: @robertwb @johnjcasey
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @Abacn for label java.
R: @ahmedabu98 for label io.
Reminder, please take a look at this pr: @Abacn @ahmedabu98
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @damondouglas for label java.
R: @damondouglas for label io.
Good day, @prodriguezdefino and thank you for contributing! I will be taking a look at this and placed it in my review queue.
Reminder, please take a look at this pr: @damondouglas @damondouglas
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @robertwb for label java.
R: @ahmedabu98 for label io.
Reminder, please take a look at this pr: @robertwb @ahmedabu98
This sounds like a good change.
Formally adding R: @damondouglas for follow-up.
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Good day, @prodriguezdefino, is this PR still active? Would you be able to investigate the two failing checks? Please let me know if you need anything, and thank you again.
Sorry for the delay, checking on this now.
It seems that "Run Java PreCommit" was cancelled after 3 hours of execution, and "Run Java_GCP_IO_Direct PreCommit" failed on a Spanner-related test.
I'm triggering both again.
Run Java PreCommit
Run Java_GCP_IO_Direct PreCommit
Run Java_IO_Direct PreCommit
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.