Spark3 structured streaming enable updates
Context
As brought up in issue #2788, when an Iceberg table is read as a Spark streaming DataFrame and the stream reaches an overwrite snapshot, the only two possible actions are to skip the snapshot or to fail. A third option would be to consider only the added files and ignore the deleted files.
Proposal
In this PR I propose a new Spark read option:
streaming-overwrite-snapshots-read-mode
with three possible values: SKIP, BREAK, ADDED_FILES_ONLY
to substitute the already existing
streaming-skip-overwrite-snapshots (true|false)
The new ADDED_FILES_ONLY value would read only the files added by an overwrite snapshot and ignore the files it deleted.
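To make the intended semantics of the three values concrete, here is a minimal sketch in plain Java. It is not the code in this PR: the class `OverwriteReadModeSketch`, the enum `ReadMode`, the method `filesToRead`, and the exception message are hypothetical names chosen for illustration, and files are represented as plain strings instead of Iceberg data files.

```java
import java.util.List;

public class OverwriteReadModeSketch {
    // Hypothetical enum mirroring the three proposed option values.
    enum ReadMode { SKIP, BREAK, ADDED_FILES_ONLY }

    // Returns the files a streaming read would pick up from one overwrite snapshot.
    // deletedFiles is accepted only to show that no mode ever reads it.
    static List<String> filesToRead(ReadMode mode, List<String> addedFiles, List<String> deletedFiles) {
        switch (mode) {
            case SKIP:
                // Ignore the overwrite snapshot entirely.
                return List.of();
            case BREAK:
                // Fail the stream when an overwrite snapshot is encountered.
                throw new IllegalStateException("Cannot process overwrite snapshot");
            case ADDED_FILES_ONLY:
                // Read what the snapshot added; deleted files are not considered.
                return List.copyOf(addedFiles);
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

With this sketch, an overwrite snapshot that adds `new-1.parquet` and deletes `old-1.parquet` yields only `new-1.parquet` under ADDED_FILES_ONLY, nothing under SKIP, and an exception under BREAK.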
Notes
- The old conf streaming-skip-overwrite-snapshots has been kept and is integrated with the new one (the new one takes precedence).
- Some fixes have been applied to the unit tests to make them work on Windows. I could revert those changes and address them in another PR if needed.
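The precedence rule between the two options could be sketched roughly as follows. This is an illustration in plain Java, not the code in this PR: `ReadModeResolutionSketch`, `ReadMode`, and `resolve` are hypothetical names, and the mapping of the old flag (true to SKIP, false or unset to BREAK) is an assumption based on the option names.

```java
import java.util.Locale;
import java.util.Map;

public class ReadModeResolutionSketch {
    // Hypothetical enum mirroring the three proposed option values.
    enum ReadMode { SKIP, BREAK, ADDED_FILES_ONLY }

    static final String NEW_OPTION = "streaming-overwrite-snapshots-read-mode";
    static final String OLD_OPTION = "streaming-skip-overwrite-snapshots";

    // Resolves the effective mode from the read options.
    static ReadMode resolve(Map<String, String> options) {
        String mode = options.get(NEW_OPTION);
        if (mode != null) {
            // The new option wins whenever it is set.
            return ReadMode.valueOf(mode.trim().toUpperCase(Locale.ROOT));
        }
        // Assumption: the old flag maps true -> SKIP and false/unset -> BREAK.
        boolean skip = Boolean.parseBoolean(options.get(OLD_OPTION));
        return skip ? ReadMode.SKIP : ReadMode.BREAK;
    }
}
```

Under these assumptions, setting both `streaming-skip-overwrite-snapshots=true` and `streaming-overwrite-snapshots-read-mode=ADDED_FILES_ONLY` resolves to ADDED_FILES_ONLY, while setting only the old flag keeps its previous meaning.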
@SreeramGarlapati @rdblue @davseitsev could any of you have a look at this, or point me to someone who will? Thank you very much.
Hi @SreeramGarlapati, this PR addresses your issue https://github.com/apache/iceberg/issues/2788. Could you or @cwsteinbach @RussellSpitzer @kbendick @rdblue have a look at it? Thank you very much.
Any update on this? Is there something blocking the review, or is it a matter of capacity/priority of the maintainers?
@karim-ramadan IIUC, your PR would allow users to stream the rows inserted/updated by a MERGE INTO command to downstream consumers.
@karim-ramadan Also, if I understood correctly, this would also stream unmodified rows to the destination.
Hi @jhchee, yes, but only for V1 tables. V2 tables would stream only modified rows. I can add a configuration to differentiate between the two behaviours by also checking the version of the table, if you think it is needed.
I've also rebased on top of master and hopefully fixed the problems encountered in the first run of the CI. Could you run it again, please?
Unfortunately, I'm not a committer of this project. However, I'm bringing awareness in the Iceberg Slack channel so someone might look into this.
Hi @jhchee any news on this?
Hi @karim-ramadan, I didn't get a reply on this. I was told that snapshot-level CDC will eventually unblock overwrite streaming, but I have a limited understanding of this. Ref: https://github.com/apache/iceberg/issues/3941#issuecomment-1531522049
I think this is a good idea. I'll put it in my queue to review.
Hi @rdblue, any news on this?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.