Spark3 structured streaming enable updates

Open karim-ramadan opened this issue 2 years ago • 12 comments

Context

As brought up in issue #2788, when an Iceberg table is read as a Spark streaming DataFrame and the stream reaches an overwrite snapshot, the only two possible actions are to skip the snapshot or to fail. A third option would be to consider only the added files and ignore the deleted files.

Proposal

In this PR I propose a new Spark read option, streaming-overwrite-snapshots-read-mode, with three possible values (SKIP, BREAK, ADDED_FILES_ONLY), to replace the existing streaming-skip-overwrite-snapshots (true|false) option.

The new ADDED_FILES_ONLY value would make the stream consider only the files added by an overwrite snapshot.
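To make the three values concrete, here is a minimal, self-contained sketch of how they could behave. It is a plain-Python model, not the Iceberg or Spark implementation: the `Snapshot` class, the `files_to_stream` function and the file names are hypothetical, and it assumes that BREAK corresponds to the existing "fail" behaviour.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class OverwriteReadMode(Enum):
    """The three values proposed for streaming-overwrite-snapshots-read-mode."""
    SKIP = "SKIP"
    BREAK = "BREAK"
    ADDED_FILES_ONLY = "ADDED_FILES_ONLY"


@dataclass
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not an Iceberg class)."""
    operation: str  # "append" or "overwrite"
    added_files: List[str] = field(default_factory=list)
    deleted_files: List[str] = field(default_factory=list)


def files_to_stream(snapshot: Snapshot, mode: OverwriteReadMode) -> List[str]:
    """Return the data files a streaming read would emit for one snapshot."""
    if snapshot.operation != "overwrite":
        # Non-overwrite snapshots are not affected by the option in this model.
        return list(snapshot.added_files)
    if mode is OverwriteReadMode.SKIP:
        # Ignore the overwrite snapshot entirely.
        return []
    if mode is OverwriteReadMode.BREAK:
        # Assumption: BREAK means the stream fails on an overwrite snapshot.
        raise RuntimeError("cannot stream from an overwrite snapshot")
    # ADDED_FILES_ONLY: emit the added files, ignore the deleted ones.
    return list(snapshot.added_files)
```

For an overwrite snapshot that deletes `a.parquet` and adds `b.parquet`, this model emits nothing under SKIP, raises under BREAK, and emits only `b.parquet` under ADDED_FILES_ONLY.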

Notes

  • The old conf streaming-skip-overwrite-snapshots has been kept and integrated with the new one (the new one takes precedence).
  • Some unit tests have been fixed so that they work on Windows. I could revert those changes and address them in another PR if needed.
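The precedence rule in the first note can be sketched as follows. This is only an illustration in plain Python, not the code in the PR: the `resolve_read_mode` function is hypothetical, and it assumes that the legacy value `true` maps to SKIP and `false` (or an unset option) maps to BREAK.

```python
from typing import Mapping

NEW_OPTION = "streaming-overwrite-snapshots-read-mode"
LEGACY_OPTION = "streaming-skip-overwrite-snapshots"
VALID_MODES = {"SKIP", "BREAK", "ADDED_FILES_ONLY"}


def resolve_read_mode(options: Mapping[str, str]) -> str:
    """Pick the effective mode; the new option wins over the legacy one."""
    new_value = options.get(NEW_OPTION)
    if new_value is not None:
        mode = new_value.upper()
        if mode not in VALID_MODES:
            raise ValueError(f"unsupported value for {NEW_OPTION}: {new_value}")
        return mode
    # Assumption: legacy true means SKIP, legacy false or unset means BREAK.
    legacy_value = options.get(LEGACY_OPTION, "false")
    return "SKIP" if legacy_value.lower() == "true" else "BREAK"
```

With both options set, for example the legacy option set to `true` and the new one set to `ADDED_FILES_ONLY`, the sketch returns `ADDED_FILES_ONLY`, since the new option takes precedence.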

karim-ramadan avatar Apr 07 '23 14:04 karim-ramadan

@SreeramGarlapati @rdblue @davseitsev could anyone have a look at this, or point to someone who will? Thank you very much.

tmnd1991 avatar Apr 12 '23 14:04 tmnd1991

Hi @SreeramGarlapati, this PR addresses your issue https://github.com/apache/iceberg/issues/2788. Could you or @cwsteinbach @RussellSpitzer @kbendick @rdblue have a look at it? Thank you very much.

karim-ramadan avatar Apr 17 '23 15:04 karim-ramadan

Any update on this? Is there something blocking the review, or is it a matter of capacity/priority of the maintainers?

tmnd1991 avatar Apr 26 '23 08:04 tmnd1991

@karim-ramadan IIUC, your PR would allow users to stream the rows inserted or updated by a MERGE INTO command to downstream consumers.

jhchee avatar May 02 '23 15:05 jhchee

@karim-ramadan Also, if I understood correctly, this would also stream unmodified rows to the destination.

jhchee avatar May 03 '23 03:05 jhchee

Hi @jhchee, yes, but only for V1 tables; V2 tables would stream only modified rows. I can add a configuration to differentiate between the two behaviours by also checking the version of the table, if you think it is needed.

I've also rebased on top of master and hopefully fixed the problems encountered in the first run of the CI. Could you run it again, please?

karim-ramadan avatar May 03 '23 07:05 karim-ramadan

Unfortunately, I'm not a committer of this project. However, I'm bringing awareness in the Iceberg Slack channel so someone might look into this.

jhchee avatar May 03 '23 08:05 jhchee

Hi @jhchee any news on this?

karim-ramadan avatar May 10 '23 08:05 karim-ramadan

Hi @karim-ramadan, I didn't get a reply on this. I was told that snapshot-level CDC will eventually unblock overwrite streaming, but I have limited understanding of this. Ref: https://github.com/apache/iceberg/issues/3941#issuecomment-1531522049

jhchee avatar May 10 '23 08:05 jhchee

I think this is a good idea. I'll put it in my queue to review.

rdblue avatar Jul 13 '23 21:07 rdblue

Hi @rdblue, any news on this?

karim-ramadan avatar Sep 18 '23 12:09 karim-ramadan

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the Iceberg dev mailing list. Thank you for your contributions.

github-actions[bot] avatar Aug 28 '24 00:08 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Sep 05 '24 00:09 github-actions[bot]