beam icon indicating copy to clipboard operation
beam copied to clipboard

Reapply "Add Redistribute transform to Java SDK"

Open kennknowles opened this issue 1 year ago • 1 comments

This rolls forward the Java SDK reference implementation of Redistribute. This does not include runner translations.

  • In the prior attempt at this change: many runners failed the Redistribute tests until a custom translation was inserted. I addressed it by adding custom translations that roughly matched Reshuffle.
  • In this attempt, I will focus first on getting all runners passing the tests as a composite, or sickbaying if the runner lacks core capabilities required by the tests. For example it looks like some / many runners may lose windowing metadata. I will add custom runner translations in later iterations.

This reverts commit f498cdfc2a84a49f0d5812e7dd0e55e08b214e4a.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [x] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • [x] Update CHANGES.md with noteworthy changes.
  • [x] If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels Python tests Java tests Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

kennknowles avatar May 02 '24 00:05 kennknowles

Notes on failures:

Error received from SDK harness for instruction 3: org.apache.beam.sdk.util.UserCodeException: java.lang.ClassCastException: org.apache.beam.sdk.util.WindowedValue$ValueInGlobalWindow cannot be cast to java.lang.String

I think we can safely drop the Java11 and JavaVersions variations.

kennknowles avatar May 02 '24 13:05 kennknowles

So in conclusion, my analysis is this:

Non-portable:

  • DirectRunner works
  • Dataflow works
  • Samza works
  • Flink works
  • Spark StructuredStreaming works
  • Spark batch has some issue where its shuffle drop panes. Out of scope

Portable:

  • Dataflow works
  • ULR, Flink, Spark, Samza all share the same issue that is likely a bug in the portable translation (greedy fuser?)

Based on this analysis I believe the prior approach of sickbaying the failing tests is valid. There does not seem to be a bug in the Redistribute transform. All runners should correctly run it as a pure composite, and adding translations to the runners for efficiency is a bonus. I will follow up with rolling forward those changes too.

kennknowles avatar May 09 '24 15:05 kennknowles

R: @Abacn

kennknowles avatar May 10 '24 20:05 kennknowles

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

github-actions[bot] avatar May 10 '24 20:05 github-actions[bot]

For the reviewer, summary so far:

  • Roll forward: I am rolling forward just the composite transform and tests, which are pretty trivial, also the same as Reshuffle, and already work correctly on enough runners I believe they are correct.
    • Since the composite should work, we don't need the runner implementations to move forward. In fact, the runner translations were masking bugs I think. I will separately roll forward runner specializations.
  • Filtered tests: Last time I missed tests that were (already) broken, so do you think I missed any this time?
    • When I roll forward translations one runner at a time, I can re-enable anything fixed by them.

kennknowles avatar May 10 '24 20:05 kennknowles

beam_PostCommit_Java_PVR_Spark_Batch needs to exclude from the test added: https://github.com/apache/beam/actions/workflows/beam_PostCommit_Java_PVR_Spark_Batch.yml?query=event%3Aschedule

Abacn avatar May 14 '24 15:05 Abacn

Gya I missed one ;_;

kennknowles avatar May 14 '24 15:05 kennknowles

Yeah because the naming of this test is "beam_PostCommit_Java_PVR_Spark_Batch" not "beam_PostCommit_Java_PVR_Spark3_Batch" so this test wasn't run on this PR

Abacn avatar May 14 '24 16:05 Abacn