
Run flaky test on prism

damccorm opened this issue 6 months ago

This test seems to consistently flake when running against prism in https://github.com/apache/beam/pull/34612 - seeing if it will flake here with just this test on prism.

I've tried reproducing this locally and have failed.


damccorm avatar Jun 12 '25 20:06 damccorm

Ok - this PR confirms what I've been seeing on a larger scale in https://github.com/apache/beam/pull/34612. It seems like Prism does not consistently fail jobs when they have errors. It does fail them most of the time (e.g. this test passed in 3 Precommit Python runs and in some other suites that run it), but occasionally the job succeeds when it should fail. When you look at the stderr, you can see that the error was indeed raised, so raising the error is not the problem. There may be something else about this particular pipeline shape, but I've seen this on a bunch of different pipelines in https://github.com/apache/beam/pull/34612, so it is not just that. I have not been able to repro this locally yet.

@lostluck in case you have any ideas or code pointers since I've been banging my head against the wall a bit.

Failing job is here for posterity - https://github.com/apache/beam/actions/runs/15635781280/job/44050389378?pr=35263 - I'll try to see if I can get more debugging info out of this.

damccorm avatar Jun 13 '25 17:06 damccorm

Not offhand. I see the error, and I see a printout. But nothing from Prism (which is notionally intentional, but could also be a setup problem on the Python side not capturing additional output from prism).

If the flake is reproducible, I recommend additional logging to narrow down where the disconnect is happening. The flow should basically be: Python-side error -> Prism fails the pipeline and returns an error to the JobManagement server -> the JobManagement server reports that state to the Python process -> ultimately an error or exception is raised that can be caught in tests.

But the flow could also be "Python-side error -> caught immediately by the test because it happens in the local process" (AKA the danger of direct runners).

You can add a log_level flag set to debug here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/prism_runner.py which will make prism's output much more verbose. It will be quite noisy, though.

I'd recommend just adding an info/error log here: https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/execute.go#L84
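
For illustration, here's a minimal Go sketch of that kind of log line. The function name and error message are made up for the example; prism's actual code at that location differs.

```go
package main

import (
	"errors"
	"log/slog"
)

// executeStage stands in for the stage-execution call discussed above; the
// name and the error are illustrative only, not prism's actual API.
func executeStage() error {
	return errors.New("bundle failed: user DoFn raised an exception")
}

func main() {
	// Log the failure at the point it is first observed, so it is visible in
	// prism's output even if the error gets lost on the way back to Python.
	if err := executeStage(); err != nil {
		slog.Error("stage execution failed", "error", err)
	}
}
```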

But the Prism side is already logging an error for a failed job anyway, and it's not clear why that's not showing up anywhere. It's possibly not getting captured on the Python side for some reason.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L223 is where the portable runner handles job "state" events, which might also be worth making a bit more verbose.

lostluck avatar Jun 13 '25 20:06 lostluck

Basically "errors and exceptions" as output from tests are tricky since type information is guaranteed to be lost. It's interpreting magic strings all the way down across the portable boundary. That's where I'd assume the issue is. (But again, it should be confirmed that Prism is indicating to Python that the job is failing first, as that's an assumption that needs to be checked in this case)

lostluck avatar Jun 13 '25 20:06 lostluck

I inserted a panic here (added commit demonstrating that) and all jobs correctly failed at that point - https://github.com/apache/beam/actions/runs/15684285429/job/44183364940?pr=35263

So prism is at least correctly seeing that we should fail the bundle.

damccorm avatar Jun 16 '25 15:06 damccorm

OK, so that path should trigger the ElementManager to fail when stage.Execute returns here:

https://github.com/apache/beam/blob/2a499b6b24180adb022d810cd9bccf2f271fe514/sdks/go/pkg/beam/runners/prism/internal/execute.go#L366

That should cause the bundle channel to ultimately close, and then the wait on the error group should receive the failure here:

https://github.com/apache/beam/blob/2a499b6b24180adb022d810cd9bccf2f271fe514/sdks/go/pkg/beam/runners/prism/internal/execute.go#L359

But that's not what fail bundle does; it clears out the pending execution state from the manager so that the job can fail naturally. (This is intentional, FYI, since the ElementManager is about tracking and dispatching pending work, and in principle we might want to add retries in the future.)

So I think there's probably a race condition here which sometimes swallows the failures, even though I don't yet see how that's happening.

I'd put an atomic value ahead of the execute call, which we also set in that bundle-failure case, but only for the first failure received. Then, if and only if the error cause from the context or the error from the errgroup is nil, we check that value and return any non-nil error.
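
A minimal Go sketch of that idea, assuming (per the conclusion below) that a recorded SDK-side bundle failure should win over whatever the errgroup or context surfaces. All names are illustrative; this is a sketch of the approach, not prism's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"

	"golang.org/x/sync/errgroup"
)

// firstBundleErr records only the first bundle failure observed, independent
// of how the errgroup/context machinery later shuts down.
var firstBundleErr atomic.Pointer[error]

func recordBundleFailure(err error) {
	// Only the first failure wins; later failures (including context-cancel
	// fallout) are ignored.
	firstBundleErr.CompareAndSwap(nil, &err)
}

func runJob(ctx context.Context) error {
	eg, ctx := errgroup.WithContext(ctx)
	eg.Go(func() error {
		// Simulate an SDK-side bundle failure: record the real cause, then
		// let the surrounding machinery tear down however it likes.
		err := errors.New("bundle failed: user DoFn raised an exception")
		recordBundleFailure(err)
		return err
	})

	egErr := eg.Wait()

	// Prefer the recorded bundle failure over whatever the errgroup or the
	// context surfaced (which may be a bare "context canceled").
	if p := firstBundleErr.Load(); p != nil {
		return *p
	}
	if egErr != nil {
		return egErr
	}
	return context.Cause(ctx)
}

func main() {
	fmt.Println(runJob(context.Background()))
}
```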


The context-cancel path is likely only happening when the SDK-side environment crashes wholesale, which should itself produce an error. It's possible that the "race" here is:

  1. The SDK-side bundle fails.
  2. The SDK environment (AKA the loopback FnAPI connection) crashes, causing the context to be canceled on the Prism side.
  3. The bundle execute loop responds to the canceled environment -> the job failure message doesn't mention the failed bundle.

While the intended/desired path is

  1. SDK side bundle fails.
  2. Prism registers the failure reason.
  3. Prism returns that error to the SDK pipeline host.

I don't think it's contentious to assert that we should prioritize the first SDK-side bundle failure error over any environmental / context-cancel bundle errors.
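
A toy demonstration of that race (again, not prism code): if the environment's context cancellation wins, the job only ever sees the cancellation, not the real cause.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	// Sketch of the suspected race: the loopback environment crashes and the
	// context is canceled before the bundle's own failure is consulted, so
	// the reported job failure carries no trace of the real error.
	ctx, cancel := context.WithCancel(context.Background())
	bundleErr := errors.New("bundle failed: user DoFn raised an exception")

	// The environment teardown wins the race...
	cancel()

	select {
	case <-ctx.Done():
		// ...so all the job reports is the cancellation.
		fmt.Println("job failed with:", ctx.Err()) // "context canceled"
	default:
		fmt.Println("job failed with:", bundleErr)
	}
}
```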

lostluck avatar Jun 16 '25 16:06 lostluck

The underlying issue has been fixed by https://github.com/apache/beam/pull/35337

damccorm avatar Jun 17 '25 21:06 damccorm