[Failing Test]: JmsIOTest.testCheckpointMark flaky
What happened?
Example run: https://github.com/apache/beam/runs/21250090063 (a PR unrelated to JMS):
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMark(JmsIOTest.java:463)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
It fails here: https://github.com/apache/beam/blob/d5aa44c9ba9eb910774d789dd4182a5d25d8f552/sdks/java/io/jms/src/test/java/org/apache/beam/sdk/io/jms/JmsIOTest.java#L463
In fact, the consumer.receiveNoWait call at https://github.com/apache/beam/blob/d5aa44c9ba9eb910774d789dd4182a5d25d8f552/sdks/java/io/jms/src/main/java/org/apache/beam/sdk/io/jms/JmsIO.java#L559
does not guarantee that a message is returned, even when unacknowledged messages remain on the server: https://stackoverflow.com/questions/36626634/does-jms-receivenowait-guarantee-message-delivery-when-messages-are-available
So there is a chance that the call returns null and the assertion fails.
Issue Failure
Failure: Test is flaky
Issue Priority
Priority: 2 (backlog / disabled test but we think the product is healthy)
Issue Components
- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
The underlying cause is that, per the JMS specification, the receiveNoWait call here: https://github.com/apache/beam/blob/27f1c0774fd93e846de9a8b668e6effc5a41eb10/sdks/java/io/jms/src/main/java/org/apache/beam/sdk/io/jms/JmsIO.java#L559
is not guaranteed to return a non-null value even when records are pending on the server side. This can also affect the integration tests for the same reason, as we see that not all records are read within the timeout.
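One way a test could tolerate this behavior is to retry the non-blocking receive until it yields a value or a deadline passes, instead of asserting on a single call. The following is a minimal sketch of that idea, not the fix adopted in Beam. The class `PollUntilAvailable` and the helper `pollUntilNonNull` are hypothetical names, and a plain `Supplier` stands in for `consumer.receiveNoWait()` so that the example runs without a JMS broker.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class PollUntilAvailable {

  /**
   * Hypothetical helper: repeatedly invokes a non-blocking receive until it
   * returns a non-null value or the timeout elapses. Returns null on timeout
   * or if the calling thread is interrupted while waiting.
   */
  static <T> T pollUntilNonNull(Supplier<T> receiveNoWait, Duration timeout, Duration interval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      T value = receiveNoWait.get();
      if (value != null) {
        return value;
      }
      if (System.nanoTime() - deadline >= 0) {
        return null; // gave up: nothing arrived before the deadline
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        return null;
      }
    }
  }

  public static void main(String[] args) {
    // Assumption: this supplier simulates a consumer whose first two
    // non-blocking receives return null although a message is pending.
    AtomicInteger calls = new AtomicInteger();
    Supplier<String> flakyReceive = () -> calls.incrementAndGet() < 3 ? null : "msg";

    String message = pollUntilNonNull(flakyReceive, Duration.ofSeconds(5), Duration.ofMillis(10));
    System.out.println(message + " after " + calls.get() + " calls"); // prints "msg after 3 calls"
  }
}
```

In a real test the supplier would wrap the reader or consumer call, and the timeout bounds how long the test waits before failing, so a broker that never delivers still produces a failure rather than a hang.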