opensearch-migrations icon indicating copy to clipboard operation
opensearch-migrations copied to clipboard

[BUG]Replayer thread hanging while replay from inputStream

Open ParvelAWS opened this issue 1 year ago • 4 comments

What is the bug?

Replayer is correctly consuming an input file stream consists of protobuf messages. Given parameters: --insecure --sigv4-auth-header-service-region "es,us-east-1" -i /tmp/test.pb https://some-os-domain

After played half of the messages in file, the thread hang with: 2024-03-19 21:31:40,280: 200, 200, -19158, 23, 72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37.2 [INFO ] 2024-03-19 21:31:40,280 [targetConnectionPool-1-2] TransactionSummaryLogger - 200, 200, -19158, 23, 72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37.2 [WARN ] 2024-03-19 21:31:40,289 [targetConnectionPool-1-2] RequestSenderOrchestrator - Scheduled future did not successfully run IndexedChannelInteraction(channelKey=1cc9f27e-0261-44ec-ad0b-437004851862.72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37.1, index=2) java.lang.IllegalStateException: stream has already been operated upon or closed at java.base/java.util.stream.AbstractPipeline.spliterator(AbstractPipeline.java:346) ~[?:?] at java.base/java.util.stream.ReferencePipeline.iterator(ReferencePipeline.java:143) ~[?:?] at org.opensearch.migrations.replay.RequestSenderOrchestrator.lambda$scheduleSendOnConnectionReplaySession$40(RequestSenderOrchestrator.java:220) ~[trafficReplayer.jar:?] at org.opensearch.migrations.replay.RequestSenderOrchestrator.lambda$scheduleSendOnConnectionReplaySession$41(RequestSenderOrchestrator.java:228) ~[trafficReplayer.jar:?] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566) [netty-transport-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.100.Final.jar:4.1.100.Final] at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

How can one reproduce the bug?

Replay from input file attached. test.pb.zip

What is the expected behavior?

The replayer finished playing all messages in file and exit.

What is your host/environment?

EKS Pod, JDK 17 with AWS Linux 2023

Do you have any screenshots?

No

Do you have any additional context?

Add any other context about the problem.

ParvelAWS avatar Mar 19 '24 21:03 ParvelAWS

Could you please let me know what release/commit you were running when you experienced this?

gregschohn avatar Mar 28 '24 20:03 gregschohn

While I don't see the same stream re-use exception on mainline right now, I do notice that there's pending outstanding work for 72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37. The timestamps on the observations of that stream aren't chronological, which causes an exception to fire if -ea is being used and will cause other strange issues when it isn't. Here's the summary of the TrafficStream, notice that these observations are all from one record. My guess is that the clock on the computer may have been adjusted backward, creating the anomaly.

TrafficStream Summary: {72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37 (#1)[
2024-03-19T15:49:24.003514406Z: WRITE HTT, 
2024-03-19T15:49:53.919982284Z: READ POS, 
2024-03-19T15:49:53.919982284Z: EOM, 
2024-03-19T15:49:56.407839481Z: READ POS, 
2024-03-19T15:49:56.407839481Z: EOM, 
2024-03-19T15:49:37.248713992Z: WRITE HTT, 
2024-03-19T15:49:37.248951099Z: WRITE 0, 
2024-03-19T15:49:57.257858646Z: READ POS, 
2024-03-19T15:49:57.257858646Z: EOM, 
2024-03-19T15:48:58.492549519Z: READ GET, 
2024-03-19T15:48:58.492549519Z: EOM, 
2024-03-19T15:48:58.795272298Z: WRITE HTT, 
2024-03-19T15:48:58.795332286Z: WRITE 0, 
2024-03-19T15:49:58.797475961Z: CLOSE]}

gregschohn avatar Mar 28 '24 21:03 gregschohn

Sorry the commit i was using is a bit out-of-sync. commit 63ccf853ed2d16f251c78d9a8897c0907477eaca

I cannot reproduce this anymore. It may be related to the target domain availability as well.

Could you please let me know what release/commit you were running when you experienced this?

ParvelAWS avatar Mar 28 '24 21:03 ParvelAWS

I've encountered this

[WARN ] 2024-04-01 20:29:40,756 [targetConnectionPool-3-1] RequestSenderOrchestrator - Scheduled future did not successfully run IndexedChannelInteraction(channelKey=d9f9d3ac-a02c-46aa-b842-f67320505052.027e57fffee80927-0000000a-00001185-649b0a3562694922-98a320e2.1143\|partition=0\|offset=295150, index=7423)
--
java.lang.IllegalStateException: stream has already been operated upon or closed
at java.base/java.util.stream.AbstractPipeline.spliterator(Unknown Source) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.iterator(Unknown Source) ~[?:?]
at org.opensearch.migrations.replay.RequestSenderOrchestrator.lambda$scheduleSendRequestOnConnectionReplaySession$17(RequestSenderOrchestrator.java:149) ~[trafficReplayer.jar:?]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]

AndreKurait avatar Apr 02 '24 16:04 AndreKurait

scheduleSendOnConnectionReplaySession no longer takes a Stream in. To support retries, it needed the ability to pull the contents of a request repeatedly, making a one-use Stream invalid. Newer versions of this code (post 3cc34a7f0e66da9e824242eb32f143122d45da47) won't be able to exhibit this issue.

gregschohn avatar Oct 16 '24 15:10 gregschohn