ThreadPerDriverTaskExecutor causes workers to get stuck
After upgrading Trino from version 409 to 444 we started to experience issues with stuck workers. They refuse HTTP requests or the requests hang forever, and the workers disappear from the discovery service.
Logs are full of messages like:
2024-04-09T13:31:03.821Z WARN node-state-poller-0 io.trino.metadata.RemoteNodeState Node state update request to http://10.42.110.232:8181/v1/info/state has not returned in 35.08s
...
2024-04-09T13:33:38.872Z ERROR http-worker-529 io.trino.server.InternalAuthenticationManager Internal authentication failed
io.jsonwebtoken.ExpiredJwtException: JWT expired 12872 milliseconds ago at 2024-04-09T13:33:26.000Z. Current time: 2024-04-09T13:33:38.872Z. Allowed clock skew: 0 milliseconds.
at io.jsonwebtoken.impl.DefaultJwtParser.parse(DefaultJwtParser.java:682)
Here 10.42.110.232 is the IP of the problematic worker.
The number of open file descriptors on the problematic worker increases dramatically.
Sometimes we reach the open sockets limit and get a lot of exceptions in the logs like this:
2024-04-09T09:37:55.827Z WARN http-worker-124-acceptor-0@4e826fd4-http@44e08a7a{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:8181} org.eclipse.jetty.server.AbstractConnector Accept Failure
java.io.IOException: Too many open files
at java.base/sun.nio.ch.Net.accept(Native Method)
at java.base/sun.nio.ch.ServerSocketChannelImpl.implAccept(ServerSocketChannelImpl.java:433)
at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:399)
at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:397)
at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:698)
at org.eclipse.jetty.util.thread.MonitoredQueuedThreadPool$1.run(MonitoredQueuedThreadPool.java:73)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:971)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1201)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1156)
at java.base/java.lang.Thread.run(Thread.java:1583)
But it doesn't always happen. On Trino 409 we had a 128K open file descriptor limit on workers. When we hit this issue we increased the limit to 256K, but it didn't help; the limit is exceeded very quickly.
The number of threads also starts to increase.
A thread dump shows that a single random thread blocks many other threads.
Example 1:
Blocking thread stack trace (full thread dump: trino-worker-D0409-T0940-25683.tdump.txt):
Task-20240409_093501_07390_nd69d.3.16.0-5123
Stack Trace is:
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.start0([email protected]/Native Method)
at java.lang.Thread.start([email protected]/Thread.java:1553)
- locked <0x00007f6d32039e00> (a java.lang.Thread)
at java.lang.System$2.start([email protected]/System.java:2577)
at jdk.internal.vm.SharedThreadContainer.start([email protected]/SharedThreadContainer.java:152)
at java.util.concurrent.ThreadPoolExecutor.addWorker([email protected]/ThreadPoolExecutor.java:953)
at java.util.concurrent.ThreadPoolExecutor.execute([email protected]/ThreadPoolExecutor.java:1375)
at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:640)
at java.util.concurrent.AbstractExecutorService.submit([email protected]/AbstractExecutorService.java:134)
at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:73)
at io.trino.execution.executor.scheduler.FairScheduler.submit(FairScheduler.java:161)
- locked <0x00007f57374acb88> (a io.trino.execution.executor.scheduler.FairScheduler)
at io.trino.execution.executor.dedicated.TaskEntry.runSplit(TaskEntry.java:136)
- locked <0x00007f6d32000000> (a io.trino.execution.executor.dedicated.TaskEntry)
at io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor.enqueueSplits(ThreadPerDriverTaskExecutor.java:157)
- locked <0x00007f57373c35b0> (a io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor)
at io.trino.execution.SqlTaskExecution.enqueueDriverSplitRunner(SqlTaskExecution.java:400)
- locked <0x00007f6285ff4580> (a io.trino.execution.SqlTaskExecution)
at io.trino.execution.SqlTaskExecution.scheduleDriversForTaskLifeCycle(SqlTaskExecution.java:389)
at io.trino.execution.SqlTaskExecution.start(SqlTaskExecution.java:205)
- locked <0x00007f6285ff4580> (a io.trino.execution.SqlTaskExecution)
at io.trino.execution.SqlTask.tryCreateSqlTaskExecution(SqlTask.java:561)
- locked <0x00007f5eb809b388> (a java.lang.Object)
at io.trino.execution.SqlTask.updateTask(SqlTask.java:503)
at io.trino.execution.SqlTaskManager.doUpdateTask(SqlTaskManager.java:555)
at io.trino.execution.SqlTaskManager.lambda$updateTask$9(SqlTaskManager.java:489)
at io.trino.execution.SqlTaskManager$$Lambda/0x00007f566ab68000.call(Unknown Source)
at io.trino.$gen.Trino_444____20240409_090553_2.call(Unknown Source)
at io.trino.execution.SqlTaskManager.updateTask(SqlTaskManager.java:489)
at io.trino.server.TaskResource.createOrUpdateTask(TaskResource.java:153)
...
Example 2:
Blocking thread stack trace (full thread dump: trino-worker-D0409-T0949-25683.tdump.txt):
SplitRunner-24734
Stack Trace is:
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.start0([email protected]/Native Method)
at java.lang.Thread.start([email protected]/Thread.java:1553)
- locked <0x00007f6db5a217a8> (a java.lang.Thread)
at java.lang.System$2.start([email protected]/System.java:2577)
at jdk.internal.vm.SharedThreadContainer.start([email protected]/SharedThreadContainer.java:152)
at java.util.concurrent.ThreadPoolExecutor.addWorker([email protected]/ThreadPoolExecutor.java:953)
at java.util.concurrent.ThreadPoolExecutor.execute([email protected]/ThreadPoolExecutor.java:1375)
at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:640)
at java.util.concurrent.AbstractExecutorService.submit([email protected]/AbstractExecutorService.java:134)
at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:73)
at io.trino.execution.executor.scheduler.FairScheduler.submit(FairScheduler.java:161)
- locked <0x00007f57374acb88> (a io.trino.execution.executor.scheduler.FairScheduler)
at io.trino.execution.executor.dedicated.TaskEntry.runSplit(TaskEntry.java:136)
- locked <0x00007f5cbd1881d8> (a io.trino.execution.executor.dedicated.TaskEntry)
at io.trino.execution.executor.dedicated.TaskEntry.dequeueAndRunLeafSplit(TaskEntry.java:116)
- locked <0x00007f5cbd1881d8> (a io.trino.execution.executor.dedicated.TaskEntry)
at io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor.scheduleLeafSplit(ThreadPerDriverTaskExecutor.java:170)
at io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor.scheduleMoreLeafSplits(ThreadPerDriverTaskExecutor.java:190)
- locked <0x00007f57373c35b0> (a io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor)
at io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor.leafSplitDone(ThreadPerDriverTaskExecutor.java:181)
- locked <0x00007f57373c35b0> (a io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor)
at io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor$$Lambda/0x00007f566abdfd50.run(Unknown Source)
at io.trino.execution.executor.dedicated.TaskEntry.lambda$dequeueAndRunLeafSplit$0(TaskEntry.java:119)
at io.trino.execution.executor.dedicated.TaskEntry$$Lambda/0x00007f566abe2ad0.run(Unknown Source)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
...
We have changed the value of experimental.thread-per-driver-scheduler-enabled to false and the issue no longer reproduces; our cluster looks stable.
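For anyone landing here, a minimal sketch of the workaround, assuming the standard etc/config.properties layout (the property is defined in TaskManagerConfig, linked below):

# etc/config.properties on each node
# Disable the experimental thread-per-driver scheduler and fall back
# to the previous time-sharing task executor.
experimental.thread-per-driver-scheduler-enabled=false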
https://github.com/trinodb/trino/blob/b7a161a422850c35b992f3ba26ac1a2f8bc9eb54/core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java#L110-L115
I will update if anything changes.
Let me know if I can provide more information.
cc: @martint @findepi
@davseitsev appreciate the detailed analysis, that's very helpful!
Thanks a lot for sharing your findings and a potential solution for the problem. We had a similar issue on our Trino cluster that started after we upgraded Trino from version 426 to 442. Workers started to have exchange issues more frequently, with logs like:
2024-04-25T10:24:55.409Z WARN node-state-poller-0 io.trino.metadata.RemoteNodeState Node state update request to https://xxx.ip:8081/v1/info/state has not returned in 10.02s
2024-04-25T10:24:55.410Z WARN node-state-poller-0 io.trino.metadata.RemoteNodeState Node state update request to https://xxx.ip:8081/v1/info/state has not returned in 10.02s
2024-04-25T10:24:58.184Z INFO Thread-47 io.airlift.bootstrap.LifeCycleManager JVM is shutting down, cleaning up
2024-04-25T10:24:58.185Z INFO Thread-42 io.airlift.bootstrap.LifeCycleManager JVM is shutting down, cleaning up
I see that experimental.thread-per-driver-scheduler-enabled has been enabled since version 438 (https://github.com/trinodb/trino/pull/20451), so we decided to disable it, and that helped stabilize the cluster. We still see some exchange issues, but the number of impacted queries was about 14x higher before disabling the feature.
Chiming in with more feedback - we're seeing exactly the same symptoms here after an upgrade from 436 to 444. Nodes are going down, queries are failing left and right, and the cluster is altogether very unstable.
If it helps at all, we're also seeing a ~30% query slowdown on smaller benchmarks before nodes go unresponsive. Running the benchmark for longer (not very long in the grand scheme of things, maybe 5 to 10 minutes at ~half our current prod concurrency) causes unresponsive nodes.
I can also confirm setting experimental.thread-per-driver-scheduler-enabled=false fixes it completely: performance is restored to Trino 436 levels and nodes are stable.
Are you able to narrow it down to specific query shapes? It would help immensely to debug if we can get some examples that cause the cluster to fall over, or that perform more slowly.
@martint let me dig into our benchmark and see if I can narrow it down to specific shapes 👍
@martint well it's been a bit of a wild goose chase. I have a list of 100 queries that do cause the issue, but I've been unable to pinpoint the issue to any single query in particular. Happy to share this list in private if it helps.
All these queries:
- Are read-only
- Use the Delta Lake connector
- Use S3 caching
@davseitsev are there any commonalities with your workloads? 🤔
I can't imagine this issue affects every Trino workload, or there would have been much more noise about it after the 436 release.
Do the queries have joins, union, many stages? Does it happen if you load the cluster with any specific type of query, or only when you mix them up?
> Do the queries have joins, union, many stages?
Yes on all three counts 😅
> Does it happen if you load the cluster with any specific type of query, or only when you mix them up?
It appears to be the mixing, or at least some level of concurrency, as I cannot reproduce this by executing queries one-by-one sequentially.
Our clusters can sustain ~40 concurrent queries on average, and the problem appears with ~10 concurrent queries, so it's not like we're pushing the envelope in terms of hardware either.
Thanks a lot. You saved my life :)
I first encountered this issue when I updated from v433 to v451. To pinpoint the cause, I downgraded to v439, but the problem persisted. So, I was in the process of checking the changes between v433 and v439.
@nesat, thanks to you mentioning #20451 in this issue, I was able to navigate to this issue page from that PR.
@Pluies is the issue fixed or is it still present?
@sopel39 I haven't tested recently. Which version would you expect a fix to be present in?
@martint ^^?
The issue is still present. We need to be able to reproduce it and figure out the root cause.
@Pluies I observed the same behaviour when running a certain set of queries, but I could not reproduce it when running them one by one. This must be something concurrency-related. One thing I observed is that some of the queries in the set (4 out of 80) fail with the error: operator not found bigint = bigint, before the cluster starts underperforming and the workers become unresponsive. These queries also ran successfully when triggered one by one.
> fail with the error: operator not found bigint = bigint
There used to be a bug in the JVM's JIT that was causing spurious errors like that (https://github.com/trinodb/trino/issues/5758), but it's believed to have been fixed.
I don't see how an error like that could be related to ThreadPerDriverTaskExecutor, so it is likely a separate problem. Do you have a full stack trace?
I found that SQL runs fast when experimental.thread-per-driver-scheduler-enabled=true, but the thread count is higher (version 443).
I tried disabling experimental.thread-per-driver-scheduler-enabled and still have the same symptoms.
We have been using Trino 442 for several months, but just last week we encountered this issue.
Any ideas what else to check? (More details in the Slack channel.)
Added diagnostics to help track down the issue: https://github.com/trinodb/trino/pull/23045
We have been dealing with stuck workers on our clusters that run merge and maintenance queries against Iceberg. Typically a given worker will fail a health check multiple times over a 20- or 30-minute period. To give a sense of the frequency, we run 2 of the same merge query across ~2000 tables every hour.
As with others in the thread, experimental.thread-per-driver-scheduler-enabled=false has made the issue disappear entirely. Happy to provide more info if it would help get to the bottom of this problem.
@jonathanmyers thanks for the additional info. How many queries do you run in parallel?
> As with others in the thread, experimental.thread-per-driver-scheduler-enabled=false has made the issue disappear entirely. Happy to provide more info if it would help get to the bottom of this problem.
What version are you on? Can you provide logs from the workers?
In 455 there is the change @martint mentioned that logs additional info about thread state.
We're on Trino 458, and we send 50 merge queries at a time, 100 for our Iceberg optimize tasks. We run a baseline of 8 m7g.8xlarge workers and scale up to 16 of those for table maintenance. I've enabled the debug logs and re-enabled thread-per-driver in our test environment, so I should be able to get those tomorrow. If there's anything specific you want from those logs, let me know; otherwise I'll try to get a sample from around the time a node starts hanging. In the past there has not been anything interesting in the logs: the rest of the cluster throws errors related to the node going down, but that's it. Hopefully the debug logs will shed some light.
> If there's anything specific you want from those logs, let me know; otherwise I'll try to get a sample from around the time a node starts hanging.
It should not be too much volume. If you enable the DEBUG log level on workers for io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor while the cluster is unstable, it should log a diagnostic snippet every 30s.
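For reference, a minimal sketch of enabling that logger statically, assuming the standard etc/log.properties file on the workers (the JMX route mentioned next avoids a restart):

# etc/log.properties (workers)
io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor=DEBUG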
You can do that at runtime via JMX.
Here are the debug logs from a stuck worker: stuck_worker_logs.txt
Internally, after upgrading to Trino 440 we faced the same issue. Notably, it happened only in clusters with a lot of writes; read-only clusters didn't show this issue.
Is there any clear update here? Should this continue to be disabled?
@lukasz-stec correct me if I am wrong, but IIRC the current state of understanding is that we get into trouble when a worker has plenty of runnable Drivers, which translates to plenty of Threads. Since with Java 23 those threads are OS threads, after crossing some threshold we make the JVM unstable and everything falls apart.
If you have a workload that can put a worker in such a state, then ThreadPerDriverTaskExecutor is not for you and you should turn it off. The situation may (should) improve with Java 24, when virtual threads are supposed to be usable.
cc: @wendigo
There has also been some progress in resolving this issue. We can reproduce it by running multiple (for example 8) concurrent TPC-DS q04 queries. This is because q04 has a lot of intermediate splits that all end up running on dedicated threads.
There was an effort by @hashhar to fix the issue with virtual threads, but since JVM 23 still pins carrier threads on synchronized blocks, it was not working well. One option is to wait for the improvements in virtual threads; another is to replace synchronized with java.util.concurrent.locks.
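To illustrate the second option, a hypothetical sketch (not the actual Trino code; the class and field names are made up) of swapping a synchronized method for a java.util.concurrent.locks.ReentrantLock, which parks a blocked virtual thread instead of pinning its carrier on JDK 21-23:

import java.util.concurrent.locks.ReentrantLock;

class SplitQueue
{
    private final ReentrantLock lock = new ReentrantLock();
    private int queuedSplits;

    // Before: "synchronized void enqueue()" - a virtual thread blocked on the
    // monitor would pin its carrier thread on JDK <= 23.
    void enqueue()
    {
        lock.lock();
        try {
            // critical section guarded by the lock instead of a monitor
            queuedSplits++;
        }
        finally {
            lock.unlock();
        }
    }
}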
@lukasz-stec this will pin either way in Guava's LoadingCache.