The Broker's Java threads hang on certain SQL queries
The Broker's Java threads responsible for running queries permanently hang on certain SQL queries.
Affected Version
Druid 0.22.1
Description
When certain SQL queries are made to Druid, the Broker's Java threads can hang. Each hung thread permanently consumes CPU and drives up garbage collection until the Broker is restarted.
A detailed description of the thread hang issue can be found in this Druid Forum thread: https://www.druidforum.org/t/observed-qtp-threads-hanging-on-broker-with-druid-0-22-1/7206
The Brokers are running on an r5.4xlarge footprint with 16 CPUs, 128 GB of memory, and an 80 GB heap.
The Broker runtime.properties config:
druid.service=druid/broker
druid.plaintextPort=8082
druid.tlsPort=8282
druid.server.http.numThreads=40
druid.server.http.gracefulShutdownTimeout=PT1M
druid.server.http.defaultQueryTimeout=60000
druid.broker.http.numConnections=30
druid.broker.http.maxQueuedBytes=10000000
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=16
druid.processing.numThreads=15
druid.query.scheduler.prioritization.strategy=threshold
druid.query.scheduler.prioritization.durationThreshold=PT24H
druid.query.scheduler.prioritization.adjustment=-1
druid.sql.planner.requireTimeCondition=false
druid.sql.planner.metadataSegmentCacheEnable=true
druid.sql.planner.metadataSegmentPollPeriod=60000
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.SysMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmCpuMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.client.cache.CacheMonitor","org.apache.druid.server.metrics.QueryCountStatsMonitor"]
The Broker runtime arguments:
java -Xms12G -Xmx12G -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dorg.jboss.logging.provider=slf4j -Dnet.spy.log.LoggerImpl=net.spy.memcached.compat.log.SLF4JLogger -Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown -Dlog4j.shutdownHookEnabled=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/var/log/druid.gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -XX:+UseBiasedLocking -Xms80G -Xmx80G -XX:MaxDirectMemorySize=40G -Dcom.sun.management.jmxremote -Dcom.sun.magement.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/opt/druid/conf/druid/_common/jmxremote.password -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=10.10.203.144 -cp /opt/druid/conf/druid/_common:/opt/druid/conf/druid/broker:/opt/druid/lib/* org.apache.druid.cli.Main server broker
We have recently determined the type of SQL query that causes this problem, and we can reproduce the issue on any of our clusters.
The query must:
- Use time bucketing.
- Filter on a numeric field with an empty string.
Example query:
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = ''
GROUP BY TIME_FLOOR("__time", 'PT1m')
The query does not cause a thread hang if (see the example variants after this list):
- Time bucketing is removed.
- The WHERE clause is changed to filter on a string field. (example: "metric" = '')
- The filter value is a non-empty string. (example: "value" = '1')
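For reference, here are the three non-hanging variants written out as full queries, using the same table and columns as the example above (a sketch based on the conditions just listed):

-- Variant 1: time bucketing removed
SELECT AVG("value")
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = ''

-- Variant 2: filter on a string field instead of a numeric field
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "metric" = ''
GROUP BY TIME_FLOOR("__time", 'PT1m')

-- Variant 3: non-empty filter value
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = '1'
GROUP BY TIME_FLOOR("__time", 'PT1m')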
This has been an issue for us because of queries coming from Grafana. If a dashboard variable is not set properly, it has no value, and the SQL queries sent to Druid contain these empty-string comparisons. If the comparison is against a numeric field, the Broker's CPU usage increases permanently until we restart it.
Interesting. I am debugging another similar-sounding issue with TIME_FLOOR causing hung queries when run on inline data. It's related to materialization of excessive numbers of time buckets. I wonder if something similar is happening here.
If you have a reproducible test case that someone else can use (including both dataset + query, or a unit test in CalciteQueryTest) that would be helpful. Otherwise we'll try to come up with a unit test for this.
I think https://github.com/apache/druid/pull/12944/files should have fixed this. @rohangarg can you confirm?
@sh-vs : Are you using null-compatible mode in your cluster? (Setting druid.generic.useDefaultValueForNull to false enables null-compatible mode.) If so, then the query you mentioned would be fixed by https://github.com/apache/druid/pull/12944/files, which is part of the 24.0 release.
@gianm : Yes, while looking at the issue and the changes above, I observed that time bucket generation for time-granularity queries could be improved. One crude way of stopping time-grain generation from getting out of hand is to limit the number of grains a query is allowed to generate. A better but more involved approach would be to derive the time grains from the data as we read it, instead of generating the time buckets first and then passing the data through them.
Just to add a bit more colour here: the mode matters because in SQL-compatible mode, the "value" = '' condition becomes an impossible condition for a numeric column. In non-SQL-compatible mode, it is a valid condition and is translated to "value" = 0.
When one of the conditions is impossible, the time filter after the AND clause gets removed by the planner and the query becomes an ETERNITY query. This is when the bug fixed by Rohan can surface.
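To make that concrete, here is a sketch of how the example filter behaves in each mode, based on the explanation above (the rewritten filter is illustrative, not actual planner output):

-- Non-SQL-compatible mode (druid.generic.useDefaultValueForNull=true):
-- the empty string is translated to the numeric default, so the filter stays
-- satisfiable and the time condition is kept.
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = 0
GROUP BY TIME_FLOOR("__time", 'PT1m')

-- SQL-compatible mode (druid.generic.useDefaultValueForNull=false):
-- "value" = '' can never match a numeric column, so the planner drops the time
-- filter along with the impossible condition, the query is planned over
-- ETERNITY, and the bug described above could surface.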
Yes, while looking at the issue and the changes above, I observed that time bucket generation for time-granularity queries could be improved. One crude way of stopping time-grain generation from getting out of hand is to limit the number of grains a query is allowed to generate. A better but more involved approach would be to derive the time grains from the data as we read it, instead of generating the time buckets first and then passing the data through them.
A similar problem happens on the ingestion side as well, when someone accidentally selects the wrong column as the timestamp and kaboom :) I think your crude solution makes sense. The limit could be high enough that we know something is wrong with the query if it gets hit, yet low enough that it prevents an OOM. Something like 100,000 seems like a good number. What do you think?
The crude approach is fine, but the problem with it is that if the actual data produces far fewer time buckets, we would reject the query based on an incorrect estimate of the bucket count derived from the time intervals. It might also be confusing for users to understand the reason for the failure. But if we see this problem more frequently, then adding a limit could be a stopgap.
Just raised #13206, which makes queries like this succeed instead of throwing an error (what we get with only #12944) or hanging (what we get before #12944).
Are you using null-compatible mode in your cluster?
Yes we are.
As a test I switched this option (druid.generic.useDefaultValueForNull). With the option set to true the query does not hang, but it returns invalid data (a bunch of rows of zeros). I then changed the filter to "value" IS NOT NULL, and that caused the thread to hang. I switched the setting back to false, after which the IS NOT NULL query ran and returned no data, while the previous "value" = '' query again caused the thread to hang.
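For clarity, the IS NOT NULL test mentioned above used the same example query with only the filter changed (a sketch, assuming the rest of the query is unchanged):

SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" IS NOT NULL
GROUP BY TIME_FLOOR("__time", 'PT1m')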
Thanks for all the feedback. We do have plans to upgrade to Druid 24 soon.