The Broker's Java threads hang on certain SQL queries
The Broker's Java threads responsible for running queries permanently hang on certain SQL queries.
Affected Version
Druid 0.22.1
Description
When certain SQL queries are made to Druid, the Broker's Java threads can hang. Each hung thread permanently consumes CPU and drives up garbage collection until the Broker is restarted.
A detailed description of the thread hang issue can be found in this Druid Forum thread: https://www.druidforum.org/t/observed-qtp-threads-hanging-on-broker-with-druid-0-22-1/7206
The Brokers are running on an r5.4xlarge footprint with 16 CPUs, 128 GB of memory, and an 80 GB heap.
The Broker runtime.properties config:
druid.service=druid/broker
druid.plaintextPort=8082
druid.tlsPort=8282
druid.server.http.numThreads=40
druid.server.http.gracefulShutdownTimeout=PT1M
druid.server.http.defaultQueryTimeout=60000
druid.broker.http.numConnections=30
druid.broker.http.maxQueuedBytes=10000000
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=16
druid.processing.numThreads=15
druid.query.scheduler.prioritization.strategy=threshold
druid.query.scheduler.prioritization.durationThreshold=PT24H
druid.query.scheduler.prioritization.adjustment=-1
druid.sql.planner.requireTimeCondition=false
druid.sql.planner.metadataSegmentCacheEnable=true
druid.sql.planner.metadataSegmentPollPeriod=60000
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.SysMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmCpuMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.client.cache.CacheMonitor","org.apache.druid.server.metrics.QueryCountStatsMonitor"]
The Broker runtime arguments:
java -Xms12G -Xmx12G -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dorg.jboss.logging.provider=slf4j -Dnet.spy.log.LoggerImpl=net.spy.memcached.compat.log.SLF4JLogger -Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown -Dlog4j.shutdownHookEnabled=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/var/log/druid.gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -XX:+UseBiasedLocking -Xms80G -Xmx80G -XX:MaxDirectMemorySize=40G -Dcom.sun.management.jmxremote -Dcom.sun.magement.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/opt/druid/conf/druid/_common/jmxremote.password -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=10.10.203.144 -cp /opt/druid/conf/druid/_common:/opt/druid/conf/druid/broker:/opt/druid/lib/* org.apache.druid.cli.Main server broker
We have recently determined the type of SQL query that causes this problem, and we can reproduce the issue on any of our clusters.
The query must:
- Use time bucketing.
- Filter on a numeric field with an empty string.
Example query:
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = ''
GROUP BY TIME_FLOOR("__time", 'PT1m')
The query does not cause a thread hang if (see the example variants after this list):
- Time bucketing is removed.
- The WHERE clause is changed to filter on a string field. (example: "metric" = '')
- The filter value is a non-empty string. (example: "value" = '1')
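For reference, here are the three non-hanging variants written out as full queries, using the same table and columns as the example above (a sketch based on the conditions just listed):

-- Variant 1: time bucketing removed
SELECT AVG("value")
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = ''

-- Variant 2: filter on a string field instead of a numeric field
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "metric" = ''
GROUP BY TIME_FLOOR("__time", 'PT1m')

-- Variant 3: non-empty filter value
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = '1'
GROUP BY TIME_FLOOR("__time", 'PT1m')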
This has been an issue for us because of queries coming from Grafana. If a dashboard variable is not set properly, it has no value, and the SQL queries sent to Druid contain these empty-string comparisons. If the comparison is against a numeric field, the Broker's CPU usage increases permanently until we restart it.
Interesting. I am debugging another similar-sounding issue with TIME_FLOOR causing hung queries when run on inline data. It's related to materialization of excessive numbers of time buckets. I wonder if something similar is happening here.
If you have a reproducible test case that someone else can use (including both dataset + query, or a unit test in CalciteQueryTest) that would be helpful. Otherwise we'll try to come up with a unit test for this.
I think https://github.com/apache/druid/pull/12944/files should have fixed this. @rohangarg can you confirm?
@sh-vs : Are you using null-compatible mode in your cluster? (Setting druid.generic.useDefaultValueForNull to false enables null-compatible mode.) If so, then the query you mentioned would be fixed by https://github.com/apache/druid/pull/12944/files, which is part of the 24.0 release.
@gianm : Yes, while looking at the issue and the changes above, I observed that time bucket generation for time-granularity queries could be improved. One crude way of stopping time-grain generation from getting out of hand is to limit the number of grains a query is allowed to generate. A better but more involved approach would be to derive the time grains from the data as we read it, instead of generating the time buckets first and then passing the data through them.
Just to add a bit more colour here: the mode matters because in SQL-compatible mode, the "value" = '' condition becomes an impossible condition for a numeric column. In non-SQL-compatible mode, it is a valid condition and is translated to "value" = 0.
When one of the conditions is impossible, the time filter after the AND clause gets removed by the planner and the query becomes an ETERNITY query. This is when the bug fixed by Rohan can surface.
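To make that concrete, here is a sketch of how the example filter behaves in each mode, based on the explanation above (the rewritten filter is illustrative, not actual planner output):

-- Non-SQL-compatible mode (druid.generic.useDefaultValueForNull=true):
-- the empty string is translated to the numeric default, so the filter stays
-- satisfiable and the time condition is kept.
SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" = 0
GROUP BY TIME_FLOOR("__time", 'PT1m')

-- SQL-compatible mode (druid.generic.useDefaultValueForNull=false):
-- "value" = '' can never match a numeric column, so the planner drops the time
-- filter along with the impossible condition, the query is planned over
-- ETERNITY, and the bug described above could surface.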
Yes, while looking at the issue and the changes above, I observed that time bucket generation for time-granularity queries could be improved. One crude way of stopping time-grain generation from getting out of hand is to limit the number of grains a query is allowed to generate. A better but more involved approach would be to derive the time grains from the data as we read it, instead of generating the time buckets first and then passing the data through them.
A similar problem happens on the ingestion side as well, when someone accidentally selects the wrong column as the timestamp and kaboom :) I think your crude solution makes sense. The limit could be high enough that we know something is wrong with the query if it gets hit, yet low enough that it prevents an OOM. Something like 100,000 seems like a good number. What do you think?
The crude approach is fine, but the problem with it is that if the actual data produces far fewer time buckets, we would reject the query based on an incorrect estimate of the bucket count derived from the time intervals. It might also be confusing for users to understand the reason for the failure. But if we see this problem more frequently, then adding a limit could be a stopgap.
Just raised #13206, which makes queries like this succeed instead of throwing an error (what we get with only #12944) or hanging (what we get before #12944).
Are you using null-compatible mode in your cluster?
Yes we are.
As a test I switched this option (druid.generic.useDefaultValueForNull). With the option set to true the query does not hang, but it returns invalid data (a bunch of rows of zeros). I then changed the filter to "value" IS NOT NULL, and that caused the thread to hang. I switched the setting back to false, after which the IS NOT NULL query ran and returned no data, while the previous "value" = '' query again caused the thread to hang.
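For clarity, the IS NOT NULL test mentioned above used the same example query with only the filter changed (a sketch, assuming the rest of the query is unchanged):

SELECT AVG("value"), TIME_FLOOR("__time", 'PT1m')
FROM "testing:selfmonitor.metrics"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND "value" IS NOT NULL
GROUP BY TIME_FLOOR("__time", 'PT1m')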
Thanks for all the feedback. We do have plans to upgrade to Druid 24 soon.