druid icon indicating copy to clipboard operation
druid copied to clipboard

Add The Ability to View SQL Queries in Druid Web UI + Ability to Cancel Long Running Queries

Open acherla opened this issue 2 years ago • 12 comments

Description

Add the ability in the web UI to view currently running SQL queries (Queries executed through /sql api and those executed through the web console) w/ the ability to cancel currently running queries (using the cancel query api introduced in druid 0.23.0). The UI at a bare minimum should include the Query Creation Time, QueryID, SQL Query Executed, Current Runtime, Status (RUNNING, CANCELLED, COMPLETED), Return Code (200, 4xx, 5xxx etc...)

Motivation

The benefit of being able to view and/or cancel queries will allow the administrators of druid clusters to be able to better troubleshoot and debug SQL queries that are being executed against the druid platform in real time. Currently, debugging queries executed against the druid cluster is a long and tedious process that requires manually linking the queryID to a given Broker for every request executed against the platform (Requires external platforms like ELK or some log management system to correlate a queryID to a particular query being executed). This will simplify the overall troubleshooting experience and provide useful information for administrators to better manage the platform.

Possible approach

  1. The Broker should maintain a temporary cache ( w/ a configurable expiry?) of all queryId's that are being executed against the platform and the current state of those queries (Running, Cancelled/Stopped etc...).
  2. The broker then should expose an API endpoint that allows the router to query the broker metadatacache for all available queryId's to be displayed in the web console UI This API can also be used to pull all queryId's that are running against the cluster.
  3. Router Web Console UI displays all QueryID's + SQL Query + Button to cancel query (API now available in 0.23.0)

acherla avatar Jun 30 '22 10:06 acherla

The ability to cancel the query via console is already available (https://github.com/apache/druid/pull/11738).

BTW how long do these queries run typically run?

abhishekagarwal87 avatar Jun 30 '22 10:06 abhishekagarwal87

Being able to cancel a query via the console is not the same as the request above. Cancelling a query via the console only works for when executing a query in the webconsole itself. The requirement above is for cancelling any query thats executed against the druid brokers, and specifically to display which queries are currently running against the druid cluster.

Queries can run from a few seconds to a few minutes for larger queries. In either situation having a view in the web UI that displays what queries are running vs which queries are timing out/erroring or cancelled is pretty critical.

This would require each broker to broadcast (through an endpoint) what are the currently running queryId's on each broker, that the router can then poll for the current status of that queryid and/or optionally cancel that query.

acherla avatar Jun 30 '22 11:06 acherla

The router broadcasts DELETE requests to all the brokers. So you can send the cancel request to any router. The console does it the same way.

abhishekagarwal87 avatar Jul 04 '22 15:07 abhishekagarwal87

Hi @abhishekagarwal87 I am aware the router can broadcast a delete to all brokers, but thats like the tip of the iceberg. Having the router be able to delete a query doesn't really matter when its difficult to know which queryId's are actually running against druid.

acherla avatar Jul 04 '22 16:07 acherla

ok. I see. This is why I asked about the typical duration of your queries. I am not sure if there is a practical use for being able to cancel running queries from UI. Given the queries are short-lived. You said that you can troubleshoot better if you had the ability to view and cancel a running query. can you elaborate further on that? Like what will the user workflow be if they had this capability.

abhishekagarwal87 avatar Jul 04 '22 17:07 abhishekagarwal87

This feature seems like a realtime query monitoring. It's an interesting topic.

For administrators, sometimes it's useful to kill some long-time running queries. We usually do such operations on MySQL too. MySQL provides a show processlist SQL to print out all executing queries(including query id, query, elapsed). Based on this, users can determine which are slow queries and then they can choose to kill queries.

So I think the core ability that Druid should provide is to query running queries. If we have similar show processlist feature, users can kill queries they like. And then whether we need to provides visibility of running queries is another topic we need to discuss.

FrankChen021 avatar Jul 09 '22 03:07 FrankChen021

FWIW, unlike MySQL, the druid has a default query timeout of 5 minutes. Still interested in learning more about the troubleshooting workflow that this feature will enable.

abhishekagarwal87 avatar Jul 09 '22 11:07 abhishekagarwal87

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jan 06 '24 00:01 github-actions[bot]

not stale

abhishekagarwal87 avatar Jan 06 '24 03:01 abhishekagarwal87

FWIW, there is a tempting use case for this now we do have long-running queries.

abhishekagarwal87 avatar Jan 06 '24 03:01 abhishekagarwal87

@abhishekagarwal87, This would be an useful feature for us. It is kind of like a starting point of query insight tool. Since no one work on this. Let me give it a try; maybe a design first to reach some consensus.

Do we have a template for design?

kaisun2000 avatar Jan 11 '24 22:01 kaisun2000

Copy the slack channel comment here:

For the past two month, I worked on some query performance tuning in our system, mainly to reduce the query latency and specifically in the real time path. In the mean time, I find that a query insight tool like proposed above would be very valuable. More specifically, the following would be useful and some of them are missing: 1/ per segment query processing time 2/ per segment waiting time for the processing queue and current queue size 3/ per hydrant query processing time -- missing 4/ per hydrant waiting time thread time -- if paralleled, missing 5/ per query merge buffer acquisition time and current waiting queue size -- missing 6/ post segment processing grouper potentially spilling to disk time -- missing These information should not be limited to only brokers, but also data node (peons and historicals).

The main idea is attribute time spent for each query in each stage and report contention points statistics such as the thread pool queue length while waiting, or merge buffer queue length while waiting, or spilling to disk activity for groupers. report this data collectively to some UI so that admin can have a direct insights above queries and why they may be slow.

kaisun2000 avatar Mar 08 '24 02:03 kaisun2000