Superset becomes unresponsive if a database is not responding

Open aaronfeng opened this issue 4 years ago • 9 comments

Expected results

If there's an issue with a single database it should not crash the whole system.

Actual results

A couple of days ago we experienced a production outage when we noticed that all of the web frontends (~10) were unresponsive and had been removed by the ELB. After restarting all of the web frontends we were able to load some of the pages. However, it became unresponsive again shortly after. Web server logs didn't reveal any obvious errors. Eventually we noticed that the Databases tab doesn't load at all after a rolling restart. It turned out that our Hive server was not able to accept new connections because it had hit its thread limit. After rebooting the Hive server, Superset started to function properly again.

I don't believe many people were trying to load the Databases tab, but people were trying to run ad hoc queries using the SQL Editor. Loading the SQL Editor caused the Databases dropdown to load, which I believe is similar to loading the Databases tab. During this time the Databases and Schema dropdowns were blank.

We are running Superset 1.0.1 Docker image.

Screenshots

Didn't take any screenshots, but the Databases tab was completely blank, as if it was trying to load.

aaronfeng avatar Apr 09 '21 15:04 aaronfeng

Sounds severe! If you could upload logs, that would be helpful, @aaronfeng.

amitmiran137 avatar Apr 11 '21 06:04 amitmiran137

@amitmiran137 unfortunately I didn't see anything that was useful in the logs. I assume you mean web server logs? It was a lot of trial and error to debug the issue.

aaronfeng avatar Apr 11 '21 17:04 aaronfeng

I dealt with this directly. From what I can tell, several operations fail or take unnecessarily long when a DB connection fails, even though they definitely shouldn't. This behavior multiplied the complexity and extent of an otherwise simple outage, blocking access for all users on all other DBs.

When a database connection takes a long time to respond, we've seen the following endpoints also time out:

- List databases API: /api/v1/database/ (breaks SQL Lab)
- List Databases page: /databaseview/list/
- Database Edit/Save

It's not clear why any one of these pages would need to verify all connections before loading data. I haven't dived into the code, but this seems like a major design flaw. During our outage there was no way to debug or modify anything through the web interface. What's worse is that by design Superset doesn't allow deletion of entities whenever there are dependent tables (no cascade on delete), so permanently deleting any database is colossally painful. Anyone who has tried to delete one has been blocked by a maze of foreign key constraints. I realise not cascading on delete is safer, but not offering a safe way to do this makes administration unreasonably cumbersome.
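To make the complaint concrete, here is a minimal sketch of a listing routine that never waits longer than a fixed deadline for any per-database check. It is plain Python and not Superset's code. The names `list_databases` and `slow_probe` are hypothetical, and it is only an assumption that the affected pages contact each database at all.

```python
import concurrent.futures as cf
import time


def slow_probe(name, delay):
    """Hypothetical stand-in for a connectivity check; sleeps to simulate latency."""
    time.sleep(delay)
    return "ok"


def list_databases(databases, probe, timeout=0.2):
    """Return one row per database, waiting at most `timeout` seconds in total.

    `databases` maps a database name to its simulated latency in seconds.
    A probe that misses the deadline is reported as "timeout" rather than
    holding up the whole listing.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(databases) or 1)
    futures = {name: pool.submit(probe, name, delay)
               for name, delay in databases.items()}
    deadline = time.monotonic() + timeout
    rows = []
    for name, fut in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            status = fut.result(timeout=remaining)
        except cf.TimeoutError:
            status = "timeout"
        rows.append({"name": name, "status": status})
    # Do not wait for stuck probes. Their threads keep running in the
    # background, so a real driver would still need its own connect timeout.
    pool.shutdown(wait=False)
    return rows
```

Calling `list_databases({"metadata_ok": 0.0, "hive_stuck": 1.0}, slow_probe)` returns both rows after roughly 0.2 seconds, with the stuck entry marked as a timeout instead of blocking the page.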

CraigChaffee avatar Apr 12 '21 22:04 CraigChaffee

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

stale[bot] avatar May 02 '22 07:05 stale[bot]

While this issue is obviously stale, it still sounds gross. Did anyone on this thread happen to run into this again or gain any further insight they can add? Is this still a risk in (significantly) newer versions of Superset, e.g. 2.1.0?

CC @betodealmeida @nytai @bkyryliuk in case they have any insight into whether this is addressable and/or ought to remain open (despite the reported version of Superset no longer being officially supported).

rusackas avatar May 16 '23 22:05 rusackas

As a rule of thumb, we separate API endpoints that hit only the metadata database from API endpoints that hit analytical databases. Requests to the latter should be asynchronous and non-blocking (e.g., that's how we load function names for the autocomplete in SQL Lab). That being said, it could be that there are places where we're not doing that properly. I remember fixing a few use cases (including the function names), but it would be nice to do an audit.
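The separation described above can be illustrated with a generic asyncio sketch. This is not how Superset implements it. The names `metadata_endpoint`, `analytical_endpoint`, and `blocking_query` are made up for illustration, and the blocking call is simulated with `time.sleep`. The point is only that a metadata-only request can finish while a slow analytical request is still pending, and that the slow one is bounded by a timeout.

```python
import asyncio
import time


def blocking_query(seconds):
    """Hypothetical synchronous driver call against an analytical database."""
    time.sleep(seconds)
    return "rows"


async def analytical_endpoint(seconds, timeout):
    """Run the blocking call in a worker thread and give up after `timeout`."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_query, seconds), timeout
        )
    except asyncio.TimeoutError:
        # The worker thread still runs to completion; only the caller stops waiting.
        return "timed out"


async def metadata_endpoint():
    """Touches only the (fast) metadata store, so it never waits on the query."""
    await asyncio.sleep(0)  # yield once so the slow task gets started
    return "metadata ok"


async def main():
    start = time.monotonic()
    slow = asyncio.create_task(analytical_endpoint(0.5, timeout=0.1))
    meta = await metadata_endpoint()
    meta_latency = time.monotonic() - start
    result = await slow
    return meta, meta_latency, result
```

Running `asyncio.run(main())` returns the metadata response almost immediately, while the analytical request reports a timeout after about 0.1 seconds. `asyncio.to_thread` requires Python 3.9 or later.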

betodealmeida avatar May 16 '23 23:05 betodealmeida

We encounter this issue periodically on Superset 2.1.1. When Superset becomes unresponsive, we inspect Trino and identify a long-running query. Once we terminate that query, everything returns to normal.

iercan avatar Feb 27 '24 14:02 iercan

Just a note that we no longer support Superset 2.x. Is anyone able to repro this in 3.x?

rusackas avatar Feb 29 '24 23:02 rusackas

I have encountered this issue with Superset 3.1.0 as well. Any pointers on how to resolve this?

swaresh avatar Apr 05 '24 08:04 swaresh

I've encountered the same issue with Superset 4.0.2 on both Chrome and Firefox browsers. The query runs without any problems when using DBeaver (JDBC client) and completes in just 2 seconds.

SGH-N avatar Sep 19 '24 09:09 SGH-N

Hi @SGH-N, were you able to solve this problem? I am also facing the same issue. Could you please help?

Achintyarai22 avatar Dec 05 '24 11:12 Achintyarai22

No, we couldn't find a fix.

SGH-N avatar Dec 05 '24 13:12 SGH-N

Thank you for your response, @SGH-N.

Achintyarai22 avatar Dec 06 '24 05:12 Achintyarai22

We have hit this issue on 3.0.3, where a user's dashboard loaded a bunch of Trino queries that were stuck in the WAITING state. This led to the server being completely unresponsive, and even to liveness probe failures in our k8s deployment.
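One plausible mechanism for the liveness probe failures is worker exhaustion: if every worker is blocked on a stuck query, a health check has to queue behind them. The sketch below simulates that with a small thread pool. It is an assumption-laden toy and not a diagnosis of this deployment, since the thread does not state the worker configuration. The function name `health_check_latency` is hypothetical.

```python
import concurrent.futures as cf
import time


def health_check_latency(workers, stuck_requests, stuck_seconds):
    """Seconds a health check waits when `stuck_requests` block a pool of `workers`."""
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.monotonic()
        # Each stuck request occupies one worker for `stuck_seconds`.
        for _ in range(stuck_requests):
            pool.submit(time.sleep, stuck_seconds)
        # The health check only runs once a worker is free.
        probe = pool.submit(time.monotonic)
        return probe.result() - start
```

With two workers and two stuck requests of 0.3 seconds each, the health check waits for about 0.3 seconds. With a third worker it is answered right away. Scaled up to queries stuck for minutes, the same effect would outlast a typical probe timeout.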

kekwan avatar Dec 12 '24 02:12 kekwan

Thank you for reporting this issue. However, the issue was reported on a version of Superset that we no longer support, or it does not contain a valid Superset version. As of this moment, we only actively support Superset 4.0 and 4.1. To maintain a more actionable Issues backlog, we're going to close this issue for now. If you (or anyone reading this) are still experiencing this on a currently supported version of Superset, please either reopen this Issue, or file a new one with updated context (screenshots, reproduction steps) and we'll do our best to support it. Thank you.

michael-s-molina avatar Jan 23 '25 19:01 michael-s-molina