dagster
Dagster suddenly not queueing runs for a few hours / stuck assets
Dagster version
dagster, version 1.4.11
What's the issue?
Over the past few days, our Dagster instance, hosted on an AWS EC2 machine and connected to an RDS database, has been acting up. There have been hours-long intervals where it has a heartbeat but does not queue any of the scheduled runs. An asset also got stuck in the starting state for a few hours, which caused a large queue to build up.
Our running theory is an issue with the RDS database, but I was hoping someone could weigh in on whether they've experienced something similar. Here is an error that has been cropping up in Dagster itself:
Operation name: SingleScheduleQuery
Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout
[SQL: SELECT job_ticks.id, job_ticks.tick_body
FROM job_ticks
WHERE job_ticks.selector_id = %(selector_id_1)s OR job_ticks.selector_id IS NULL AND job_ticks.job_origin_id = %(job_origin_id_1)s ORDER BY job_ticks.timestamp DESC
LIMIT %(param_1)s]
[parameters: {'selector_id_1': '6a61a9d0f8f0a88b4b4344c6eca8caef4daba5e0', 'job_origin_id_1': 'c16b1cb509310a13315cad0c4e834091e723c7f2', 'param_1': 1}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Path: ["scheduleOrError","scheduleState","ticks"]
Locations: [{"line":11,"column":9}]
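One way to check whether this query is slow on the database side is to ask Postgres for its plan and to see how large `job_ticks` has grown. The following is a minimal diagnostic sketch, not a fix. It assumes a Postgres-backed storage and any DB-API connection (for example one from `psycopg2.connect`). The helper names `explain_tick_query` and `job_ticks_row_count` are hypothetical and not part of Dagster. The query text is copied from the error above.

```python
from contextlib import closing

# Query text copied from the error message above (psycopg2-style named parameters).
# Note: in SQL, AND binds tighter than OR, so the filter reads as
#   selector_id = X OR (selector_id IS NULL AND job_origin_id = Y).
TICK_QUERY = """
SELECT job_ticks.id, job_ticks.tick_body
FROM job_ticks
WHERE job_ticks.selector_id = %(selector_id_1)s OR job_ticks.selector_id IS NULL AND job_ticks.job_origin_id = %(job_origin_id_1)s ORDER BY job_ticks.timestamp DESC
LIMIT %(param_1)s
"""


def explain_tick_query(conn, selector_id, job_origin_id, limit=1, analyze=False):
    """Return the Postgres plan lines for the tick query (hypothetical helper).

    With analyze=False only the estimated plan is requested. With analyze=True
    Postgres actually runs the query, so it can hit the same statement timeout.
    """
    prefix = "EXPLAIN (ANALYZE, BUFFERS) " if analyze else "EXPLAIN "
    params = {
        "selector_id_1": selector_id,
        "job_origin_id_1": job_origin_id,
        "param_1": limit,
    }
    with closing(conn.cursor()) as cur:
        cur.execute(prefix + TICK_QUERY.strip(), params)
        # Each returned row holds one line of the plan text.
        return [row[0] for row in cur.fetchall()]


def job_ticks_row_count(conn):
    """Return the number of rows in job_ticks (hypothetical helper)."""
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT count(*) FROM job_ticks")
        return cur.fetchone()[0]
```

To use it, open a connection to the RDS instance with your own credentials, call `explain_tick_query(conn, selector_id, job_origin_id)` with the two IDs from the error parameters, and print the returned lines. A sequential scan over a very large `job_ticks` table in the plan would be consistent with the statement timeout, though other causes are possible.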
What did you expect to happen?
Runs are stopping and assets are stuck in the starting phase. This causes the queue to build up and scheduled assets not to run, all without any alert from Dagster.
How to reproduce?
No response
Deployment type
Other
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.