
Dagster suddenly not queueing runs for a few hours / stuck assets

Open khelzor31415 opened this issue 6 months ago • 1 comments

Dagster version

dagster, version 1.4.11

What's the issue?

Over the past few days, our Dagster instance, hosted on an AWS EC2 machine and connected to an RDS database, has been misbehaving. There have been hours-long intervals during which it still reports a heartbeat but does not queue any of the scheduled runs. An asset also got stuck in the starting state for a few hours, which caused a large queue to build up.

Our working theory is that the RDS database is at fault, but I was hoping someone could weigh in on whether they have experienced something similar. Below is an error that has been showing up in Dagster itself.

Operation name: SingleScheduleQuery

Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT job_ticks.id, job_ticks.tick_body 
FROM job_ticks 
WHERE job_ticks.selector_id = %(selector_id_1)s OR job_ticks.selector_id IS NULL AND job_ticks.job_origin_id = %(job_origin_id_1)s ORDER BY job_ticks.timestamp DESC 
 LIMIT %(param_1)s]
[parameters: {'selector_id_1': '6a61a9d0f8f0a88b4b4344c6eca8caef4daba5e0', 'job_origin_id_1': 'c16b1cb509310a13315cad0c4e834091e723c7f2', 'param_1': 1}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Path: ["scheduleOrError","scheduleState","ticks"]

Locations: [{"line":11,"column":9}]
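For anyone trying to narrow this down, here is a minimal diagnostic sketch, not something taken from Dagster itself. It assumes direct access to the Postgres database behind the instance through a psycopg2-style DB-API connection, and the helper name diagnose_tick_query is hypothetical. It reuses the statement from the error above unchanged. Because AND binds tighter than OR, the WHERE clause reads as "selector_id matches, OR (selector_id IS NULL AND job_origin_id matches)". Note that EXPLAIN ANALYZE actually executes the query, and that the sketch disables the statement timeout on its own connection so the plan can complete, which may take a while on a large table.

```python
from typing import Any, Dict, List, Tuple

# The statement from the error message, unchanged (psycopg2 "pyformat" placeholders).
TICK_QUERY = """
SELECT job_ticks.id, job_ticks.tick_body
FROM job_ticks
WHERE job_ticks.selector_id = %(selector_id_1)s
   OR job_ticks.selector_id IS NULL
  AND job_ticks.job_origin_id = %(job_origin_id_1)s
ORDER BY job_ticks.timestamp DESC
LIMIT %(param_1)s
"""


def diagnose_tick_query(conn: Any, params: Dict[str, Any]) -> Tuple[int, List[str]]:
    """Return (row count of job_ticks, EXPLAIN ANALYZE plan lines for the tick query).

    `conn` is assumed to be a DB-API connection to the Dagster Postgres database,
    for example one created with psycopg2.connect(...).
    """
    cur = conn.cursor()
    try:
        # Disable the statement timeout on this connection only, so the
        # diagnostic query is not cancelled the way the original one was.
        cur.execute("SET statement_timeout = 0")

        # How many tick rows have accumulated?
        cur.execute("SELECT count(*) FROM job_ticks")
        row_count = cur.fetchone()[0]

        # EXPLAIN ANALYZE runs the query and reports the plan with real timings.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + TICK_QUERY, params)
        plan = [row[0] for row in cur.fetchall()]
    finally:
        cur.close()
    return row_count, plan


# Usage (requires psycopg2; the connection string is a placeholder):
#   import psycopg2
#   conn = psycopg2.connect("postgresql://user:password@host:5432/dagster")
#   count, plan = diagnose_tick_query(conn, {
#       "selector_id_1": "6a61a9d0f8f0a88b4b4344c6eca8caef4daba5e0",
#       "job_origin_id_1": "c16b1cb509310a13315cad0c4e834091e723c7f2",
#       "param_1": 1,
#   })
#   print(count)
#   print("\n".join(plan))
```

If the plan shows a sequential scan or a large sort over a table with many rows, that would be consistent with the statement timeout above, but it would not by itself prove that the database is the root cause of the missing runs.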

What did you expect to happen?

Runs stop being queued and assets get stuck in the starting phase. This causes the queue to build up, and scheduled assets do not run, without Dagster alerting us.

How to reproduce?

No response

Deployment type

Other

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

khelzor31415, Aug 27 '24 03:08