dagster
dagster copied to clipboard
Race condition DuckDB in Dagster Essentials
What's the issue or suggestion?
I am following the Dagster University course Dagster Essentials and encountered a race condition at the end of Lesson 4: Asset dependencies.
The assets have the following lineage:
taxi_trips and taxi_zones both write data to the DuckDB database and manhattan_stats as well as trips_by_week read from it. With DuckDB it is only possible to have one open read/write connection or multiple read connections at the same time. In my case a race condition was triggered due to multiple open open read/write connections at the same time.
It is possible to set read_only=True when connecting to the database in manhattan_stats as well as trips_by_week, to reduce the problem. However, it is still possible to get a race condition with taxi_trips and taxi_zones, since both of them need a read/write connection. Nonetheless, it is possible to circumvent the problem with IOManagers or with using op_tags and setting the maximum parallel execution of assets with this tag to 1. But they are more advanced features.
So maybe it would be a solution to have only read connections for manhattan_stats as well as trips_by_week and add a warning that in case of an error due to a DuckDB connection issue it should be sufficient to re-execute the pipeline?
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
Thank you @kornerc - your proposition of using read-only connectors and concurrency limits for partitioned assets is quite elegant. I agree that it would be worthwhile to add a note of this in the lesson, as I recall a few other community members running into this issue too.
I will make a ticket for us internally to update the lesson accordingly, and will update here once complete. Thank you!
@cmpadden thank you for your reply and the information.
Another alternative solution came into my mind: using backoff when connecting to the database, as it is done in the DuckDB resource https://github.com/dagster-io/dagster/blob/f88e4905eadcca34a0b5c7d4f385185ecc660172/python_modules/libraries/dagster-duckdb/dagster_duckdb/resource.py#L51
This might be the simplest solution because it is not necessary to manually limit the concurrency and introducing new Dagster concepts.
Good call! I really like that approach. Thanks for following up.
I am following up on the above because this was an issue I was also running into and would be great to see this implemented!
Thanks again for the suggestion of using backoff. We updated the course material to include that.
You are welcome. Thanks for the great Dagster documentation and Dagster itself 🙂