
Race condition DuckDB in Dagster Essentials

kornerc opened this issue 1 year ago · 3 comments

What's the issue or suggestion?

I am following the Dagster University course Dagster Essentials and encountered a race condition at the end of Lesson 4: Asset dependencies.

The assets have the following lineage (Global_Asset_Lineage graph attached to the issue).

taxi_trips and taxi_zones both write data to the DuckDB database, and manhattan_stats as well as trips_by_week read from it. DuckDB only allows either a single open read/write connection or multiple concurrent read-only connections. In my case, a race condition was triggered because multiple read/write connections were open at the same time.

Setting read_only=True when connecting to the database in manhattan_stats and trips_by_week mitigates the problem. However, a race condition is still possible between taxi_trips and taxi_zones, since both need a read/write connection. The problem can be avoided entirely with IOManagers, or by tagging the writing assets with an op_tag and limiting parallel execution for that tag to 1 (a sketch follows below), but these are more advanced features.
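
For anyone hitting this before the lesson is updated, here is a minimal sketch of those two workarounds. The database path, tag name, and job name are illustrative and not the course's exact code; verify the details against your Dagster version:

```python
import duckdb
from dagster import Definitions, asset, define_asset_job

# Writers are tagged so their parallelism can be limited. The path is illustrative.
@asset(op_tags={"database": "duckdb"})
def taxi_trips() -> None:
    conn = duckdb.connect("data/staging/data.duckdb")
    # ... CREATE TABLE / INSERT the trips data ...
    conn.close()

@asset(op_tags={"database": "duckdb"})
def taxi_zones() -> None:
    conn = duckdb.connect("data/staging/data.duckdb")
    # ... CREATE TABLE / INSERT the zones data ...
    conn.close()

# Readers open the file read-only, so they never compete for the write lock.
@asset(deps=[taxi_trips, taxi_zones])
def manhattan_stats() -> None:
    conn = duckdb.connect("data/staging/data.duckdb", read_only=True)
    # ... SELECT / aggregate ...
    conn.close()

# Executor-level tag concurrency limit: at most one op tagged
# "database: duckdb" runs at a time, so the two writers never overlap.
trips_job = define_asset_job(
    "trips_job",
    config={
        "execution": {
            "config": {
                "multiprocess": {
                    "tag_concurrency_limits": [
                        {"key": "database", "value": "duckdb", "limit": 1}
                    ]
                }
            }
        }
    },
)

defs = Definitions(
    assets=[taxi_trips, taxi_zones, manhattan_stats],
    jobs=[trips_job],
)
```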

So maybe a solution would be to use read-only connections in manhattan_stats and trips_by_week, and to add a warning that, in case of an error caused by a DuckDB connection conflict, re-executing the pipeline should be sufficient?

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

kornerc · Jun 24 '24

Thank you @kornerc - your proposal of using read-only connections and concurrency limits is quite elegant. I agree that it would be worthwhile to add a note about this in the lesson, as I recall a few other community members running into this issue too.

I will make a ticket for us internally to update the lesson accordingly, and will update here once complete. Thank you!

cmpadden · Jun 25 '24

@cmpadden thank you for your reply and the information.

Another alternative solution came to mind: using a backoff when connecting to the database, as is done in the DuckDB resource (https://github.com/dagster-io/dagster/blob/f88e4905eadcca34a0b5c7d4f385185ecc660172/python_modules/libraries/dagster-duckdb/dagster_duckdb/resource.py#L51). This might be the simplest solution, because it does not require manually limiting concurrency or introducing new Dagster concepts.
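
For illustration, a backoff at the connection call could look roughly like this. connect_with_backoff is a hypothetical helper, not part of dagster-duckdb, and the exception type and retry parameters are assumptions to verify against your DuckDB version:

```python
import time

import duckdb

def connect_with_backoff(path, *, read_only=False, retries=5, base_delay=0.1):
    """Hypothetical helper: retry duckdb.connect while another process holds the lock."""
    for attempt in range(retries):
        try:
            return duckdb.connect(path, read_only=read_only)
        except duckdb.IOException:
            # Another connection currently holds DuckDB's read/write lock;
            # wait with exponentially increasing delay, then try again.
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```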

kornerc · Jun 26 '24

Good call! I really like that approach. Thanks for following up.

cmpadden · Jun 26 '24

I am following up on the above because I was also running into this issue, and it would be great to see this implemented!

aleexharris · Oct 21 '24

Thanks again for the suggestion of using backoff. We updated the course material to include that.

dehume · Dec 13 '24

You are welcome. Thanks for the great Dagster documentation and Dagster itself 🙂

kornerc · Dec 18 '24