
Fix using `ThreadRunner` with dataset factories

ElenaKhaustova opened this pull request · 23 comments

Description

Fixes https://github.com/kedro-org/kedro/issues/4007

The issue is non-deterministic: whether it appears depends on the order in which the threads execute.

The issue appears when `ThreadRunner` loads the same pattern dataset from multiple threads. Since patterns are resolved at runtime, the corresponding datasets are added to the catalog during execution: `load()` -> `_get_dataset()` -> `add()`. Inside `_get_dataset()` (https://github.com/kedro-org/kedro/blob/2a97dd4166ba20b5bf149cfeb2be6cebf9f5008c/kedro/io/data_catalog.py#L423) we check whether the dataset is in the catalog and then call `add()` based on the result of that check. The race occurs when one thread observes that the dataset is not in the catalog, but another thread adds it immediately afterwards. The first thread then acts on that stale state and tries to add the dataset to the catalog a second time, which raises an error.
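The race is a classic check-then-act pattern. Below is a minimal sketch of the non-atomic sequence; `MiniCatalog` and its methods are hypothetical stand-ins for illustration, not Kedro's actual classes:

```python
class MiniCatalog:
    """Toy stand-in for the catalog; all names here are hypothetical."""

    def __init__(self):
        self._datasets = {}

    def add(self, name, dataset):
        if name in self._datasets:
            # Mirrors the real behaviour: registering a dataset twice is an error.
            raise ValueError(f"Dataset '{name}' has already been registered")
        self._datasets[name] = dataset

    def _get_dataset(self, name):
        # Check-then-act: NOT atomic. Two threads can both see `name`
        # missing, then both call add(); the second one raises above.
        if name not in self._datasets:
            dataset = object()  # pretend a pattern was resolved here
            self.add(name, dataset)
        return self._datasets[name]
```

Single-threaded, the check and the add always pair up correctly; only interleaved threads can fall into the gap between them.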

This issue does not affect `ParallelRunner` because objects are never shared between processes: each process has its own catalog.

Development notes

The root cause of the issue is that `catalog.load()` does not only read but also writes, and the catalog itself is currently not thread-safe. This is a larger problem that we can temporarily address as suggested here and solve properly during the runners redesign work.

Two solutions were considered:

  1. Add a `threading.Lock` around the dataset `load()` call (which writes by calling `add()` internally) for the threaded runner. With the lock, the "is the dataset in the catalog?" check and the subsequent `add()` happen atomically, because the whole `dataset.load()` is protected. The possible downside is reduced performance, but this should only matter when `load()` calls an external API; otherwise the GIL already prevents truly simultaneous reads.
  2. Resolve all patterns before passing the catalog to the runner, so that `load()` never needs to call `add()`, since all patterns are resolved up front from the pipeline's datasets. This forfeits lazy pattern resolution and will not work with dynamic pipelines.
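A sketch of the first option, assuming a hypothetical `LockedCatalog` class (not Kedro's actual implementation): the lock turns the check and the registration into a single atomic step.

```python
import threading


class LockedCatalog:
    """Hypothetical illustration of solution 1: a lock-protected load()."""

    def __init__(self):
        self._datasets = {}
        self._lock = threading.Lock()

    def load(self, name):
        with self._lock:
            # The check and the add now happen atomically: no other thread
            # can interleave between seeing `name` missing and registering it.
            if name not in self._datasets:
                self._datasets[name] = object()  # resolved pattern dataset
            dataset = self._datasets[name]
        # In this sketch only the registration is locked; the PR as described
        # protects the whole dataset.load() call.
        return dataset
```

Because of the GIL, pure-Python reads were never truly concurrent anyway, so the lock mainly adds serialisation where `load()` blocks on external I/O.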

I prefer the first solution because writing to a shared object without locking it feels wrong.

Note: when testing, you need to run it multiple times to reproduce the error, as it is non-deterministic.
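One way to make the timing-dependent failure surface is a repeated stress loop; the `make_catalog` factory and the `"reviews#csv"` dataset name below are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor


def stress(make_catalog, attempts=200, workers=8):
    """Hammer a fresh catalog from many threads, many times over.

    Any exception raised by a racing add() in a worker thread is
    re-raised here via Future.result().
    """
    for _ in range(attempts):
        catalog = make_catalog()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(catalog.load, "reviews#csv")
                       for _ in range(workers)]
            for f in futures:
                f.result()  # re-raises if the race fired in that thread
    return attempts  # completed without error
```

With the unprotected check-then-act code, a run of this loop will eventually trip the duplicate-add error; with the lock in place it should complete every attempt.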

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • [ ] Read the contributing guidelines
  • [ ] Signed off each commit with a Developer Certificate of Origin (DCO)
  • [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [ ] Updated the documentation to reflect the code changes
  • [ ] Added a description of this change in the RELEASE.md file
  • [ ] Added tests to cover my changes
  • [ ] Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

ElenaKhaustova · Aug 15 '24 10:08