
[air/tune][multi-tenancy] Parallel runs can use the same experiment directory

Open • krfricke opened this issue 2 years ago • 1 comment

What happened + What you expected to happen

As uncovered in #35004, when running multiple runs (training or tuning) in parallel on the same cluster (multi-tenancy), we can be in a situation where two runs "share" the same experiment directory.

This can happen, for example, when the experiment directory name is explicitly set to the same value, or when the runs are started within the same whole second with the same trainable name (as they will then have the same "date suffix").

In the latter case, the experiment state files will also conflict.
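To illustrate the same-second collision, here is a minimal sketch. The helper `default_experiment_dir_name` and the exact suffix format are assumptions for illustration, not Ray's actual code; the only property that matters is that the suffix has one-second resolution.

```python
from datetime import datetime


def default_experiment_dir_name(trainable_name: str, started_at: datetime) -> str:
    # Hypothetical helper: the suffix format is an assumption. What matters is
    # that it is truncated to whole seconds, so sub-second differences vanish.
    return f"{trainable_name}_{started_at.strftime('%Y-%m-%d_%H-%M-%S')}"


# Two runs started 750 ms apart, within the same whole second.
run_a_start = datetime(2023, 5, 3, 14, 5, 7, 120_000)
run_b_start = datetime(2023, 5, 3, 14, 5, 7, 870_000)

name_a = default_experiment_dir_name("my_trainable", run_a_start)
name_b = default_experiment_dir_name("my_trainable", run_b_start)

# Both runs resolve to the same directory name and would share state files.
collides = name_a == name_b
```

Under this assumption, any two runs with the same trainable name that start within one wall-clock second end up in the same directory.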

These conflicts will lead to problems. At a minimum, restoration will not work for at least one of the runs, as its experiment state is either overwritten or at least older than the other run's state file.

More problems are bound to come up.

We should try to detect if an experiment directory is already actively used (e.g. using a time-based filelock and process probing) and raise an error if so.
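One possible shape for such a check, as a rough sketch only: a lock file created atomically in the experiment directory, holding the owner's PID and a heartbeat timestamp. Everything named here (`LOCK_NAME`, `STALE_AFTER_S`, `ExperimentDirInUseError`, `acquire_experiment_dir`) is hypothetical and not part of Ray. The PID probe assumes POSIX and only works when both runs are on the same host, so a multi-node setup with shared storage would have to rely on the heartbeat alone.

```python
import json
import os
import time
from pathlib import Path

LOCK_NAME = ".experiment.lock"  # hypothetical lock file name
STALE_AFTER_S = 60.0            # hypothetical heartbeat timeout in seconds


class ExperimentDirInUseError(RuntimeError):
    """Raised when another live run appears to own the experiment directory."""


def _pid_alive(pid: int) -> bool:
    # POSIX only: signal 0 checks for existence without delivering a signal.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def _lock_is_active(lock_path: Path) -> bool:
    try:
        owner = json.loads(lock_path.read_text())
        pid = int(owner["pid"])
        heartbeat = float(owner["heartbeat"])
    except (OSError, ValueError, KeyError, TypeError):
        # Assumption: an unreadable or malformed lock is treated as stale.
        # This can race with an owner that has created but not yet written it.
        return False
    fresh = (time.time() - heartbeat) < STALE_AFTER_S
    return fresh and _pid_alive(pid)


def acquire_experiment_dir(exp_dir: str) -> Path:
    """Claim exp_dir or raise ExperimentDirInUseError if it is actively used."""
    path = Path(exp_dir)
    path.mkdir(parents=True, exist_ok=True)
    lock_path = path / LOCK_NAME
    for _ in range(2):  # at most one retry after clearing a stale lock
        try:
            # O_EXCL makes creation atomic: only one process can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            if _lock_is_active(lock_path):
                raise ExperimentDirInUseError(
                    f"{exp_dir} is already in use by another run"
                )
            try:
                lock_path.unlink()  # stale lock: remove it and retry
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            json.dump({"pid": os.getpid(), "heartbeat": time.time()}, f)
        return lock_path
    raise ExperimentDirInUseError(f"could not acquire lock in {exp_dir}")
```

A run would call `acquire_experiment_dir(path)` before writing any experiment state and fail fast on `ExperimentDirInUseError`. A real implementation would also need to refresh the heartbeat periodically and remove the lock on clean shutdown; both are omitted here.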

Versions / Dependencies

master

Reproduction script

Issue Severity

Low: It annoys or frustrates me.

krfricke • May 03 '23 14:05

This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.

Please comment and remove the pending-cleanup label if you believe this issue should remain open.

Thanks for contributing to Ray!

cszhu • Jun 17 '25 00:06