[air/tune][multi-tenancy] Parallel runs can use the same experiment directory
What happened + What you expected to happen
As uncovered in #35004, when multiple runs (training or tuning) execute in parallel on the same cluster (multi-tenancy), two runs can end up "sharing" the same experiment directory.
This can happen, for example, when the experiment directory name is explicitly set to the same value for both runs, or when the runs are started within the same whole second with the same trainable name (they will then have the same "date suffix").
In the latter case, the experiment state files will also conflict.
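To illustrate the second case, here is a minimal sketch of how a second-resolution date suffix collides. The helper `experiment_dir_name` and the exact timestamp format are assumptions for illustration only, not Ray's actual naming code.

```python
from datetime import datetime


def experiment_dir_name(trainable_name: str, now: datetime) -> str:
    # Hypothetical helper: trainable name plus a date suffix that is
    # truncated to whole seconds (the format string is an assumption).
    return f"{trainable_name}_{now.strftime('%Y-%m-%d_%H-%M-%S')}"


# Two runs started 800 ms apart, but within the same whole second.
start_a = datetime(2023, 5, 4, 12, 0, 0, 100_000)
start_b = datetime(2023, 5, 4, 12, 0, 0, 900_000)

name_a = experiment_dir_name("my_trainable", start_a)
name_b = experiment_dir_name("my_trainable", start_b)

# Both runs resolve to the same directory name.
collides = name_a == name_b
print(name_a, name_b, collides)
```

Any file whose name is derived from the same truncated timestamp would collide in the same way.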
These conflicts will lead to problems. At a minimum, restoration will not work for at least one of the runs, because its experiment state is either overwritten or older than the other run's state file.
More problems are bound to come up.
We should try to detect whether an experiment directory is already in active use (e.g. using a time-based file lock and process probing) and raise an error if so.
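A rough sketch of what such a check could look like, using only the standard library. Everything here is hypothetical: the lock file name `.experiment_lock`, the `ExperimentDirInUseError` class, the `claim_experiment_dir` function, and the 60 second staleness threshold are illustration choices, not existing Ray APIs. The process probe uses `os.kill(pid, 0)`, which assumes POSIX and only works when both runs' drivers are on the same host.

```python
import json
import os
import tempfile
import time


class ExperimentDirInUseError(RuntimeError):
    """Hypothetical error: another live run appears to own the directory."""


def _pid_alive(pid: int) -> bool:
    # POSIX-only probe: signal 0 checks for existence without signalling.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_experiment_dir(path: str, stale_after_s: float = 60.0) -> str:
    """Claim `path` for this process, or raise if it looks actively used."""
    os.makedirs(path, exist_ok=True)
    lock_path = os.path.join(path, ".experiment_lock")  # hypothetical name
    payload = json.dumps({"pid": os.getpid(), "heartbeat": time.time()})
    try:
        # O_EXCL makes creation atomic: only one process can create the file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(lock_path) as f:
                owner = json.load(f)
        except ValueError:
            # Lock exists but is not readable yet: treat it as in use.
            raise ExperimentDirInUseError(path)
        fresh = time.time() - owner["heartbeat"] < stale_after_s
        other = owner["pid"] != os.getpid()
        if other and fresh and _pid_alive(owner["pid"]):
            raise ExperimentDirInUseError(
                f"{path} is in use by PID {owner['pid']}"
            )
        # Stale or dead owner: take the lock over. This step is not
        # race-free; a real implementation would need an atomic takeover.
        with open(lock_path, "w") as f:
            f.write(payload)
        return lock_path
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return lock_path


# Usage: the first claim on a fresh directory succeeds.
demo_dir = tempfile.mkdtemp()
lock_file = claim_experiment_dir(demo_dir)
```

For the time-based part to be meaningful, the owning run would also have to refresh the `heartbeat` field periodically, for example whenever it writes its experiment state, and remove the lock file on a clean exit.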
Versions / Dependencies
master
Reproduction script
Issue Severity
Low: It annoys or frustrates me.
This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.
Please comment and remove the pending-cleanup label if you believe this issue should remain open.
Thanks for contributing to Ray!