Guard rwlock file to prevent edits while the repo is not locked
Users were getting:
dvc.rwlock.RWLockFileCorruptedError: Unable to read RWLock-file '.dvc/tmp/rwlock'. JSON structure is corrupted
when running multiple pipelines in parallel.
The reason for this is, again, our unlocked(cmd) logic, which fails to re-acquire repo.lock if other pipelines are doing some heavy work or there are too many of them. That means rwlock cleans itself up on context manager exit while the repo is not protected, so multiple processes can write to it simultaneously.
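Roughly, the shape of the problem (a minimal sketch, not DVC's actual code; names are illustrative):

```python
from contextlib import contextmanager

@contextmanager
def unlocked(repo_lock):
    # Release the global repo lock for the duration of the command.
    repo_lock.release()  # other pipelines may now grab repo.lock
    try:
        yield
    finally:
        # Under heavy contention this acquire can fail/time out.
        # Execution then unwinds through the outer rwlock context
        # manager, which rewrites .dvc/tmp/rwlock while no guard
        # is held.
        repo_lock.acquire()
```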
Overall, proper locking is pretty hard. I don't see good options for Python in terms of IPC or anything that would be potentially faster / more reliable (e.g. could clean itself up if a process fails). For now I think it's better to keep this as a workaround until we reconsider parallel execution in general (e.g. deprecate it in favor of the queue?).
One way would be to make rwlock itself granular, e.g. .dvc/tmp/rwlocks/<path>.lock. The file name could be the path itself or a checksum of the path. That way, we can replace each file atomically without any guards.
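For illustration, a hedged sketch of how such a granular layout could look (the helper and layout here are hypothetical, not DVC's actual code):

```python
import hashlib
import os

def path_rwlock_file(tmp_dir, path):
    # One lock file per path; hashing keeps the file name short and
    # filesystem-safe even for nested or non-ASCII paths.
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return os.path.join(tmp_dir, "rwlocks", digest + ".lock")
```

Each such file could then be updated with a write-to-temp-file-plus-rename dance so a half-written lock file is never visible to readers (see the sketch at the end of the thread).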
@skshetry it's a bit more complicated, as far as I understand. It would have to deal with multiple lock files at once, plus there are R and W locks, so we would have to maintain a few locations. It probably can be done, but we would still have to guarantee atomicity, absence of deadlocks (ordering), etc. Locking in general can be very complicated, and the simpler the system, the better (my preference here: don't try to be smart :) ). Even in this "simple" scenario of one global lock + one internal rwlock, things have gotten very complicated because of just one additional piece - the internal unlock. We started hitting edge cases, corruptions, etc.
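For the ordering part specifically, the classic mitigation is to always take multiple locks in one canonical order, e.g. sorted by path. A sketch, assuming lock objects that expose a context-manager interface (hypothetical API, not DVC's):

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def acquire_all(locks_by_path):
    # Acquiring in sorted path order means two processes can never
    # hold locks in opposite orders, which rules out the classic
    # two-lock deadlock; ExitStack releases in reverse on exit.
    with ExitStack() as stack:
        for path in sorted(locks_by_path):
            stack.enter_context(locks_by_path[path])
        yield
```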
@efiop agreed, it feels weird and looks complicated. But if we think about it, rwlock was written in a way that requires protection (repo.lock serves as its guard). It just happened that the initial assumption - that repo.lock always protects it - was wrong.
Doing IPC right is very hard. Even in the current code I still see a few potential edge cases:
- the guard fails (too much pressure), leaving us with an rwlock that needs manual cleanup or has to be removed entirely. It's terrible UX.
- I would make rwlock writes use mv, to make them atomic for sure. As things stand, if something happens while a write is in progress, we'll get a corrupted file again (a sketch follows below).
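A sketch of that atomic-write idea (hypothetical helper, not DVC's actual function):

```python
import json
import os
import tempfile

def dump_rwlock(path, lock_data):
    # Write to a temp file in the same directory, then rename into
    # place: os.replace() is atomic on POSIX, so a crash mid-write
    # leaves the old rwlock intact instead of corrupted JSON.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as fobj:
            json.dump(lock_data, fobj)
        os.replace(tmp, path)  # the "mv"
    except BaseException:
        os.unlink(tmp)
        raise
```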
But the best locking logic is not to deal with it at all :) My take is that we can probably deprecate parallel repro runs in favor of the experiments queue.