Guard rwlock file to prevent edits while the repo is not locked
Users were getting:
dvc.rwlock.RWLockFileCorruptedError: Unable to read RWLock-file '.dvc/tmp/rwlock'. JSON structure is corrupted
when running multiple pipelines in parallel.
The reason for this is, again, our unlocked(cmd) logic, which fails to re-acquire repo.lock if other pipelines are doing some heavy work or there are too many of them. That means rwlock cleans itself up on context manager exit while the repo is not protected, so multiple processes can write to it simultaneously.
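Roughly, the shape of the problem (a minimal sketch, not DVC's actual code; names are illustrative):

```python
from contextlib import contextmanager

@contextmanager
def unlocked(repo_lock):
    # Release the global repo lock for the duration of the command.
    repo_lock.release()  # other pipelines may now grab repo.lock
    try:
        yield
    finally:
        # Under heavy contention this acquire can fail/time out.
        # Execution then unwinds through the outer rwlock context
        # manager, which rewrites .dvc/tmp/rwlock while no guard
        # is held.
        repo_lock.acquire()
```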
Overall, proper locking is pretty hard. I don't see good options for Python in terms of IPC or anything that would be potentially faster / more reliable (e.g. could clean itself up if a process fails). For now I think it's better to keep this as a workaround until we reconsider parallel execution in general (e.g. deprecate it in favor of the queue?).
One way would be to make rwlock itself granular, e.g. .dvc/tmp/rwlocks/<path>.lock. The file name could be the path itself or a checksum of the path. That way, we can replace each file atomically without any guards.
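For illustration, a hedged sketch of how such a granular layout could look (the helper and layout here are hypothetical, not DVC's actual code):

```python
import hashlib
import os

def path_rwlock_file(tmp_dir, path):
    # One lock file per path; hashing keeps the file name short and
    # filesystem-safe even for nested or non-ASCII paths.
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return os.path.join(tmp_dir, "rwlocks", digest + ".lock")
```

Each such file could then be updated with a write-to-temp-file-plus-rename dance so a half-written lock file is never visible to readers (see the sketch at the end of the thread).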
@skshetry it's a bit more complicated, as far as I understand. It would have to deal with multiple lock files at once, plus there are R and W locks, so we would have to maintain a few locations. It probably can be done, but we would still have to guarantee atomicity, absence of deadlocks (ordering), etc. Locking in general can be very complicated, and the simpler the system, the better (my preference here: don't try to be smart :) ). Even in this "simple" scenario of one global lock + one internal rwlock, things have gotten very complicated because of just one additional piece - the internal unlock. We started hitting edge cases, corruptions, etc.
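For the ordering part specifically, the classic mitigation is to always take multiple locks in one canonical order, e.g. sorted by path. A sketch, assuming lock objects that expose a context-manager interface (hypothetical API, not DVC's):

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def acquire_all(locks_by_path):
    # Acquiring in sorted path order means two processes can never
    # hold locks in opposite orders, which rules out the classic
    # two-lock deadlock; ExitStack releases in reverse on exit.
    with ExitStack() as stack:
        for path in sorted(locks_by_path):
            stack.enter_context(locks_by_path[path])
        yield
```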
@efiop agreed, it feels weird and looks complicated. But if we think about it, rwlock was written in a way that requires protection (repo.lock serves as its guard). It just happened that the initial assumption - that repo.lock always protects it - was wrong.
Doing IPC right is very hard. Even in the current code I still see a few potential edge cases:
- the guard fails (too much pressure), leaving us with an rwlock that needs manual cleanup or has to be removed entirely. It's terrible UX.
- I would make rwlock writes use mv, to make them atomic for sure. As things stand, if something happens while a write is in progress, we'll get a corrupted file again (a sketch follows below).
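A sketch of that atomic-write idea (hypothetical helper, not DVC's actual function):

```python
import json
import os
import tempfile

def dump_rwlock(path, lock_data):
    # Write to a temp file in the same directory, then rename into
    # place: os.replace() is atomic on POSIX, so a crash mid-write
    # leaves the old rwlock intact instead of corrupted JSON.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as fobj:
            json.dump(lock_data, fobj)
        os.replace(tmp, path)  # the "mv"
    except BaseException:
        os.unlink(tmp)
        raise
```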
But the best locking logic is not to deal with it at all :) My take is that we can probably deprecate parallel repro runs in favor of the experiments queue.