aim
Lock acquisition fails when training in phases
🐛 Bug
During multi-phase training, aim fails to acquire the lock in phase 2. It needs to acquire the lock that was created during phase 1.
To reproduce
Initiate a multi-phase training
Expected behavior
The expected behaviour is that aim acquires the lock based on the tag name being the same.
Environment
- Aim Version - Latest
- Python version - 3.7.13
- pip version - 20.1.1
- OS - Linux
Hi @vishalghor! Thanks for raising the issue. May I ask you to share some more details? Particularly the following information would be very helpful:
- Error message/stack trace (if available)
- Could you please specify what you mean by this:
aim acquires the lock based on the tag name being same.
Ideally, a small code snippet showing how the aim.Run is initialized/cleaned up between phases.
Hi @alberttorosyan ,
Below is the error trace:
  File "/home/vghorpad/upscalers_pytorch/upscalers_pytorch/aim.py", line 36, in __init__
    self.run = Run(repo=log_dir, run_hash=run_hash)
  File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/run.py", line 283, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only)
  File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/base_run.py", line 44, in __init__
    'meta', self.hash, read_only=read_only, from_union=True
  File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/repo.py", line 354, in request_tree
    return self.request(name, sub, read_only=read_only, from_union=from_union).tree()
  File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/repo.py", line 380, in request
    container = self._get_container(path, read_only=False, from_union=False)
  File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/repo.py", line 320, in _get_container
    container = RocksContainer(path, read_only=read_only)
  File "aim/storage/rockscontainer.pyx", line 104, in aim.storage.rockscontainer.RocksContainer.__init__
  File "aim/storage/rockscontainer.pyx", line 149, in aim.storage.rockscontainer.RocksContainer.writable_db
  File "aim/storage/rockscontainer.pyx", line 137, in aim.storage.rockscontainer.RocksContainer.db
  File "/opt/conda/lib/python3.7/site-packages/filelock/_api.py", line 177, in acquire
    raise Timeout(self._lock_file)
filelock._error.Timeout: The file lock '/data/aim/.aim/meta/locks/8974c36b5c794b22b6be0174' could not be acquired.
By 'aim acquires...' I mean that during training in phases, aim tries to reuse in phase 2 the same lock that was used in phase 1. If lock acquisition included some check on the tag name, this might help resolve errors like the one above. But there is most likely a much better approach.
Thanks for sharing the details @vishalghor.
The error message basically means that there is a process still holding the lock for the run 8974c36b5c794b22b6be0174. I assume that the phases of the training in your case are two separate processes? If that's the case, could you please make sure that the Phase I process is done before creating the aim.Run object in Phase II? If that is not possible for some reason, the Run.close() method can be used to free up all the resources the Run object holds.
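One way to apply this is to wrap each phase in a context manager that always calls Run.close(), then reopen the run in Phase II by its hash. This is only a minimal sketch, not a pattern prescribed by Aim. The names `open_run`, `train_two_phases`, `run_factory`, `phase_one`, and `phase_two` are hypothetical. Only `Run(repo=..., run_hash=...)` (as seen in the traceback) and `Run.close()` come from this thread.

```python
from contextlib import contextmanager


@contextmanager
def open_run(repo, run_hash=None, run_factory=None):
    """Open a run and guarantee close() is called when the block exits."""
    if run_factory is None:
        # Imported lazily so the helper can be exercised without Aim installed.
        from aim import Run
        run_factory = Run
    run = run_factory(repo=repo, run_hash=run_hash)
    try:
        yield run
    finally:
        # Frees the resources held by the Run object, as suggested above.
        run.close()


def train_two_phases(repo, phase_one, phase_two, run_factory=None):
    """Run both phases against the same run, closing it in between.

    phase_one and phase_two are hypothetical callables that take the run.
    """
    with open_run(repo, run_factory=run_factory) as run:
        run_hash = run.hash  # remember which run to resume in Phase II
        phase_one(run)
    # The first Run object has been closed before the second one is created.
    with open_run(repo, run_hash=run_hash, run_factory=run_factory) as run:
        phase_two(run)
    return run_hash
```

Because close() sits in a `finally` clause, it is also called when a phase raises, so a crashed Phase I should not leave its Run object open inside a process that keeps running.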
Thank you @alberttorosyan for the information. In my case, if I'm executing single-phase training, I don't see any issues running back-to-back runs. But in two-phase training, the same run needs to handle both phases. My first phase trains for around 150 epochs, after which I load the saved model and run 50 more epochs with a changed learning rate in phase 2. As per my understanding, I shouldn't need to close the run, as I want both phases to be part of the same run. Please let me know if I'm missing something and how to go about it.
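If both phases execute in one Python process (an assumption, since the thread does not confirm it), one option is to create the Run object once and pass it to both phases instead of constructing a second Run with the same hash. The sketch below shows that shape. `train_in_phases`, `train_one_epoch`, and the "phase" context key are hypothetical names, and `run` is expected to expose a `track(value, name=..., epoch=..., context=...)` method in the style of `aim.Run.track`.

```python
def train_in_phases(run, phases, train_one_epoch):
    """Track every phase into the same, already-open run object.

    phases: list of (phase_name, num_epochs, learning_rate) tuples.
    train_one_epoch: hypothetical callable taking the learning rate and
        returning that epoch's loss.
    Returns the total number of epochs trained.
    """
    epoch = 0
    for phase_name, num_epochs, learning_rate in phases:
        for _ in range(num_epochs):
            loss = train_one_epoch(learning_rate)
            # The epoch counter continues across phases; the context
            # records which phase produced each value.
            run.track(loss, name="loss", epoch=epoch,
                      context={"phase": phase_name})
            epoch += 1
    return epoch
```

For the setup described here, this might be called with `phases=[("phase_1", 150, lr_1), ("phase_2", 50, lr_2)]`, where `lr_1` and `lr_2` stand in for the two learning rates, which are not given in this thread. The run is then opened once before phase 1 and closed once after phase 2.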