aim: lock acquisition fails when training in phases

🐛 Bug

During multi-phase training, aim fails to acquire the lock in phase 2. It needs to acquire the lock that was created during phase 1.

To reproduce

Start a multi-phase training run.

Expected behavior

aim should acquire the lock when the tag name is the same.

Environment

Aim Version - Latest
Python version - 3.7.13
pip version - 20.1.1
OS - Linux

Jul 31 '22 13:07 vishalghor

Hi @vishalghor! Thanks for raising the issue. May I ask you to share some more details? Particularly the following information would be very helpful:

Error message/stack trace (if available)
Could you please specify what you mean by this:

aim acquires the lock based on the tag name being same.

Ideally a small code snippet showing how the aim.Run is initialized/cleaned-up between phases.

Aug 01 '22 06:08 alberttorosyan

Hi @alberttorosyan ,

Below is the error trace:

File "/home/vghorpad/upscalers_pytorch/upscalers_pytorch/aim.py", line 36, in __init__
   self.run = Run(repo=log_dir, run_hash=run_hash)
 File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/run.py", line 283, in __init__
   super().__init__(run_hash, repo=repo, read_only=read_only)
 File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/base_run.py", line 44, in __init__
   'meta', self.hash, read_only=read_only, from_union=True
 File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/repo.py", line 354, in request_tree
   return self.request(name, sub, read_only=read_only, from_union=from_union).tree()
 File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/repo.py", line 380, in request
   container = self._get_container(path, read_only=False, from_union=False)
 File "/home/vghorpad/.local/lib/python3.7/site-packages/aim/sdk/repo.py", line 320, in _get_container
   container = RocksContainer(path, read_only=read_only)
 File "aim/storage/rockscontainer.pyx", line 104, in aim.storage.rockscontainer.RocksContainer.__init__
 File "aim/storage/rockscontainer.pyx", line 149, in aim.storage.rockscontainer.RocksContainer.writable_db
 File "aim/storage/rockscontainer.pyx", line 137, in aim.storage.rockscontainer.RocksContainer.db
 File "/opt/conda/lib/python3.7/site-packages/filelock/_api.py", line 177, in acquire
   raise Timeout(self._lock_file)
filelock._error.Timeout: The file lock '/data/aim/.aim/meta/locks/8974c36b5c794b22b6be0174' could not be acquired.

By 'aim acquires...' I mean that during phased training aim tries to reuse in phase 2 the same lock that was used in phase 1. If lock acquisition included a check on the tag name, that might help resolve errors like the one above. There is most likely a better approach, though.

Aug 01 '22 06:08 vishalghor

Thanks for sharing the details, @vishalghor. The error message means that some process is still holding the lock for the run 8974c36b5c794b22b6be0174. I assume the training phases in your case are two separate processes? If so, could you please make sure that the Phase I process has finished before creating the aim.Run object in Phase II? If that is not possible for some reason, the Run.close() method can be used to free up all the resources the Run object holds.
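
A minimal sketch of that suggestion, assuming each phase opens its own Run object and can close it before the next phase starts. The helper names run_phase and aim_run_factory, the placeholder loss, and the "phase" context key are hypothetical; only Run(repo=..., run_hash=...), Run.track(), Run.hash and Run.close() come from aim.

```python
from contextlib import closing


def run_phase(run_factory, phase, epochs, start_epoch=0):
    """Open a run, track one phase, and close the run before returning its hash."""
    # closing() calls run.close() on exit, even if the phase raises,
    # so the run's file lock is released before the next phase opens it.
    with closing(run_factory()) as run:
        for epoch in range(start_epoch, start_epoch + epochs):
            loss = 1.0 / (epoch + 1)  # placeholder metric, not a real training loss
            run.track(loss, name="loss", epoch=epoch, context={"phase": phase})
        return run.hash


def aim_run_factory(repo, run_hash=None):
    """Hypothetical factory; aim is imported lazily, only when a run is created."""
    def factory():
        from aim import Run
        return Run(repo=repo, run_hash=run_hash)
    return factory
```

With this shape, phase 1 could be started as run_phase(aim_run_factory("/data/aim"), "phase1", 150), and the returned hash passed on as aim_run_factory("/data/aim", run_hash) for phase 2 with start_epoch=150, so both phases write to the same run without holding the lock at the same time.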

Aug 01 '22 07:08 alberttorosyan

Thank you @alberttorosyan for the information. With single-phase training I don't see any issues when executing runs back to back. But in two-phase training the same run needs to handle both phases. My first phase trains for around 150 epochs, after which I load the saved model and train for 50 more epochs with a changed learning rate in phase 2. As I understand it, I shouldn't need to close the run, since I want both phases to be part of the same run. Please let me know if I'm missing something and how to go about it.
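
For illustration, one possible arrangement, assuming both phases execute in the same Python process (the thread does not confirm this): create the Run once and hand the same object to both phases, so a second Run with the same hash is never opened while the first still holds the lock. The function train_phases, the placeholder loss, and the learning-rate values are hypothetical; only Run.track() is an aim call.

```python
def train_phases(run, phases):
    """Track several phases into one already-open run; return total epochs tracked."""
    epoch = 0
    for phase, num_epochs, lr in phases:
        for _ in range(num_epochs):
            loss = 1.0 / (epoch + 1)  # placeholder metric, not a real training loss
            ctx = {"phase": phase}
            run.track(loss, name="loss", epoch=epoch, context=ctx)
            run.track(lr, name="learning_rate", epoch=epoch, context=ctx)
            epoch += 1  # the epoch counter continues across phases
    return epoch
```

The caller would create the run once, call train_phases(run, [("phase1", 150, 1e-3), ("phase2", 50, 1e-4)]), and call run.close() a single time after the last phase.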

Aug 01 '22 16:08 vishalghor
