RocksIOError: ....../CURRENT: no such file or directory
🐛 Bug
To reproduce
When I create hundreds of runs, I sometimes encounter the following error.
The script I run is as follows:
The command is: python create_runs.py -n 900
Environment
- Aim Version (e.g., 3.0.1): 3.16.1
- Python version: 3.9.15
- pip version: 22.3.1
- OS (e.g., Linux): CentOS 8
Hey @thuzhf! Thanks for reporting the issue. This looks strange as the script properly closes the Run 🤔. Will try to reproduce the issue on my side and get back to you.
I'm getting the same error when running with Docker
docker run --publish 43800:43800 aimstack/aim
Unable to find image 'aimstack/aim:latest' locally
latest: Pulling from aimstack/aim
f7a1c6dad281: Pull complete
92c59ec44e08: Pull complete
49e05d2afc27: Pull complete
554fa77b713e: Pull complete
ce447e68bf76: Pull complete
f8b80d0f69e6: Pull complete
496f8dd5cb07: Pull complete
94fbdfbf2302: Pull complete
a115b938a2e6: Pull complete
Digest: sha256:f6fe13ca887a50056be6e52796cdf36b13db10cc30e0bb779b0f63cd74f8064a
Status: Downloaded newer image for aimstack/aim:latest
Traceback (most recent call last):
File "/usr/local/bin/aim", line 8, in <module>
sys.exit(cli_entry_point())
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/aimcore/cli/up/commands.py", line 78, in up
repo_inst = Repo.from_path(repo, read_only=True)
File "/usr/local/lib/python3.9/site-packages/aim/_sdk/repo.py", line 56, in from_path
repo = cls(path, read_only=read_only)
File "/usr/local/lib/python3.9/site-packages/aim/_sdk/repo.py", line 67, in __init__
self._storage_engine = LocalStorage(self.path, read_only=read_only)
File "/usr/local/lib/python3.9/site-packages/aim/_sdk/local_storage.py", line 24, in __init__
self.container: StorageContainer = RocksContainer(self.path, read_only=read_only)
File "src/py-sdk/aim/_core/storage/rockscontainer.pyx", line 108, in aim._core.storage.rockscontainer.RocksContainer.__init__
File "src/py-sdk/aim/_core/storage/rockscontainer.pyx", line 150, in aim._core.storage.rockscontainer.RocksContainer.db
File "src/aimrocks/lib_rocksdb.pyx", line 1686, in aimrocks.lib_rocksdb.DB.__cinit__
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /opt/aim/.aim/data/CURRENT: No such file or directory'
...as the script properly closes the Run.
Is this a requirement? I suppose the correct program flow would then be:
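import aim

def do_stuff(run):
    ...

# Create the run before the try block, so that a failing constructor
# is not masked by a NameError on `run` in the finally clause.
run = aim.Run(...)
try:
    do_stuff(run)
finally:
    run.close()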
It might also be a good idea to mention Run.close() in the examples in the documentation.
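The try/finally flow above can also be packaged as a small context manager, so that every call site gets the close() for free. This is only a sketch: managed_run is a hypothetical helper name, not part of Aim, and it assumes nothing about the run object beyond a close() method.

```python
from contextlib import contextmanager


@contextmanager
def managed_run(make_run):
    """Create a run via make_run() and always call close() on exit.

    `managed_run` is a hypothetical helper, not an Aim API. `make_run`
    is any zero-argument callable returning an object with close().
    """
    run = make_run()  # created before the try, so constructor errors surface as-is
    try:
        yield run
    finally:
        run.close()  # runs on normal exit and when the body raises


# Intended usage (requires aim to be installed):
#
#     with managed_run(lambda: aim.Run(...)) as run:
#         do_stuff(run)
```

This does not help when the process is killed outright, since no Python cleanup code runs in that case.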
However, there are many scenarios in which run.close() may never be called. It would therefore be best if the next interaction with the Aim repository (e.g. creating a new run) "closed" any dangling runs that have timed out (after an expected heartbeat of, say, 30 seconds) by updating the relevant indexes. This should happen in a safe, atomic manner that cannot corrupt the repository, ideally without blocking for too long (more than 30 seconds), and with a warning to the user if it is temporarily blocked. I suppose such a mechanism would require writes to the filesystem or tmpfs every 30 seconds. From a brief look at the code, I'm guessing Aim already does something similar to this, if not exactly this.
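To make the heartbeat idea concrete, here is a minimal sketch of how stale runs could be detected from file modification times. It is purely illustrative and makes no claim about how Aim works internally: the heartbeat directory, the function names and the one-file-per-run layout are all assumptions, and the 30-second timeout is just the figure suggested above.

```python
import time
from pathlib import Path

HEARTBEAT_TIMEOUT = 30.0  # seconds; the figure from the comment, not an Aim setting


def touch_heartbeat(heartbeat_dir, run_id):
    """Mark run_id as alive by refreshing the mtime of its heartbeat file."""
    path = Path(heartbeat_dir) / run_id
    path.touch()  # creates the file, or updates its mtime if it already exists
    return path


def find_stale_runs(heartbeat_dir, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Return the sorted IDs of runs whose last heartbeat is older than timeout."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(heartbeat_dir).iterdir():
        if now - path.stat().st_mtime > timeout:
            stale.append(path.name)
    return sorted(stale)
```

A live run would call touch_heartbeat periodically from a background thread, and whichever process next opens the repository would call find_stale_runs and finalize the runs it returns. Making that finalization atomic and safe under concurrent access is the hard part, and it is not covered here.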
I have the same issue. I found out here that I wasn't closing the Aim run properly; this should definitely be in the docs!
In the meantime, is there any way to repair the database? As far as I can tell, the UI is not usable with this error.
I had the same issue, I ended up fixing the DB by running:
aim runs rm <ID of the broken run>
The ID can be found in the error message:
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /home/hgaiser/.aim/meta/chunks/<** ID of the broken run **>/CURRENT: No such file or directory'
Not a fix of course, but it might help someone reading this issue to at least continue using Aim.
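For anyone scripting this workaround, the run ID can be pulled out of the error text with a regular expression. The sketch below assumes the message contains a path of the form .aim/meta/chunks/<ID>/..., as in the error messages quoted in this thread; broken_run_id is a hypothetical helper name.

```python
import re

# Captures the directory name that follows ".aim/meta/chunks/" in the message.
_CHUNK_ID = re.compile(r"/\.aim/meta/chunks/([^/]+)/")


def broken_run_id(error_message):
    """Return the run ID embedded in a RocksIOError message, or None."""
    match = _CHUNK_ID.search(error_message)
    return match.group(1) if match else None
```

Pass it str(exc) for the caught exception. It returns None for errors that do not point at a chunk directory, such as the /opt/aim/.aim/data/CURRENT case from the Docker report, where removing a single run is unlikely to help.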
I'm experiencing a similar issue, but instead of the CURRENT file, Aim looks for a missing MANIFEST-000004 file.
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /Users/.../.aim/meta/chunks/e861886bfcb144459f7184ed/MANIFEST-000004: No such file or directory'
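Both variants of the error fit how RocksDB lays out a database directory: the CURRENT file holds the name of the latest MANIFEST file, so a directory is unreadable if either one is missing. The sketch below scans a repository for such directories. It assumes that repo_dir is the .aim directory and that every subdirectory of meta/chunks is a RocksDB database, which is inferred from the paths in the error messages here rather than from the Aim documentation; find_broken_chunks is a hypothetical helper name.

```python
from pathlib import Path


def find_broken_chunks(repo_dir):
    """Return (run_id, reason) pairs for chunk directories that look unreadable.

    Assumes the layout <repo_dir>/meta/chunks/<run_id>/, taken from the
    error messages in this thread.
    """
    chunks = Path(repo_dir) / "meta" / "chunks"
    if not chunks.is_dir():
        return []
    broken = []
    for chunk in sorted(p for p in chunks.iterdir() if p.is_dir()):
        current = chunk / "CURRENT"
        if not current.is_file():
            broken.append((chunk.name, "missing CURRENT"))
            continue
        # CURRENT contains the file name of the active manifest.
        manifest = current.read_text().strip()
        if not (chunk / manifest).is_file():
            broken.append((chunk.name, f"missing {manifest}"))
    return broken
```

The function only reads the directory tree, so it is safe to run against a live repository. Whether removing the reported runs is the right response is a separate question.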
Still getting this error after closing and removing the run...
Any development on this?
Possibly related fix https://github.com/aimhubio/aim/pull/1275