RocksIOError: ....../CURRENT: no such file or directory

Open thuzhf opened this issue 1 year ago • 8 comments

🐛 Bug

To reproduce

When I create hundreds of runs, I sometimes encounter the following error: (image)

The script I run is as follows: (image)

The command is: python create_runs.py -n 900

Environment

  • Aim Version (e.g., 3.0.1): 3.16.1
  • Python version: 3.9.15
  • pip version: 22.3.1
  • OS (e.g., Linux): CentOS 8

thuzhf avatar Mar 06 '23 19:03 thuzhf

Hey @thuzhf! Thanks for reporting the issue. This looks strange as the script properly closes the Run 🤔. Will try to reproduce the issue on my side and get back to you.

alberttorosyan avatar Mar 07 '23 06:03 alberttorosyan

I'm getting the same error when running with Docker:

docker run --publish 43800:43800 aimstack/aim
Unable to find image 'aimstack/aim:latest' locally
latest: Pulling from aimstack/aim
f7a1c6dad281: Pull complete 
92c59ec44e08: Pull complete 
49e05d2afc27: Pull complete 
554fa77b713e: Pull complete 
ce447e68bf76: Pull complete 
f8b80d0f69e6: Pull complete 
496f8dd5cb07: Pull complete 
94fbdfbf2302: Pull complete 
a115b938a2e6: Pull complete 
Digest: sha256:f6fe13ca887a50056be6e52796cdf36b13db10cc30e0bb779b0f63cd74f8064a
Status: Downloaded newer image for aimstack/aim:latest
Traceback (most recent call last):
  File "/usr/local/bin/aim", line 8, in <module>
    sys.exit(cli_entry_point())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aimcore/cli/up/commands.py", line 78, in up
    repo_inst = Repo.from_path(repo, read_only=True)
  File "/usr/local/lib/python3.9/site-packages/aim/_sdk/repo.py", line 56, in from_path
    repo = cls(path, read_only=read_only)
  File "/usr/local/lib/python3.9/site-packages/aim/_sdk/repo.py", line 67, in __init__
    self._storage_engine = LocalStorage(self.path, read_only=read_only)
  File "/usr/local/lib/python3.9/site-packages/aim/_sdk/local_storage.py", line 24, in __init__
    self.container: StorageContainer = RocksContainer(self.path, read_only=read_only)
  File "src/py-sdk/aim/_core/storage/rockscontainer.pyx", line 108, in aim._core.storage.rockscontainer.RocksContainer.__init__
  File "src/py-sdk/aim/_core/storage/rockscontainer.pyx", line 150, in aim._core.storage.rockscontainer.RocksContainer.db
  File "src/aimrocks/lib_rocksdb.pyx", line 1686, in aimrocks.lib_rocksdb.DB.__cinit__
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /opt/aim/.aim/data/CURRENT: No such file or directory'

mingfang avatar May 24 '23 15:05 mingfang

...as the script properly closes the Run.

Is this a requirement? I suppose the correct program flow would then be:

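import aim

def do_stuff(run):
    ...

# Create the run before the try block, so that `run` is always
# defined by the time the finally clause calls close().
run = aim.Run(...)
try:
    do_stuff(run)
finally:
    run.close()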

It might also be a good idea to mention Run.close() in the examples in the documentation.
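The same guarantee can be had with contextlib.closing from the standard library, which calls close() when the with block exits. This is a minimal sketch, not Aim's documented pattern: FakeRun is a hypothetical stand-in so the snippet runs without Aim installed, and it assumes aim.Run(...) could take its place because it exposes close() as discussed above.

```python
from contextlib import closing


class FakeRun:
    """Hypothetical stand-in for aim.Run; it only models close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def do_stuff(run):
    # Simulate a failure in the middle of the work.
    raise RuntimeError("training crashed")


run = FakeRun()
try:
    # closing() calls run.close() when the block exits, even on error.
    with closing(run):
        do_stuff(run)
except RuntimeError:
    pass

print(run.closed)  # True
```

This only covers exceptions raised inside the process; it does nothing if the process is killed outright.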

However, there are many scenarios in which run.close() may never run. It would therefore be best if, the next time the Aim repository is accessed (e.g. when a new run is created), Aim "closed" any dangling runs whose heartbeat has timed out (e.g. after 30 seconds) by updating the relevant indexes. This should happen in a safe, atomic way that cannot corrupt the repository, ideally without blocking for long (more than 30 seconds), and with a warning to the user if it is temporarily blocked. I suppose such a mechanism would require writes to the filesystem or tmpfs every 30 seconds. From a brief look at the code, I'm guessing Aim already does something similar to this, if not exactly this.
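To make the heartbeat idea concrete, here is a rough sketch of the detection half only. It is purely illustrative and is not how Aim works internally: the heartbeat directory, touch_heartbeat and find_stale_runs are all hypothetical names, and each run is assumed to refresh one file's modification time periodically.

```python
import os
import tempfile
import time
from pathlib import Path

HEARTBEAT_TIMEOUT = 30.0  # seconds; the example value from the proposal


def touch_heartbeat(heartbeat_dir: Path, run_id: str) -> None:
    """Mark run_id as alive by creating or refreshing its heartbeat file."""
    (heartbeat_dir / run_id).touch()


def find_stale_runs(heartbeat_dir: Path, now: float,
                    timeout: float = HEARTBEAT_TIMEOUT) -> list:
    """Return ids of runs whose last heartbeat is older than `timeout` seconds."""
    stale = []
    for path in heartbeat_dir.iterdir():
        if now - path.stat().st_mtime > timeout:
            stale.append(path.name)
    return sorted(stale)


with tempfile.TemporaryDirectory() as tmp:
    hb = Path(tmp)
    touch_heartbeat(hb, "live_run")
    touch_heartbeat(hb, "dead_run")
    # Simulate a run that stopped sending heartbeats two minutes ago.
    old = time.time() - 120
    os.utime(hb / "dead_run", (old, old))
    stale = find_stale_runs(hb, now=time.time())

print(stale)  # ['dead_run']
```

Closing the stale runs safely (locking, atomic index updates) is the hard part and is deliberately left out of this sketch.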

YodaEmbedding avatar Jun 07 '23 08:06 YodaEmbedding

I have the same issue, although I found out here that I wasn't closing the Aim run properly. This should be in the docs for sure!

In the meantime, is there any way to repair the database? As far as I can tell, the UI is not usable with this error.

hstojic avatar Aug 10 '23 10:08 hstojic

I had the same issue and ended up fixing the DB by running:

aim runs rm <ID of the broken run>

The ID can be found in the error message:

aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /home/hgaiser/.aim/meta/chunks/<** ID of the broken run **>/CURRENT: No such file or directory'

Not a real fix, of course, but it might help someone reading this issue to at least continue using Aim.
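Pulling the ID out of the message can be scripted. The sketch below assumes the path layout seen in this thread (.../.aim/meta/chunks/<run ID>/<file>); broken_run_id is a hypothetical helper, not part of Aim, and the sample message is copied from another report in this thread.

```python
import re
from typing import Optional

# Assumed layout, taken from the error paths quoted in this thread:
#   .../.aim/meta/chunks/<run ID>/<file name>: <reason>
CHUNK_PATH = re.compile(r"/\.aim/meta/chunks/([^/]+)/[^/:]+:")


def broken_run_id(error_message: str) -> Optional[str]:
    """Return the run ID embedded in a RocksIOError message, or None."""
    match = CHUNK_PATH.search(error_message)
    return match.group(1) if match else None


message = (
    "IO error: No such file or directory: While opening a file for "
    "sequentially reading: /Users/.../.aim/meta/chunks/"
    "e861886bfcb144459f7184ed/MANIFEST-000004: No such file or directory"
)
run_id = broken_run_id(message)
print(run_id)  # e861886bfcb144459f7184ed
```

The result can then be passed to the aim runs rm command shown above. For paths outside meta/chunks, such as the /opt/aim/.aim/data/CURRENT path in the Docker report, the helper returns None because no run ID is present.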

hgaiser avatar Sep 05 '23 14:09 hgaiser

I'm experiencing a similar issue, but instead of the CURRENT file, Aim looks for a missing MANIFEST-000004:

aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /Users/.../.aim/meta/chunks/e861886bfcb144459f7184ed/MANIFEST-000004: No such file or directory'
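In RocksDB, the CURRENT file holds the name of the active MANIFEST file, so both errors point to an incomplete database directory. The sketch below is a read-only check that lists which run directories are affected. It assumes the .aim/meta/chunks/<run ID>/ layout seen in the paths above; check_chunk and scan_chunks are hypothetical helpers, and the demo runs against a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path


def check_chunk(chunk_dir: Path) -> str:
    """Classify one RocksDB directory: 'ok', 'missing CURRENT' or 'missing MANIFEST'."""
    current = chunk_dir / "CURRENT"
    if not current.is_file():
        return "missing CURRENT"
    # CURRENT contains the name of the active manifest, e.g. "MANIFEST-000004".
    manifest_name = current.read_text().strip()
    if not (chunk_dir / manifest_name).is_file():
        return "missing MANIFEST"
    return "ok"


def scan_chunks(chunks_root: Path) -> dict:
    """Map each run directory name under chunks_root to its status."""
    return {d.name: check_chunk(d)
            for d in sorted(chunks_root.iterdir()) if d.is_dir()}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("good", "no_current", "no_manifest"):
        (root / name).mkdir()
    (root / "good" / "CURRENT").write_text("MANIFEST-000004\n")
    (root / "good" / "MANIFEST-000004").write_text("")
    (root / "no_manifest" / "CURRENT").write_text("MANIFEST-000004\n")
    report = scan_chunks(root)

print(report)
```

Against a real repository, the root would be the meta/chunks directory inside .aim. The check only reports problems and does not attempt any repair.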

awav avatar Oct 07 '23 10:10 awav

I had the same issue and ended up fixing the DB by running:

aim runs rm <ID of the broken run>

The ID can be found in the error message:

aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While opening a file for sequentially reading: /home/hgaiser/.aim/meta/chunks/<** ID of the broken run **>/CURRENT: No such file or directory'

Not a real fix, of course, but it might help someone reading this issue to at least continue using Aim.

Still getting this error after closing and removing the run...

Any development on this?

Wingmore avatar Jul 08 '24 23:07 Wingmore

Possibly related fix: https://github.com/aimhubio/aim/pull/1275

Wingmore avatar Jul 08 '24 23:07 Wingmore