Concurrent runs result in corrupt repositories
🐛 Bug
I'm launching ~10 jobs that write to the same Aim repository. This consistently results in a multitude of errors and a corrupted Aim repository. We ran a similar test on our cluster around December 2022 and, to my understanding, there were no issues at the time.
To reproduce
I'm on a Slurm cluster and I've created a minimal reproduction that includes the following job script:
test_aim_concurrent_writes.sh
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --cpus-per-task=2
#SBATCH --mem=4GB
# Execute the script for each job in the array
python test_aim_concurrent_writes.py
along with the following Python script:
test_aim_concurrent_writes.py
import time

import numpy
from aim import Run


class LearningCurve:
    def __init__(self, epochs, accuracy, seed: int | None = None):
        self.rng = numpy.random.RandomState(seed)
        self.asymptote = self.rng.uniform(accuracy * 0.95, 1)
        self.accuracy = accuracy
        self.epochs = epochs
        self.steepness = (-1 - (accuracy - self.asymptote)) / (epochs * (accuracy - self.asymptote))

    def __call__(self, epoch: float):
        accuracy = -numpy.abs(1 / (self.steepness * epoch + 1)) + self.asymptote
        return min(
            max(
                accuracy + self.rng.normal(0, max((self.asymptote - accuracy) / 4, 0.0001)),
                0,
            ),
            1,
        )


def main():
    epochs: int = 5_000
    steps: int = 1_000
    epoch_interval: float = 5.0
    step_interval: float = 2.0
    train_curve = LearningCurve(epochs, 1.0)
    valid_curve = LearningCurve(epochs, 0.95)
    test_curve = LearningCurve(epochs, 0.95)
    run = Run()
    for epoch in range(epochs):
        for step in range(steps):
            for name, foo in [
                ("train", train_curve),
                ("valid", valid_curve),
                ("test", test_curve),
            ]:
                run.track(
                    foo(epoch + step / steps),
                    name=name,
                    epoch=epoch,
                )
            run.track(
                step,
                name="step",
                epoch=epoch,
            )
            time.sleep(step_interval)
        time.sleep(epoch_interval)


if __name__ == "__main__":
    main()
To reproduce:
- Schedule the job (running 10 processes directly, without Slurm, may also work).
- Run `aim up`.
- Try to navigate to any page; errors show up everywhere.
I've listed the most common stack traces at the bottom of this post, but there are more errors than these, e.g., SQLite errors about not being able to acquire a lock, and other errors about files not being found.
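For readers without Slurm, the "just run 10 processes" idea from the steps above could be sketched roughly as follows. This is only an illustration: `launch_workers` is a hypothetical helper, not part of Aim, and whether a local filesystem triggers the same corruption as BeeGFS has not been verified here.

```python
import subprocess
import sys
from typing import List, Sequence


def launch_workers(command: Sequence[str], count: int = 10) -> List[int]:
    """Start `count` copies of `command` at once and return their exit codes."""
    # Start every process before waiting on any, so they all run concurrently.
    processes = [subprocess.Popen(list(command)) for _ in range(count)]
    # Wait for each process in turn and collect its exit code.
    return [process.wait() for process in processes]
```

Calling `launch_workers([sys.executable, "test_aim_concurrent_writes.py"])` from the directory that contains the script would start ten writers against the default repository in that directory, mirroring the job array.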
Expected behavior
All jobs successfully write data to Aim without error.
Environment
- Aim Version: 3.17.5
- Python version: 3.10.11
- OS: Ubuntu 18.04
- Filesystem: BeeGFS
Additional context
Stack Traces
`aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'`
ERROR: Exception in ASGI application
Traceback (most recent call last):
File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
result = await app( # type: ignore[func-returns-value]
File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
await responder(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
await self.app(scope, receive, self.send_with_gzip)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File ".env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
await func()
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 205, in metric_search_result_streamer
for trace in run_trace_collection.iter():
File ".venv/lib/python3.10/site-packages/aim/sdk/sequence_collection.py", line 119, in iter
for seq_name, ctx, run in self.run.iter_sequence_info_by_type(allowed_dtypes):
File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 464, in iter_sequence_info_by_type
for ctx_idx, run_ctx_dict in self.meta_run_tree.subtree('traces').items():
File "aim/storage/containertreeview.py", line 152, in items
File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 80, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'
`aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'`
Traceback (most recent call last):
File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
result = await app( # type: ignore[func-returns-value]
File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
await responder(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
await self.app(scope, receive, self.send_with_gzip)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File ".venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
await func()
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 278, in run_search_result_streamer
run_dict[run.hash]['traces'] = run.collect_sequence_info(sequence_types='metric')
File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 665, in collect_sequence_info
ctx_dict = self.idx_to_ctx(idx).to_dict()
File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 336, in idx_to_ctx
return self._tracker.idx_to_ctx(idx)
File ".venv/lib/python3.10/site-packages/aim/sdk/tracker.py", line 80, in idx_to_ctx
ctx = Context(self.meta_tree['contexts', idx])
File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
File "aim/storage/containertreeview.py", line 69, in aim.storage.containertreeview.ContainerTreeView.collect
File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
File "aim/storage/union.pyx", line 60, in aim.storage.union.ItemsIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'
Hey @JesseFarebro! Thanks for reporting this issue and providing scripts. Will take a look and get back to you.
@alberttorosyan is there any way to recover the corrupted Aim database?
@ArmandXiao I can't help with recovery but I work around this issue by having a separate Aim repository per run then using the CLI to copy all the runs to a single repository to analyze afterwards.
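A minimal sketch of the "separate repository per run" part of this workaround, under a few assumptions: the helper names `per_job_repo` and `open_run` are made up for illustration, the job index is read from Slurm's `SLURM_ARRAY_TASK_ID` variable, and Aim is assumed to initialize a repository at the path passed as `Run(repo=...)`.

```python
import os
from pathlib import Path


def per_job_repo(base_dir: str, env=os.environ) -> Path:
    """Return a repository directory unique to this job."""
    # SLURM_ARRAY_TASK_ID is set by Slurm for array jobs; fall back to the PID.
    task_id = env.get("SLURM_ARRAY_TASK_ID", str(os.getpid()))
    return Path(base_dir) / f"job-{task_id}"


def open_run(base_dir: str):
    """Create an Aim run in its own repository (requires `aim` to be installed)."""
    from aim import Run  # imported lazily so the path helper works without Aim

    repo_path = per_job_repo(base_dir)
    repo_path.mkdir(parents=True, exist_ok=True)
    # Assumption: passing a path as `repo` binds the run to that repository.
    return Run(repo=str(repo_path))
```

In the reproduction script, `run = Run()` would become `run = open_run("aim_repos")`, so that no two jobs write to the same repository. The later merge into a single repository is done with the Aim CLI as described in the comment; the exact subcommand and flags are not shown in this thread and are best checked with `aim runs --help`.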
Many thanks for the prompt reply. My concern is that I have over 5000 runs, and I know for certain that the latest 10 runs corrupted the repository. Is there a better workaround for this specific issue? Creating a repository per run for 5000 runs is quite cumbersome.
@JesseFarebro I figured out a workaround for this.
- Use `aim runs rm` to delete all the unfinished runs. The hashes of these runs can be found in `.aim/meta/progress`.
- Manually delete all runs in `.aim/meta/progress`.
This works for me. Please make a copy before deleting anything in case this method does not work for you.
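The cleanup steps above could be prepared with a small script along these lines. It is a sketch under assumptions taken from this comment only: that each entry under `.aim/meta/progress` is named after the hash of an unfinished run, and that `aim runs rm` accepts those hashes as arguments. The helper names are hypothetical, and the command is only built, never executed.

```python
import shutil
from pathlib import Path
from typing import List


def unfinished_run_hashes(repo_root: str) -> List[str]:
    """List entry names under <repo_root>/.aim/meta/progress, sorted."""
    progress_dir = Path(repo_root) / ".aim" / "meta" / "progress"
    if not progress_dir.is_dir():
        return []
    return sorted(entry.name for entry in progress_dir.iterdir())


def backup_repo(repo_root: str, backup_root: str) -> Path:
    """Copy the whole .aim directory before anything is deleted."""
    source = Path(repo_root) / ".aim"
    target = Path(backup_root) / ".aim"
    shutil.copytree(source, target)
    return target


def removal_command(hashes: List[str]) -> List[str]:
    """Build (but do not run) the `aim runs rm` command for the given hashes."""
    return ["aim", "runs", "rm", *hashes]
```

Printing `removal_command(unfinished_run_hashes("."))` after a successful `backup_repo(".", "../aim_backup")` gives a command line that can be reviewed by hand before it is run.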
Thanks @ArmandXiao for the workaround. I thought I lost all my previous runs 😱 for a moment.
And pinging to increase the priority if possible. IMO this is quite an important problem because:
- The database gets corrupted. Luckily old runs can be recovered, but it's scary nonetheless.
- The Aim UI for viewing previous experiments is no longer available.
- Launching each run in a different repo path and moving it afterwards is cumbersome, and runs cannot be compared while they are still running.