Concurrent runs result in corrupt repositories

Open JesseFarebro opened this issue 1 year ago • 6 comments

🐛 Bug

I'm launching ~10 jobs that all write to the same Aim repository. This consistently produces a multitude of errors and corrupts the repository. We ran a similar test on our cluster around December 2022 and, from my understanding, there were no issues at the time.

To reproduce

I'm on a Slurm cluster and I've created a minimal reproduction that includes the following job script:

test_aim_concurrent_writes.sh
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --cpus-per-task=2
#SBATCH --mem=4GB

# Execute the script for each job in the array
python test_aim_concurrent_writes.py

along with the following Python script:

test_aim_concurrent_writes.py
import time

import numpy
from aim import Run


class LearningCurve:
    def __init__(self, epochs, accuracy, seed: int | None = None):
        self.rng = numpy.random.RandomState(seed)
        self.asymptote = self.rng.uniform(accuracy * 0.95, 1)
        self.accuracy = accuracy
        self.epochs = epochs
        self.steepness = (-1 - (accuracy - self.asymptote)) / (epochs * (accuracy - self.asymptote))

    def __call__(self, epoch: float):
        accuracy = -numpy.abs(1 / (self.steepness * epoch + 1)) + self.asymptote
        return min(
            max(
                accuracy + self.rng.normal(0, max((self.asymptote - accuracy) / 4, 0.0001)),
                0,
            ),
            1,
        )


def main():
    epochs: int = 5_000
    steps: int = 1_000
    epoch_interval: float = 5.0
    step_interval: float = 2.0

    train_curve = LearningCurve(epochs, 1.0)
    valid_curve = LearningCurve(epochs, 0.95)
    test_curve = LearningCurve(epochs, 0.95)
    run = Run()

    for epoch in range(epochs):
        for step in range(steps):
            for name, foo in [
                ("train", train_curve),
                ("valid", valid_curve),
                ("test", test_curve),
            ]:
                run.track(
                    foo(epoch + step / steps),
                    name=name,
                    epoch=epoch,
                )

            run.track(
                step,
                name="step",
                epoch=epoch,
            )

            time.sleep(step_interval)

        time.sleep(epoch_interval)


if __name__ == "__main__":
    main()

To reproduce:

  1. Schedule the job (or maybe you could just run 10 processes without access to Slurm)
  2. Run aim up
  3. Try to navigate to any page and you'll see errors everywhere.
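
For readers without access to Slurm, the "just run 10 processes" variant mentioned in step 1 could be sketched as below. This is a minimal sketch, not part of the original report: `launch_concurrently` is a hypothetical helper name, and whether local processes on a local filesystem trigger the same corruption as array jobs on BeeGFS is untested here.

```python
import subprocess
import sys


def launch_concurrently(command, count=10):
    """Start `count` copies of `command` at once and return their exit codes."""
    # Start every process before waiting on any, so they overlap in time.
    procs = [subprocess.Popen(command) for _ in range(count)]
    return [proc.wait() for proc in procs]
```

Calling `launch_concurrently([sys.executable, "test_aim_concurrent_writes.py"])` from the directory that holds the script starts ten writers against the default repository in that directory.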

I've listed the most common stack traces at the bottom of this post, but there are more errors than these, e.g., sqlite errors about not being able to acquire a lock, and other errors about files not being found.

Expected behavior

All jobs successfully write data to Aim without error.

Environment

  • Aim Version: 3.17.5
  • Python version: 3.10.11
  • OS: Ubuntu 18.04
  • Filesystem: BeeGFS

Additional context

Stack Traces

`aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'`
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File ".env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 205, in metric_search_result_streamer
    for trace in run_trace_collection.iter():
  File ".venv/lib/python3.10/site-packages/aim/sdk/sequence_collection.py", line 119, in iter
    for seq_name, ctx, run in self.run.iter_sequence_info_by_type(allowed_dtypes):
  File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 464, in iter_sequence_info_by_type
    for ctx_idx, run_ctx_dict in self.meta_run_tree.subtree('traces').items():
  File "aim/storage/containertreeview.py", line 152, in items
  File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
  File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
  File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
  File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 80, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'
`aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'`
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File ".venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 278, in run_search_result_streamer
    run_dict[run.hash]['traces'] = run.collect_sequence_info(sequence_types='metric')
  File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 665, in collect_sequence_info
    ctx_dict = self.idx_to_ctx(idx).to_dict()
  File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 336, in idx_to_ctx
    return self._tracker.idx_to_ctx(idx)
  File ".venv/lib/python3.10/site-packages/aim/sdk/tracker.py", line 80, in idx_to_ctx
    ctx = Context(self.meta_tree['contexts', idx])
  File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
  File "aim/storage/containertreeview.py", line 69, in aim.storage.containertreeview.ContainerTreeView.collect
  File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
  File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
  File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
  File "aim/storage/union.pyx", line 60, in aim.storage.union.ItemsIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'

JesseFarebro avatar Jun 22 '23 18:06 JesseFarebro

Hey @JesseFarebro! Thanks for reporting this issue and providing scripts. Will take a look and get back to you.

alberttorosyan avatar Jun 26 '23 07:06 alberttorosyan

@alberttorosyan is there any way to recover the corrupted Aim database?

ArmandXiao avatar Aug 14 '23 02:08 ArmandXiao

@ArmandXiao I can't help with recovery but I work around this issue by having a separate Aim repository per run then using the CLI to copy all the runs to a single repository to analyze afterwards.
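
A minimal sketch of the one-repository-per-run idea, for illustration only. It assumes that `Run` accepts a `repo` path and initialises a repository there, and that jobs run as a Slurm array so that `SLURM_ARRAY_JOB_ID` and `SLURM_ARRAY_TASK_ID` are set; `per_job_repo` and `open_isolated_run` are hypothetical helper names, not part of Aim.

```python
import os
from pathlib import Path


def per_job_repo(base_dir, env=None):
    """Return a repository directory unique to the current Slurm array task."""
    env = os.environ if env is None else env
    # Fall back to local identifiers when not running under Slurm.
    job_id = env.get("SLURM_ARRAY_JOB_ID", "local")
    task_id = env.get("SLURM_ARRAY_TASK_ID", str(os.getpid()))
    return Path(base_dir) / f"job-{job_id}" / f"task-{task_id}"


def open_isolated_run(base_dir):
    """Create the per-task directory and open an Aim run that writes only there."""
    from aim import Run  # imported lazily so the path helper works without Aim

    repo_dir = per_job_repo(base_dir)
    repo_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: Run accepts a `repo` path and initialises a repository there.
    return Run(repo=str(repo_dir))
```

In the reproduction script, `run = Run()` would become `run = open_isolated_run("aim_repos")`, so no two jobs share a repository. The finished runs are then copied into a single repository with the Aim CLI, as described above.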

JesseFarebro avatar Aug 14 '23 02:08 JesseFarebro

Many thanks for the prompt reply. My concern is that I have over 5000 runs, and I know that the latest 10 runs are the ones that corrupted the repository. Is there a better workaround for this specific case? Creating a repository per run for 5000 runs is quite cumbersome.

ArmandXiao avatar Aug 14 '23 03:08 ArmandXiao

@JesseFarebro I figured out a workaround for this.

  1. Use aim runs rm to delete all the unfinished runs. The run hashes can be found in .aim/meta/progress.
  2. Manually delete all the entries in .aim/meta/progress.

This works for me. Please make a copy before deleting anything, in case this method does not work for you.
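
A minimal sketch of the preparation for these steps: back up the repository, then list the hashes to pass to aim runs rm. It assumes that each entry name under .aim/meta/progress is the hash of an unfinished run, as the steps above suggest; `backup_repo` and `unfinished_run_hashes` are hypothetical helper names, and the sketch deliberately deletes nothing.

```python
import shutil
from pathlib import Path


def backup_repo(repo_dir, backup_dir):
    """Copy the whole repository directory before touching anything in it."""
    return Path(shutil.copytree(repo_dir, backup_dir))


def unfinished_run_hashes(repo_dir):
    """List entry names under <repo_dir>/meta/progress, assumed to be run hashes."""
    progress = Path(repo_dir) / "meta" / "progress"
    if not progress.is_dir():
        return []
    return sorted(entry.name for entry in progress.iterdir())
```

For example, `backup_repo(".aim", ".aim.bak")` followed by `unfinished_run_hashes(".aim")` gives the list of hashes to remove, with a copy to fall back on if the cleanup goes wrong.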

ArmandXiao avatar Aug 14 '23 03:08 ArmandXiao

Thanks @ArmandXiao for the workaround. For a moment I thought I had lost all my previous runs 😱.

Also pinging to raise the priority if possible. IMO this is quite an important problem because:

  1. The database gets corrupted. Luckily old runs can be recovered, but it's scary nonetheless.
  2. The Aim UI for viewing previous experiments becomes unavailable.
  3. Launching each run in a different repo path and moving it once it's done is cumbersome, and runs cannot be compared while they are still running.

lkhphuc avatar Aug 24 '23 14:08 lkhphuc