
aimrocks RocksIOError on NFS share

Open pkubik opened this issue 2 years ago • 7 comments

πŸ› Bug

I initialized a directory with aim init on my NFS share and ran several jobs pointing to this directory. Initially all metrics and params were logged properly, but after a while logging stopped and, judging from the logs, it seems to have started failing repeatedly:

aimrocks.errors.RocksIOError: b'IO error: While appending to file: /nas/projects/auto-ml/models/pkubik/pc/.aim/meta/chunks/07a1749cee044a69a80226d82f0d0b0c/000027.log: Input/output error'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /nas/projects/auto-ml/models/pkubik/pc/.aim/meta/chunks/07a1749cee044a69a80226d82f0d0b0c/000027.log: Input/output error'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /nas/projects/auto-ml/models/pkubik/pc/.aim/meta/chunks/07a1749cee044a69a80226d82f0d0b0c/000027.log: Input/output error'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: Stale file handle'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: Stale file handle'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: Stale file handle'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: Stale file handle'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: Stale file handle'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: Stale file handle'

(this repeats several times, I guess on each logging attempt)

After manual inspection, the mentioned file doesn't seem to exist anymore. There is a similar log file with a higher number in the given directory - in this case 000039.log - along with a bunch of other files:

000024.sst  000026.sst  000028.sst  000032.sst  000038.sst  000039.log  000040.sst  CURRENT  IDENTITY  LOCK  LOG  LOG.old.1654955991303666  MANIFEST-000033  OPTIONS-000007  OPTIONS-000036

To reproduce

Set up the training job with:

import os
import aim

aim_run = aim.Run(
    run_hash=name,  # run identifier defined by the job
    repo=os.environ.get("AIM_ROOT_DIR"),  # shared NFS directory for multiple jobs
    experiment=experiment_name,
)

Proceed with training, logging a set of metrics once every minute.
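
For reference, a minimal sketch of the kind of periodic logging loop involved (the metric names, values, training-step helper, and the one-minute interval are illustrative, not the actual training code):

import time

for step in range(num_steps):  # num_steps comes from the training setup
    loss, accuracy = train_one_step()  # hypothetical training step
    aim_run.track(loss, name="loss", step=step, context={"subset": "train"})
    aim_run.track(accuracy, name="accuracy", step=step, context={"subset": "train"})
    time.sleep(60)  # metrics end up being written roughly once a minute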

Expected behavior

Metrics are logged continuously.

Environment

  • Aim v3.10.3
  • Python 3.8.12
  • pip 21.2.4
  • Ubuntu 20.04 from the NVIDIA NVCR Docker image nvcr.io/nvidia/pytorch:22.03-py3
  • On Kubernetes cluster

pkubik · Jun 11 '22 15:06

@mahnerak have you seen such an error previously? As I recall, you had an NFS setup with Aim?

alberttorosyan · Jun 11 '22 15:06

After I restarted the jobs, they seem to continue logging correctly without errors, but there are, as expected, gaps in the metrics (from when they were not logged due to the errors).

pkubik · Jun 11 '22 15:06

I’ve seen stale handle errors a lot in my NFS setup (in cases not involving aim at all). I guess it’s mainly an NFS issue, but I’ll look into it.

Please let me know if such NFS-specific issues happen in the future.

mahnerak · Jun 11 '22 15:06

Is it in general safer to run a single aim server on the cluster and attach my jobs by TCP?

pkubik · Jun 11 '22 15:06

I wouldn't necessarily focus on the stale handle errors that much. The most important part seems to be that a process is trying to append to a log file that no longer exists. Each job is likely only accessing its own files, and the UI is most likely not changing any files.

I guess it could occur due to the multi-processing that happens in PyTorch. There might also be some inconsistency in NFS, as it's quite old and quirky. Could it be that the error occurs due to some caching mechanism?
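
If the PyTorch multi-processing hypothesis is worth ruling out, one option is to make sure only a single process per job (e.g. rank 0) ever opens and writes to the Run; a rough sketch under that assumption (the RANK variable and the track helper are illustrative, not something Aim requires):

import os
import aim

# Only the main process (rank 0) opens the Run, so worker processes
# never touch the RocksDB files on the NFS share directly.
rank = int(os.environ.get("RANK", "0"))  # typically set by torchrun / torch.distributed

aim_run = aim.Run(repo=os.environ.get("AIM_ROOT_DIR")) if rank == 0 else None

def track(value, name, step):
    # No-op on non-zero ranks.
    if aim_run is not None:
        aim_run.track(value, name=name, step=step)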

Is the file structure I pasted at the top of the issue specific to aim, or does it come from rocksdb?

pkubik · Jun 13 '22 00:06

@pkubik wrt this:

Is it in general safer to run a single aim server on the cluster and attach my jobs by TCP?

aim server is used for remote tracking by several users. It works reasonably well, but it is still considered an experimental feature. We are planning to fix the known issues and make a stable release in the next milestone. Some things to be aware of if you're trying to use it now:

  • With many parallel trainings, remote tracking will slow down if the write frequency is too high (e.g. writing many metrics every second)

As for your second question, the .log and .sst files are internal to rocksdb. The .log file is a write-ahead log which gets flushed based on the options used. In general, a missing .log file should not occur, since it is managed by a single process, but this might be a bug on the rocksdb side. We have to dig further into this issue to know for sure.
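
For reference, remote tracking in that mode looks roughly like the following (a sketch; the host, port, repo path, and experiment name below are placeholders, and the default port may differ depending on the Aim version):

# On a dedicated tracking node (placeholder repo path and port):
#   aim server --repo /shared/central-repo --port 53800
#
# In each training job, point the Run at the server instead of the NFS path:
import aim

aim_run = aim.Run(
    repo="aim://tracking-server-host:53800",  # placeholder host:port of the aim server
    experiment="my-experiment",               # illustrative experiment name
)
aim_run.track(0.5, name="loss", step=0)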

alberttorosyan · Jun 13 '22 07:06

I've been experiencing this issue a lot myself.

Similar-ish scenario, maybe a bit more demanding: I have a single Aim repo in an NFS-mounted folder with many Lightning DDP SLURM jobs logging to it. My rate of logging is quite low, at once per 50 training iterations (which I think is OK since I'm training a slow model that progresses at around 2 train iters/sec). Since I'm using DDP, I have anywhere between 16–64 Lightning Trainers per SLURM job (at 1 Trainer per process / GPU), with about 8 jobs running. So I have between 8 * (16–64) = 128–512 processes logging to that single Aim repo across 8 Aim runs (one Aim run per job), every 50 training iterations for each of the runs. In other words, every ~2 minutes or so I'm logging from 128–512 processes to a single Aim repo.

It doesn't seem to cause any problems on the metrics side of things, but it does take a LONG time to load more than a couple of those runs in the Metrics Explorer. I'm talking several minutes, or sometimes it never loads (when it bugs out), requiring multiple restarts.

Also, every 1000 train iters I log ~2000 images at 64x64 resolution, per job / run. In summary, 8 jobs * 2000 images = 16,000 images every ~34 minutes. Now, loading even just two of these jobs in the Images Explorer is a total crapshoot. Depending on how many iterations' worth of images have been logged so far and how many runs I'm viewing concurrently in the Images Explorer, I'm either waiting 5-10 minutes or it bugs out and crashes. I can definitely reduce the number of images I'm logging, but still, I'm surprised that each time I change my query in the explorer I have to wait tens of minutes for it to check all runs. I would expect it not to sift through every possible log and instead just give me the most recent K iterations of logged images.
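
For reference, logging a batch of images of that kind typically looks like the following (a minimal sketch with illustrative shapes, names, and batch size, assuming aim_run is an open aim.Run):

import numpy as np
import aim

# Illustrative: track a small batch of 64x64 RGB images at a single training step.
images = [aim.Image((np.random.rand(64, 64, 3) * 255).astype(np.uint8)) for _ in range(16)]
aim_run.track(images, name="samples", step=0, context={"subset": "train"})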

Anyway, regarding this high-intensity usage of Aim: will 4.0 bring improvements here? I would love to log as much as I want and view it freely without waiting, without aimrocks errors, and so forth, limited only by my CPU bandwidth.

tesfaldet · Aug 28 '23 15:08