How to scale aim repo to support simultaneous writes from multiple training runs
❓Question
In order to compare multiple training runs side by side, I tried to make them all write to the same aim repo located on a shared EFS. However, when I do this, I see a huge number of error logs like the ones below printed in some of those runs:
Traceback (most recent call last):
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
My questions are:
- Does an aim repo currently support simultaneous writes from multiple training runs?
- If yes, how scalable is that? In my case I encountered these errors with just 3 simultaneous jobs.
- If not, is there a plan to support it?
- Could this be related to the fact that I'm using EFS to store the repo? If so, are there better alternatives?
Hello @jiyuanq! Thanks for the question. Aim is designed to support multiple parallel trainings. In fact, some users have reported running >100 parallel trainings without any issues. The error you have shared might be specific to your setup. May I ask you to provide more details about the environment used?
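For reference, the usual pattern for parallel trainings is that each training process opens its own Run against the shared repo; a minimal sketch, where the repo path, experiment name, and tracked values are placeholders rather than anything from this thread:

from aim import Run

# Each training process creates its own Run; only the repo path is shared
# (a placeholder path is used here for illustration).
run = Run(repo="/mnt/shared/experiment_tracking/aim", experiment="hparam-search")
run["hparams"] = {"activation": "relu", "lr": 1e-3}  # example hyperparameters

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in value for a real training loss
    run.track(loss, name="loss", step=step, context={"subset": "train"})

run.close()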
@alberttorosyan sure, I was running single-GPU training with pytorch 1.10.0, pytorch lightning 1.5.10, and aim 3.13.0 on python 3.8.10, and I was using the provided AimLogger. The training job was running normally for around 900 mini-batches, and then suddenly there were tons of error logs like the ones I shared, to the extent that I couldn't locate exactly where this started happening because the earlier logs were lost. Another thing is that this didn't actually fail my training run; it just got stuck, with these logs filling up, and never made any more progress.
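For context, the setup described above is roughly the standard AimLogger integration with PyTorch Lightning; a minimal sketch, where the repo path and experiment name are placeholders, not the actual values from these runs:

import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

# Placeholder repo path and experiment name; in this setup the repo lives on the shared EFS.
aim_logger = AimLogger(
    repo="/mnt/shared/experiment_tracking/aim",
    experiment="model-factory-training",
)

trainer = pl.Trainer(
    logger=aim_logger,
    log_every_n_steps=50,  # PyTorch Lightning's default logging interval
    max_epochs=10,
)
# trainer.fit(model, datamodule=dm)  # model and datamodule defined elsewhere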
Also, I'm wondering if using the remote tracking server would make it easier to scale writes?
@alberttorosyan unfortunately this seems to happen to me pretty often. I had two more runs hit the same error after around 40k mini-batches of training. I checked the logs and there's nothing more informative than what I already shared. Should I repost this as a bug? It's really making it challenging for me to use aim for critical training tasks, even though I really like the UI.
To share more information:
- The two jobs started encountering the issue at around the same time.
- While the training logs stopped updating at around mini-batch 24,500, it looks like aim metrics were updated up to 40k mini-batches, with a weird pattern (see below), so I guess training actually continued for a while and it's just that no more logs were printed?
- The EFS I'm using for aim is also used by other workloads, and I found that its network IO usage can sometimes reach almost 100%, but I'm not sure how that may impact aim writes.
I think what I can try on my end is to see whether switching to another EFS with more network bandwidth helps. At the same time, I also hope aim can provide a robust way of writing experiment logs.
@jiyuanq thanks for the additional info. The screenshot above looks strange; it seems to be a result of data corruption. Can you also share how many track method calls per second you have on average?
Will try to reproduce the issue, since at the moment I have no clue why this happens.
Meanwhile, you can give the remote tracking server a try.
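A rough sketch of what the remote tracking setup would look like; the host, port, and repo path below are placeholders, not values from this thread:

# On the machine hosting the repo (placeholder path and port):
#   aim server --repo /data/aim-repo --port 53800
#
# In each training job, point the client at the server instead of the shared filesystem:
from aim import Run

run = Run(repo="aim://tracking-host.internal:53800", experiment="hparam-search")
run.track(0.42, name="loss", step=0)
run.close()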
How can I find the exact number of track method calls per second?
If I were to estimate: I'm using pytorch lightning with 3 metrics and the default logging interval of 50 steps. It takes about 20s to run 50 steps, so track calls per second should be about 0.15 per training run.
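The estimate works out roughly as follows, assuming one track call per metric each time the logger fires:

metrics_per_logging_event = 3      # metrics tracked each time the logger fires
log_every_n_steps = 50             # PyTorch Lightning's default logging interval
seconds_per_step = 20 / 50         # ~20 s for 50 mini-batches

seconds_per_logging_event = seconds_per_step * log_every_n_steps  # = 20 s
calls_per_second = metrics_per_logging_event / seconds_per_logging_event
print(calls_per_second)            # 0.15 track calls per second per run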
that's exactly what I was looking for 🙌
Update: I switched to a new EFS with no other users and tried 7 runs today. 2 of those runs still hit the same issue... I'll try the remote tracker as well.
@jiyuanq, were those two runs started from scratch, or are they the same ones that were failing before?
I'm running experiments for hyperparameter tuning, so all 7 runs are similar to the failed ones except for some hyperparameter values (e.g. the activation function).