How to scale aim repo to support simultaneous writes from multiple training runs
❓Question
In order to compare multiple training runs side by side, I tried to make them all write to the same aim repo located on a shared EFS. However, when I do this, I see a huge number of error logs like the ones below printed in some of those runs:
Traceback (most recent call last):
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
My questions are:
- Does an aim repo currently support simultaneous writes from multiple training runs?
- If yes, how scalable is that? In my case I encountered these errors with just 3 simultaneous jobs.
- If not, is there a plan to support it?
- Could this be related to the fact that I'm using EFS to store the repo? If so, are there better alternatives?
Hello @jiyuanq! Thanks for the question. Aim is designed to support multiple parallel trainings. In fact, some users have reported running >100 parallel trainings without any issues. The error you have shared might be specific to your setup. May I ask you to provide more details about the environment used?
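For reference, the usual pattern for parallel trainings is that each training process opens its own Run against the shared repo; a minimal sketch, where the repo path, experiment name, and tracked values are placeholders rather than anything from this thread:

from aim import Run

# Each training process creates its own Run; only the repo path is shared
# (a placeholder path is used here for illustration).
run = Run(repo="/mnt/shared/experiment_tracking/aim", experiment="hparam-search")
run["hparams"] = {"activation": "relu", "lr": 1e-3}  # example hyperparameters

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in value for a real training loss
    run.track(loss, name="loss", step=step, context={"subset": "train"})

run.close()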
@alberttorosyan sure, I was running single-GPU training with pytorch 1.10.0, pytorch lightning 1.5.10, and aim 3.13.0 on python 3.8.10, and I was using the provided AimLogger. The training job was running normally for around 900 mini-batches, and then suddenly there were tons of error logs like the ones I shared, to the extent that I couldn't locate exactly where this started happening because the earlier logs were lost. Another thing is that this didn't actually fail my training run; it just got stuck, with these logs filling up, and never made any more progress.
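For context, the setup described above is roughly the standard AimLogger integration with PyTorch Lightning; a minimal sketch, where the repo path and experiment name are placeholders, not the actual values from these runs:

import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

# Placeholder repo path and experiment name; in this setup the repo lives on the shared EFS.
aim_logger = AimLogger(
    repo="/mnt/shared/experiment_tracking/aim",
    experiment="model-factory-training",
)

trainer = pl.Trainer(
    logger=aim_logger,
    log_every_n_steps=50,  # PyTorch Lightning's default logging interval
    max_epochs=10,
)
# trainer.fit(model, datamodule=dm)  # model and datamodule defined elsewhere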
Also, I'm wondering if using the remote tracking server would make it easier to scale writes?
@alberttorosyan unfortunately this seems to happen to me pretty often. I had two more runs hit the same error after around 40k mini-batches of training. I checked the logs and there's nothing more informative than what I already shared. Should I repost this as a bug? It's really making it challenging for me to use aim for critical training tasks, even though I really like the UI.
To share more information:
- The two jobs started encountering the issue at around the same time.
- While the training logs stopped updating at around mini-batch 24,500, it looks like aim metrics were updated up to 40k mini-batches, with a weird pattern (see below), so I guess training actually continued for a while and it's just that no more logs were printed?
- The EFS I'm using for aim is also used by other workloads, and I found that its network IO usage can sometimes reach almost 100%, but I'm not sure how that may impact aim writes.
I think what I can try on my end is to see whether switching to another EFS with more network bandwidth helps. At the same time, I also hope aim can provide a robust way of writing experiment logs.
@jiyuanq thanks for the additional info. The screenshot above looks strange; it seems to be a result of data corruption. Can you also share how many track method calls per second you have on average?
Will try to reproduce the issue, since at the moment I have no clue why this happens.
Meanwhile, you can give the remote tracking server a try.
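A rough sketch of what the remote tracking setup would look like; the host, port, and repo path below are placeholders, not values from this thread:

# On the machine hosting the repo (placeholder path and port):
#   aim server --repo /data/aim-repo --port 53800
#
# In each training job, point the client at the server instead of the shared filesystem:
from aim import Run

run = Run(repo="aim://tracking-host.internal:53800", experiment="hparam-search")
run.track(0.42, name="loss", step=0)
run.close()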
How can I find the exact number of track method calls per second?
If I were to estimate: I'm using pytorch lightning with 3 metrics and the default logging interval of 50 steps. It takes about 20s to run 50 steps, so track calls per second should be about 0.15 per training run.
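The estimate works out roughly as follows, assuming one track call per metric each time the logger fires:

metrics_per_logging_event = 3      # metrics tracked each time the logger fires
log_every_n_steps = 50             # PyTorch Lightning's default logging interval
seconds_per_step = 20 / 50         # ~20 s for 50 mini-batches

seconds_per_logging_event = seconds_per_step * log_every_n_steps  # = 20 s
calls_per_second = metrics_per_logging_event / seconds_per_logging_event
print(calls_per_second)            # 0.15 track calls per second per run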
that's exactly what I was looking for 🙌
Update: I switched to a new EFS with no other users and tried 7 runs today. 2 of those runs still hit the same issue... I'll try the remote tracker as well.
@jiyuanq, were those two runs started from scratch, or are they the same ones that were failing before?
I'm running experiments for hyperparameter tuning, so all 7 runs are similar to the failed ones except for some hyperparameter values (e.g. the activation function).