
Bug: AI-Toolkit gets confused with more than 2 GPUs? e.g. "GPU 0" is 1, "GPU 1" is 2, "GPU 2" shifts back to 0

Open ddjm1973 opened this issue 5 months ago • 6 comments

Did you already ask in the discord?

Yes, but there is no "bug reports" section?? And with so many things flashing and blinking, it is not obvious to this boomer where I am supposed to go or post my question.

You verified that this is a bug?

Yes. This should very obviously be a bug. The program uses GPU 0 when told to use "GPU 2", GPU 2 when told to use "GPU 1", and GPU 1 when told to use "GPU 0".

Describe the bug

As can be seen in this image, the system in question has three GPUs:

Image

  • GPU 0: RTX 2000E Ada Generation, 16 GB VRAM
  • GPU 1: RTX 3090, 24 GB VRAM
  • GPU 2: RTX 3090, 24 GB VRAM

After further testing, it seems that with more than two GPUs present in the system, AI-Toolkit gets confused and all GPU indices are shifted by one: "GPU 0" is actually GPU 1, "GPU 1" is actually GPU 2, and "GPU 2" is actually GPU 0. Trying to use "GPU 2" therefore in fact uses GPU 0, which is an RTX 2000E with only 16 GB VRAM, so I get OOM errors because the job is running on the wrong card.
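Putting the reported mismatches together, they form one consistent pattern, which a quick sanity check confirms (the mapping below is assumed from the observations above, not measured independently):

```python
# Reported mismatches from this thread: selected index -> card actually
# used. Assumed from the screenshots described above.
observed = {0: 1, 1: 2, 2: 0}

# Check: every entry is the same cyclic shift of +1 (mod 3), i.e. one
# consistent re-enumeration rather than random card selection.
assert all(observed[i] == (i + 1) % 3 for i in range(3))
print("consistent shift of +1 (mod 3)")
```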

As can be seen in this screenshot, I have a training job that is told to use "GPU 1", but it is actually using GPU 2:

Image

  • GPU 1 that was supposed to be used is sitting idle ...
  • the training job is de facto running on GPU 2
  • AI-Toolkit insists it is using "GPU 1" as per Dashboard

Dashboard shows the job is running on GPU 2 and not on "GPU 1" like the running Training Job claims:

Image

IMPACT:

  • Minor inconvenience, not a high priority. If you know that "GPU 0" is GPU 1, "GPU 1" is in fact GPU 2, and "GPU 2" is in fact GPU 0 ... sure, one can live with that.

But I am sure this qualifies as a bug and should be corrected at some point, if possible.

ddjm1973 avatar Jul 22 '25 05:07 ddjm1973

AI-Toolkit aligns its GPU indexing with the output of the console tool nvidia-smi, as this is the established convention for hardware enumeration, ensuring consistency and reliability across systems.
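For reference, nvidia-smi enumerates GPUs by PCI bus ID, while CUDA's default device ordering is `FASTEST_FIRST`, which sorts the most capable devices to the front. A sketch of how the two orderings can diverge on the reporter's hardware (assumptions: the toolkit's training process selects devices via CUDA, and the relative speed figures below are illustrative):

```python
# Assumed PCI (nvidia-smi) order, mirroring the reporter's system:
pci_order = ["RTX 2000E Ada (16 GB)", "RTX 3090 (24 GB)", "RTX 3090 (24 GB)"]
perf = [1, 3, 3]  # rough relative speed; the 2000E is the slowest card

# CUDA_DEVICE_ORDER=FASTEST_FIRST (the CUDA default): sort indices by
# descending speed. The stable sort keeps PCI order among the two
# equally fast 3090s, so the slow card drops to the end.
cuda_to_pci = sorted(range(len(pci_order)), key=lambda i: -perf[i])

for cuda_idx, pci_idx in enumerate(cuda_to_pci):
    print(f'CUDA "GPU {cuda_idx}" -> nvidia-smi GPU {pci_idx}: {pci_order[pci_idx]}')
```

Under these assumptions the resulting mapping is exactly the +1 shift reported above: CUDA 0 is nvidia-smi 1, CUDA 1 is nvidia-smi 2, and CUDA 2 is the nvidia-smi 0 card (the 16 GB RTX 2000E), which would also explain the OOM.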

tarnvaal avatar Jul 26 '25 00:07 tarnvaal

Output from nvidia-smi:

Image

=> AI-Toolkit is still wrong

Just look at this screenshot and the red circle:

Image

  • Training job claims it is running on "1" ... (red circle)
  • but the Dashboard clearly shows that "1" is idle, the job is in fact running on "2"

GPU indexing is not the issue!! We can count "1, 2, 3", or "0, 1, 2", or binary "00, 01, 10", or "Alpha, Beta, Gamma" for all I care; AI-Toolkit would still be watching the wrong card.

=> if AI-Toolkit claims it is running a training job on "1" then it really should be running that training job on "1" and not on "2" like the Dashboard clearly shows.
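If the mismatch stems from CUDA enumerating devices in a different order than nvidia-smi, a possible workaround sketch before launching the toolkit (assumption: AI-Toolkit's training process selects GPUs through CUDA; this is unverified against AI-Toolkit itself):

```shell
# Force CUDA to enumerate GPUs in the same PCI-bus order nvidia-smi
# uses, so an index refers to the same physical card everywhere.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# Optionally pin the job to one physical card by nvidia-smi index;
# inside the process, that card then appears as device 0.
export CUDA_VISIBLE_DEVICES=1
```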

ddjm1973 avatar Jul 26 '25 10:07 ddjm1973

More screenshots added, as more clarification seems necessary:

Image
  • the training job is supposed to run on "GPU 0: RTX 2000E Ada Generation" (very masochistic, I know ...) but is instead running on "GPU 1: RTX 3090" ... in this particular case that's a good thing, as the RTX 2000E would never be able to finish a training run ...

Image

  • the training job is supposed to run on "GPU 1: RTX 3090" but is instead running on "GPU 2: RTX 3090". In this particular case it does not matter too much, as both cards are RTX 3090s. But AI-Toolkit is now monitoring the wrong card ...

Image

Image

Image

As can be seen here: The training job is watching "GPU 1" ... but the job is actually running on "GPU 2" as can be seen on the Dashboard.

=> not only is AI-Toolkit using the wrong card ("2" instead of "1", as instructed), it is now also watching the wrong card.

But here is where it gets annoying:

Image
  • the Training job is supposed to be running on "GPU 2: RTX 3090" but it is actually trying to use "GPU 0: RTX 2000E Ada Generation"

... and this of course fails; the job gets aborted with OOM errors.

Hence my conclusion that this must be a bug.

ddjm1973 avatar Jul 26 '25 22:07 ddjm1973

I would like to ask: does it support multi-GPU training?

yatoubusha avatar Aug 01 '25 06:08 yatoubusha

I would like to ask: does it support multi-GPU training?

No. Consider that using two GPUs would halt the calculations happening at tera-flop/peta-flop speeds and route them through a memory controller/bus in the GHz range.

jargoman avatar Oct 17 '25 16:10 jargoman

No. Consider that using two GPUs would halt the calculations happening at tera-flop/peta-flop speeds and route them through a memory controller/bus in the GHz range.

What on earth are you babbling about? NVIDIA cards communicate P2P via PCIe.

And this happens at the end of a step for a full batch, or even multiple steps with Accelerate SGD.
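To make that point concrete, here is a toy data-parallel sketch (pure Python; the two-worker setup, shards, and learning rate are illustrative assumptions, not AI-Toolkit's implementation): each "GPU" computes gradients on its own data shard independently, and the only cross-device communication is one gradient average per optimizer step.

```python
def local_gradient(shard, w):
    # Toy gradient of mean squared error for the model y = w * x
    # against the target y = 2 * x, computed on one worker's shard.
    return sum(2 * (w * x - 2 * x) * x for x in shard) / len(shard)

def all_reduce_mean(grads):
    # The single synchronization point per step: average gradients
    # across workers (what NCCL/PCIe P2P would do on real GPUs).
    return sum(grads) / len(grads)

shards = [[1.0, 2.0], [3.0, 4.0]]  # data split across two "GPUs"
w = 0.0
for step in range(100):
    grads = [local_gradient(s, w) for s in shards]  # parallel compute
    g = all_reduce_mean(grads)                      # one sync per step
    w -= 0.01 * g
print(round(w, 2))  # converges toward the true weight 2.0
```

The compute inside each step dwarfs the once-per-step exchange, which is why the interconnect being "only" GHz-range does not stall training; with gradient accumulation the sync is even rarer.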

I've never seen someone so confidently wrong.

orcinus avatar Oct 31 '25 11:10 orcinus