
Bus error finetuning whisper model in multi GPU instances

Open hitesh-ag1 opened this issue 2 years ago • 3 comments

Hi, I am trying to fine-tune Whisper according to the blog post here. The fine-tuning works great on a single GPU, but fails on multi-GPU instances: while executing trainer.train(), the multi-GPU run crashes with Bus error (core dumped).

I am working on a g5.12xlarge multi-GPU instance on AWS (AMI ID: ami-071323fe2bf59945b, Ubuntu). I would appreciate any guidance or suggestions to resolve this issue.
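One common cause of Bus error (core dumped) in multi-worker PyTorch data loading is an undersized /dev/shm (shared-memory) mount, which DataLoader workers use to pass batches between processes. That is an assumption about this failure, not something confirmed in the thread, but it is cheap to check with a small helper like the hypothetical `shm_free_gb` below:

```python
import shutil

# Hypothetical diagnostic (an assumption, not from this issue):
# report how much free space the shared-memory mount has, since
# PyTorch DataLoader workers place batch tensors in /dev/shm and
# exhausting it can surface as "Bus error (core dumped)".
def shm_free_gb(path="/dev/shm"):
    usage = shutil.disk_usage(path)  # (total, used, free) in bytes
    return usage.free / 1e9

print(f"/dev/shm free: {shm_free_gb():.1f} GB")
```

If shared memory does turn out to be small, common workarounds are enlarging the /dev/shm mount or setting `dataloader_num_workers=0` in the `TrainingArguments`.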

hitesh-ag1 avatar Dec 06 '23 13:12 hitesh-ag1

cc @sanchit-gandhi Any help would be greatly appreciated!

hitesh-ag1 avatar Dec 06 '23 13:12 hitesh-ag1

Happening to me as well.

dkrystki avatar Dec 11 '23 15:12 dkrystki

Hey @hitesh-ag1, sorry for the late reply here. Could you confirm that you're using exactly the same code as with single-GPU fine-tuning? Could you also provide the full stack trace for the error that you're getting? For reference, there's a multi-GPU example for Whisper fine-tuning that you can check out in the Transformers library.
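Multi-GPU runs of the Transformers examples are typically launched with torchrun rather than plain `python`, so one thing worth ruling out is a launch-command mismatch. The sketch below only assembles such a command as a list of strings; the script name, model, and flags are illustrative assumptions, not values taken from this issue:

```python
import shlex

# Hypothetical launcher sketch: build a torchrun invocation of the
# kind used for multi-GPU fine-tuning. Nothing here is executed;
# the function just returns the argv list for inspection.
def build_torchrun_cmd(nproc, script, **args):
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script]
    for key, value in args.items():
        cmd += [f"--{key}", str(value)]
    return cmd

cmd = build_torchrun_cmd(
    4,  # one process per GPU on a 4-GPU instance (assumed)
    "run_speech_recognition_seq2seq.py",  # assumed example script name
    model_name_or_path="openai/whisper-small",
    per_device_train_batch_size=8,
)
print(shlex.join(cmd))
```

The printed command is only a starting point; the actual flags should come from whichever fine-tuning script the single-GPU run used.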

sanchit-gandhi avatar Apr 02 '24 14:04 sanchit-gandhi