James McCall
I am using 8x A10 GPUs (g5.48xl) and have replicated the errors in defragment, both with the customization specified here ([Training on Other Instances](https://github.com/databrickslabs/dolly#training-on-other-instances)) and after downgrading deepspeed to deepspeed==0.8.3....
Apologies, I ran the training with 0.8.3 overnight and it was successful. When I changed the pin to deepspeed==0.8.3 in my requirements.txt, though, it still kept installing 0.9.1, perhaps due to wheel caching;...
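
For anyone hitting the same mismatch, here is a minimal sketch of a check that the installed version actually matches the pin before starting a long run. The helper names `parse_pin` and `installed_matches_pin` are hypothetical, not part of the dolly repo, and the parser only handles simple `name==version` lines:

```python
from importlib import metadata


def parse_pin(line: str):
    """Parse a simple 'name==version' requirements line.

    Hypothetical helper: ignores trailing comments and returns None for
    lines that are not exact pins (extras and markers are not handled).
    """
    line = line.split("#", 1)[0].strip()
    if "==" not in line:
        return None
    name, version = line.split("==", 1)
    return name.strip(), version.strip()


def installed_matches_pin(package: str, pinned: str) -> bool:
    """Return True only if `package` is installed at exactly `pinned`."""
    try:
        return metadata.version(package) == pinned
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    # The pin discussed in this thread.
    name, version = parse_pin("deepspeed==0.8.3")
    if installed_matches_pin(name, version):
        print(f"{name} {version} is installed")
    else:
        print(f"{name} is not installed at {version}; reinstall before training")
```

If the check fails because of a cached wheel, reinstalling with `pip install --no-cache-dir --force-reinstall deepspeed==0.8.3` may help, though I have not confirmed that caching was the cause here.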
@ahakanbaba I did try that and it still did not work (using A10 GPUs).