Anthony de Guia

24 comments by Anthony de Guia

Hello, just adding this information here: https://github.com/horovod/horovod/issues/3752#issuecomment-1285423023 If you are using CUDA 11.6, it might be better to downgrade NCCL. I was able to use horovod-v0.24.3 and horovod-v0.26.1 with PyTorch-v1.13.0...

@vincent341 you might need to align your NVIDIA driver version with your nvidia-gl library. ``` $ nvidia-smi Wed Mar 10 10:17:14 2021 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA...

> Hi, I meet the same question as you and really want to try your solutions. But I really get some problems. I know the time has passed for a...

Hi, I just want to check whether there will be an update on the review of this `mariadb` support for the create/drop action. Thank you.

Keeping track of this enhancement request. I'm also assessing multiple batch schedulers for Kubernetes for running ML workloads, and this is one interesting feature that's missing in Volcano compared to another...

Thank you very much for the very valuable feedback! > This basically copies the devices from the checkpointed container to the restored container. Yes, I guess we can work-around with...

Oh, thank you, Radostin. I was considering the external mapping config, but at first glance it does not look portable at larger scale. I guess I should...

Thank you for this information! I was able to create a single-node checkpoint and restore by specifying all possible NVIDIA external mounts in the config and letting the auto-detection...

I will close this issue now since I have no follow-up concerns. Thank you for the guidance!

Looks like this is related to #2131. I am also interested in looking into this, since I experienced the same error with a vLLM checkpoint. ``` (00.114439) vma 7535130cc000 borrows...