Anthony de Guia

moe.ph Philippines From Philippines, with love.

Results: 24 comments of Anthony de Guia

ncclCommInitRank failed: unhandled cuda error

Hello, just adding this information here: https://github.com/horovod/horovod/issues/3752#issuecomment-1285423023 If you are using CUDA 11.6, it might be better to downgrade NCCL. I was able to use horovod-v0.24.3 and horovod-v0.26.1 with PyTorch v1.13.0...

[WSL2 Ubuntu18.04 Cuda11.1] unable to fine EGL device for CUDA device 0

@vincent341 you might need to align your NVIDIA driver version with your nvidia-gl library. ``` $ nvidia-smi Wed Mar 10 10:17:14 2021 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA...

[WSL2 Ubuntu18.04 Cuda11.1] unable to fine EGL device for CUDA device 0

> Hi, I meet the same question as you and really want to try your solutions. But I really get some problems. I know the time has passed for a...

Add db:create and db:drop support for MariaDB

Hi, just checking whether there will be an update on the review of this `mariadb` support for the create/drop actions. Thank you.

[Enhancement Request] Enhance capacity scheduling

Keeping track of this enhancement request. I'm also assessing multiple batch schedulers for Kubernetes for running ML workloads, and this is one interesting feature that's missing in Volcano compared to another...

CRIU Checkpoint and Restore with CRI CDI and GPU Devices

Thank you very much for the very valuable feedback! > This basically copies the devices from the checkpointed container to the restored container. Yes, I guess we can work-around with...

CRIU Checkpoint and Restore with CRI CDI and GPU Devices

Ohh, thank you Radostin. I was considering the external mapping config, but at first glance it does not look portable at larger scale. I guess I should...

CRIU Checkpoint and Restore with CRI CDI and GPU Devices

Thank you for this information! I was able to create a single-node checkpoint and restore by specifying all possible NVIDIA external mounts in the config and let the auto-detection...

CRIU Checkpoint and Restore with CRI CDI and GPU Devices

I will close this issue now since I have no follow-up concerns. Thank you for the guidance!

Error (criu/proc_parse.c:479): anon_inode:[io_uring]

Looks like this is related to #2131. I am also interested in checking this concern, since I also experienced the same error with vLLM checkpoint. ``` (00.114439) vma 7535130cc000 borrows...
