glara76
Thank you for your reply. Since much of our work last year was based on a single node (GPT-3 on the TensorRT-LLM platform), we are now aiming to extend those implementations to multi-node...
The multinode run flow is shown below; let me know if you need anything else.

```
>> sbatch sbatch_multi_run_gpt.sh
```

**sbatch_multi_run_gpt.sh**

```bash
#!/bin/bash
#SBATCH --account=jychoi
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=debug
#SBATCH --job-name=multinode_GPT3
...
```
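As a quick sanity check for this kind of setup, a minimal sketch (assuming the job script eventually launches a Python entry point under `srun`; this helper is not part of the original run flow) that logs the SLURM layout each rank sees can confirm the 4-GPU-per-node allocation before debugging deeper:

```python
# sanity_check.py -- hypothetical helper, not part of the original script.
# Prints the SLURM-provided layout so each rank's node/GPU mapping is visible.
import os

for var in ("SLURM_JOB_NODELIST", "SLURM_NNODES", "SLURM_NTASKS",
            "SLURM_PROCID", "SLURM_LOCALID", "CUDA_VISIBLE_DEVICES"):
    print(f"{var}={os.environ.get(var)}")
```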
Hi @jinyangyuan-nvidia, thank you for your reply. I fixed the part you mentioned (changed build_config.auto_parallel_config.gpus_per_node from 8 to 4) and that error message has gone away for now. I'm not sure if it's gone completely, though,...
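For reference, a minimal sketch of where that setting lives (assuming a Python-side build script using `tensorrt_llm.BuildConfig`; adapt to however the engine is actually built in your flow):

```python
from tensorrt_llm import BuildConfig

build_config = BuildConfig()
# gpus_per_node must match the GPUs SLURM actually grants per node
# (4 here, per --gres=gpu:4), not the 8 it was previously set to.
build_config.auto_parallel_config.gpus_per_node = 4
```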
After resolving a separate issue, I've encountered the same "invalid device ordinal" error again while running TensorRT-LLM in a SLURM multinode environment. If code modification inside TensorRT-LLM is required, I...
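In case it helps narrow things down: "invalid device ordinal" in a SLURM multinode run is often a rank selecting a device by its global rank rather than its node-local rank. A minimal diagnostic sketch (assuming mpi4py and one task per GPU; the file name is illustrative, and this is not TensorRT-LLM code) that prints the mapping each rank would use:

```python
# device_check.py -- illustrative diagnostic, not part of TensorRT-LLM.
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
# SLURM_LOCALID is the rank's index within its node; with --ntasks-per-node=4
# it ranges 0-3 and is the safe device ordinal, unlike the global rank.
local_rank = int(os.environ["SLURM_LOCALID"])
print(f"rank={rank} would use device {local_rank} "
      f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')})")
```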