OpenChatKit
Cupy error while training (`CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal`)
Describe the bug The bash script to train the model fails with a CuPy error:
(OpenChatKit-Test) user@pc:~/OpenChatKit$ bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
Traceback (most recent call last):
  File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
    _PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
  File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
    cupy.cuda.Device(cuda_id).use()
  File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
  File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
  File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
(Seven of the eight launched worker processes print the same traceback, interleaved in the terminal; only rank 0 initializes its communicator.)
To Reproduce Steps to reproduce the behavior:
- Run the code on WSL-Ubuntu in a conda environment
- Run the bash script: bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
- The error above is produced
Expected behavior The training script is supposed to run.
Screenshots NA
Desktop (please complete the following information):
- OS: Windows 11
- Ubuntu-WSL
- Miniconda
- Nvidia GeForce 3060 (Could this be the issue?)
Additional context The earlier steps to download the data and weights also gave me errors. These steps:
python data/OIG/prepare.py
python pretrained/GPT-NeoX-20B/prepare.py
ended after a couple of minutes/hours with only the message "Killed". I was able to download the datasets with a simple wget command, but that seemed odd too.
Update: I was able to fix this particular error by limiting the script to just one pipeline. I was also able to run ~~python data/OIG/prepare.py~~ pretrained/GPT-NeoX-20B/prepare.py by forcing it to use the hard disk when GPU/CPU memory is limited. I will share this fix a little later, once I figure out how to run the rest of the scripts, as it may help others run this on lower-end hardware.
I believe this script can be tweaked to run on computers with lower specs than the currently stated minimum requirements, but this will need further investigation. I will be looking into this and will post an update soon.
But FOR NOW, the script crashes with just the message "Killed" and the line number in the bash script.
I was able to trace the error back to somewhere between lines 163-223 in training/pipeline_parallel/dist_gpipe_pipeline_async.py.
I will investigate this further and report back. In the meantime, if anybody knows what's going on with this, I'd appreciate the help.
Looks like the Nvidia GeForce 3060 has either 12GB or 8GB of VRAM. Unfortunately, I don't think you'll be able to train on this card. I think we normally require 8x A100 80GB GPUs to do a full training. @LorrinWWW, do you have any advice for training on lower-end hardware?
Also, @orangetin, can you tell me more about your fix for data/OIG/prepare.py? All it does is download data using git lfs and unzip files using the standard library's gzip. It shouldn't be touching the GPU.
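(For reference, the unzip step is just the standard library, roughly like the sketch below; the file names are placeholders, not actual paths from the OIG layout.)

import gzip
import shutil

# Decompress one downloaded shard; "example_shard.jsonl.gz" is a placeholder name.
with gzip.open("example_shard.jsonl.gz", "rb") as src, open("example_shard.jsonl", "wb") as dst:
    shutil.copyfileobj(src, dst)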
Thank you for the detailed bug report! The details were very helpful.
@csris correction: I fixed pretrained/GPT-NeoX-20B/prepare.py, not data/OIG/prepare.py. data/OIG/prepare.py ran just fine for me.
Every time I ran pretrained/GPT-NeoX-20B/prepare.py, it ran for a couple of minutes and then just printed "Killed". I figured out the issue: my computer was running out of memory. I traced the error back to line 27: model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16)
AutoModelForCausalLM.from_pretrained can accept two more arguments: device_map="auto", offload_folder="SOME_FOLDER". This forces transformers to use the hard disk as offload storage when there isn't enough RAM.
So, to anyone trying to run this on lower-end hardware with not enough RAM, change line 27 of pretrained/GPT-NeoX-20B/prepare.py, to model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder="SOME_FOLDER") and replace SOME_FOLDER with an existing but empty directory.
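For reference, here is a minimal standalone sketch of the change. The model name and offload directory below are placeholders, and device_map="auto" requires the accelerate package to be installed:

import torch
from transformers import AutoModelForCausalLM

# Placeholder for args.model_name in prepare.py; adjust to whatever the script actually uses.
model_name = "EleutherAI/gpt-neox-20b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",         # let accelerate place layers on GPU/CPU and spill the rest
    offload_folder="offload",  # placeholder path; use an existing, empty directory with free disk space
)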
I'm currently trying to get training running. I was able to go through 3 layers of training before it crashed (out of memory). Tweaking the PyTorch configuration should eliminate this issue. I have traced back the source of the crash, I'll report back when it works. I don't believe the minimum requirements listed should be quite this high, granted the code for bot.py does seem bloated.
Unfortunately, I'm a college student and, as of right now, can't afford 8x A100 80GB GPUs, but I'm determined to make this work XD. I was able to run pretty large models on just a CPU so I think this should be possible with the GeForce.
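One PyTorch-side knob that might help with the fragmentation-related OOMs (an untested guess on my part, not something verified in this thread) is the caching allocator's split size, e.g.:

import os

# Must be set before the first CUDA allocation; 128 MB is only an example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the variable so the allocator picks it up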
@orangetin, are you on our Discord server (https://discord.gg/7fDdZNwA)? I'd like to chat with you about this effort.
So, to anyone trying to run this on lower-end hardware with not enough RAM, change line 27 of pretrained/GPT-NeoX-20B/prepare.py, to model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder="SOME_FOLDER") and replace SOME_FOLDER with an existing but empty directory.
Makes sense and might be a good change even on systems with a lot of RAM. Mind submitting a PR? I'll try it out on one of our machines.
I'm currently trying to get training running. I was able to go through 3 layers of training before it crashed (out of memory). Tweaking the PyTorch configuration should eliminate this issue. I have traced back the source of the crash, I'll report back when it works. I don't believe the minimum requirements listed should be quite this high, granted the code for bot.py does seem bloated.
That's really impressive! If you get this working, definitely mention this in the #openchatkit channel on the Discord server. There have been lots of people trying to make this work on lower-end hardware.
Makes sense and might be a good change even on systems with a lot of RAM. Mind submitting a PR? I'll try it out on one of our machines.
Yup, I'll submit the PR soon.
are you on our Discord server (https://discord.gg/7fDdZNwA)? I'd like to chat with you about this effort.
I'm on the server, I can send you a message.
I hit this error too and changed line 27 of pretrained/GPT-NeoX-20B/prepare.py to model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder="SOME_FOLDER"), also running on WSL. For me, some of the resulting .pt files were broken (e.g. pytorch_lm_head.pt), so training still failed. A workaround that worked for me was to run prepare.py on Windows to download the .pt files and then move them into my WSL. With 32GB of RAM I am able to run the script.
# vi training/finetune_GPT-NeoXT-Chat-Base-20B.sh
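# Only rank 0 on --cuda-id 0 is launched below; the other ranks, which referenced
# GPU ordinals that don't exist on a single-GPU machine, are commented out.
# (Presumably the pipeline group size in ARGS also needs to match the number of
# ranks actually launched.)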
(trap 'kill 0' SIGINT; \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
& \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
# & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 2 --rank 2 \
# & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 3 --rank 3 \
# & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 4 --rank 4 \
# & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 5 --rank 5 \
# & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 6 --rank 6 \
# & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 7 --rank 7 \
# & \
wait)
Fixed.
For training: Invalid CUDA ID followed by OOM error. Solved by fixing CUDA IDs and using a GPU with the required amount of VRAM for training.
For downloading model: Solved in #63 by offloading parts of the model to disk.
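For anyone else hitting the invalid device ordinal error, here is a quick standalone check (not part of the repo) that shows what goes wrong when --cuda-id exceeds the number of visible GPUs:

import cupy

# Number of CUDA devices actually visible to this process.
n_devices = cupy.cuda.runtime.getDeviceCount()
print(f"visible CUDA devices: {n_devices}")

# nccl_backend.py calls cupy.cuda.Device(cuda_id).use(); any cuda_id >= n_devices
# raises cudaErrorInvalidDevice, which is what ranks 1-7 hit on a single-GPU box.
for cuda_id in range(8):
    try:
        cupy.cuda.Device(cuda_id).use()
        print(f"cuda-id {cuda_id}: OK")
    except cupy.cuda.runtime.CUDARuntimeError as exc:
        print(f"cuda-id {cuda_id}: {exc}")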
I see too that there is an argument for --offload-dir.
I see too that there is an argument for --offload-dir.
@joecodecreations Yes, that argument was added in the PR mentioned above.
Fixed.
For training: Invalid CUDA ID followed by OOM error. Solved by fixing CUDA IDs and using a GPU with the required amount of VRAM for training.
For downloading model: Solved in #63 by offloading parts of the model to disk.
How much VRAM did you end up having to use for training?