OpenChatKit
python inference/bot.py Killed
git clone https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B
run python inference/bot.py --model GPT-NeoXT-Chat-Base-20B
Loading GPT-NeoXT-Chat-Base-20B to cuda:0...
Killed
run python inference/bot.py
OSError: Can't load the configuration of '/root/test/OpenChatKit-main/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/root/test/OpenChatKit-main/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B' is the correct path to a directory containing a config.json file
so cp -r GPT-NeoXT-Chat-Base-20B huggingface_models/
root@msi:~/test/OpenChatKit-main# python inference/bot.py
Loading /root/test/OpenChatKit-main/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...
Killed
I am confused. It is running in Docker; does the GPU not have enough video memory?
Same thing here, but I think it errors out before that. How much GPU memory is required to run this model?
I don't know; my GPU has 11 GB of video memory, and I ran nvidia-smi -l 1 and didn't see any change.
So I also think something might have gone wrong before the GPU was called.
I am experiencing the same thing when I try to load the model from Hugging Face via
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B")
I am trying to run the model on two A100 40GB cards.
You may want to check the dmesg output.
I ran into this issue when I requested the wrong amount of memory from Slurm.
dmesg gave the following error:
Memory cgroup out of memory: Killed process 3574895 (python) total-vm:16039436kB......
So I requested a larger memory allocation and it works. I am running the togethercomputer/GPT-NeoXT-Chat-Base-20B model, the default bot.py, and the default max-tokens (128). It takes 86.4 GB of system memory and 40.7 GB of GPU memory to load the model. The GPU memory usage jumps to 66.84 GB after playing with it for a while. It is an 80GB A100.
You'll need more than 40GB of VRAM to run the model. An 80GB A100 is definitely enough. A 48GB A40 might work, but that might be cutting it a little close.
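As a rough, unofficial back-of-the-envelope check of that figure: a 20-billion-parameter model in fp16 needs about 2 bytes per parameter just for the weights, before any activations or KV cache.

```python
# Rough, unofficial estimate of the memory needed just to hold the weights in fp16.
params = 20e9          # GPT-NeoXT-Chat-Base-20B has roughly 20 billion parameters
bytes_per_param = 2    # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"~{weights_gb:.0f} GB for weights alone")  # ~37 GB, before activations/KV cache
```

That lines up with the ~40 GB reported above to load the model, and the KV cache grows with every prompt, which explains the later jump in GPU memory usage.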
Some folks on Discord have had success running on 2x A100 40GB cards using Hugging Face's accelerate. I'm hoping to add some sample code to the repo to run inference on multiple cards.
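Until that sample code lands, here is a minimal sketch of what that multi-GPU setup might look like with transformers and accelerate. This is my own untested example, not the repo's code; device_map="auto" and the <human>/<bot> prompt format are assumptions based on the Hugging Face docs and the model card.

```python
# Minimal sketch (untested): shard the 20B model across all visible GPUs with accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 weights: ~40 GB total, split across the cards
    device_map="auto",          # requires `pip install accelerate`; spreads layers over both GPUs
)

prompt = "<human>: Hello, how are you?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```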
So I also think something might have gone wrong before the GPU was called.
@nickvazz @sherlockzym You're right. The model never got to your GPU; as @lokikl points out, you most likely ran out of RAM. This has been fixed, and loading the model now takes significantly less memory.
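If you are loading the checkpoint yourself through transformers rather than bot.py, a sketch along these lines (my assumption, not necessarily the exact fix in the repo) keeps peak system RAM down by not materialising a full fp32 copy first:

```python
# Sketch (assumption, not the repo's exact fix): load the weights directly in fp16
# and stream them in, instead of building the model twice in fp32 in system RAM.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    torch_dtype=torch.float16,   # skip the fp32 default
    low_cpu_mem_usage=True,      # requires accelerate; loads the state dict incrementally
)
```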
I don't know; my GPU has 11 GB of video memory, and I ran nvidia-smi -l 1 and didn't see any change.
I haven't tested this on 11 GB of VRAM yet, but I don't see why it shouldn't work. Could you add
-g 0:8 -r MAXRAM
where MAXRAM is the maximum amount of CPU RAM you'd like to allocate to the model. Note: if VRAM + MAXRAM is less than the size of the model (40 GB for GPT-NeoXT-Chat-Base-20B), the rest of the model will be offloaded to a folder named offload on your disk. I've set the max to 8 GB of VRAM on CUDA device 0 in this example because each new prompt adds to the amount of VRAM occupied. You can increase this value (8) if you don't run into OOM issues.
Note, this can be quite slow.
Could you test this out and report back with your results?
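For anyone loading through transformers directly instead of bot.py, the same VRAM / CPU RAM / disk split can be approximated with accelerate's max_memory and offload_folder options. The values below only mirror the -g 0:8 -r MAXRAM example; the CPU figure is illustrative, not a recommendation.

```python
# Sketch: cap CUDA device 0 at 8 GiB and CPU RAM at a chosen limit, offloading the rest to disk.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "8GiB", "cpu": "24GiB"},  # "cpu" plays the role of MAXRAM here
    offload_folder="offload",                # whatever doesn't fit spills to this folder
)
```

As noted above, disk offload is much slower than keeping the whole model in VRAM.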