WizardLM
How can I use multiple GPUs for inference?
Here is my GPU info:
GPU info: H800 * 8
CUDA: 11.8
nvidia-smi
Mon Sep 4 10:39:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P2    65W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Graphics...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   29C    P2    67W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Graphics...  Off  | 00000000:48:00.0 Off |                    0 |
| N/A   30C    P2    67W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
Here is my inference code:
from peft import PeftModel
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer
import torch

# create tokenizer
base_model = "/home/WizardCoder-15B-V1.0/"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# base model, sharded across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# LoRA PEFT adapters
adapter_model = "/home/adapter_model"
model = PeftModel.from_pretrained(
    model,
    adapter_model,
    # torch_dtype=torch.float16,
)
model.eval()

# prompt (in English: "Write a SQL statement that uses the dual table to show the current time")
prompt = "请写一个sql, 使用dual表查看当前时间"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs["input_ids"].to('cuda')

# Generate
generate_ids = model.generate(input_ids=inputs, max_new_tokens=30)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
Here is how I run it, and the result:
(base) [root@localhost WizardCoder]# CUDA_VISIBLE_DEVICES=6,7 /home/wxp/python/pythonwizard/bin/python3 DEMCoder_test_v2.py
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
请写一个sql, 使用dual表查看当前时间戚Fartherthanrrayrraypadder殊 Provide
Why does this happen? How can I use multiple GPUs for inference? Please help. Thanks!
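The warning in the run output says that no `attention_mask` or `pad_token_id` was passed to `generate`. Below is a minimal sketch of a call that passes both. `generate_text` is a hypothetical helper name, not something from the WizardLM repo, and taking the device of the model's first parameter as the input device is an assumption that usually holds when the model is loaded with `device_map="auto"`. This addresses the warning only; it may or may not have anything to do with the garbled tokens.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=30):
    """Generate a completion, passing attention_mask and pad_token_id explicitly.

    `model` and `tokenizer` are assumed to be the objects built in the
    script above (a causal LM, optionally wrapped by PEFT, and its tokenizer).
    """
    # The tokenizer output holds both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")

    # Assumption: the first parameter (normally the embedding) sits on the
    # device that should receive the inputs when the model is sharded.
    device = next(model.parameters()).device
    inputs = inputs.to(device)

    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # same fallback the warning describes
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

With the script above, the last three lines could become `print(generate_text(model, tokenizer, prompt))`.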
When I used the inference code you provided, there was still a problem. The command was:
CUDA_VISIBLE_DEVICES=6,7 python src\inference_wizardcoder.py \
--base_model "WizardCoder-15B-V1.0" \
--input_data_path "data.jsonl" \
--output_data_path "result.jsonl"
The file at input_data_path looks like this:
{"idx": 11, "Instruction": "Write a Python code to count 1 to 10."}
I get a garbled result:
{"id": 0, "instruction": "Write a Python code to count 1 to 10.", "wizardcoder": "```pythonrrays =rrayrrayss = ArraysWithsWithoutrraysss = = ityityrrayrrayss = ArraysWithsWithsWithout including including includingCOMMCOMMCOCOapodsapodsuppeuppeanoanoanoanoanoanoanoanoanorrayrrayrrayrrayrrays =ViewDatailsabout howabout how howaboutaboutrrayrrayrrayrrayss = = =cutcutcutrrayCOUNTCOUNTCOUNTetcetcetc"}
Why?
CUDA_VISIBLE_DEVICES=6,7 python src\inference_wizardcoder.py \
--base_model "WizardCoder-15B-V1.0" \
--input_data_path "data.jsonl" \
--output_data_path "result.jsonl"
This works fine on our machine. Which version of transformers do you use?
@ChiYeungLaw
Here is my package info:
Package Version
------------------------ ----------
accelerate 0.20.3
aiofiles 23.2.1
aiohttp 3.8.5
aiosignal 1.3.1
annotated-types 0.5.0
async-timeout 4.0.3
attrs 23.1.0
black 23.3.0
certifi 2023.5.7
charset-normalizer 3.1.0
cmake 3.26.4
dataclasses-json 0.5.14
filelock 3.12.2
fire 0.5.0
flake8 6.0.0
frozenlist 1.4.0
fsspec 2023.6.0
greenlet 2.0.2
h11 0.9.0
html5tagger 1.3.0
httpcore 0.11.1
httptools 0.6.0
httpx 0.15.4
huggingface-hub 0.15.1
idna 3.4
Jinja2 3.1.2
langchain 0.0.271
langsmith 0.0.26
lit 16.0.6
MarkupSafe 2.1.3
marshmallow 3.20.1
mccabe 0.7.0
mpmath 1.3.0
multidict 5.2.0
mypy-extensions 1.0.0
networkx 3.1
numexpr 2.8.5
numpy 1.25.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
packaging 23.1
pathspec 0.11.1
pip 23.1.2
psutil 5.9.5
pydantic 2.2.1
pydantic_core 2.6.1
pyflakes 3.0.1
PyYAML 6.0
regex 2023.6.3
requests 2.31.0
rfc3986 1.5.0
safetensors 0.3.1
sanic 20.12.6
sanic-routing 23.6.0
setuptools 58.1.0
six 1.16.0
sniffio 1.3.0
SQLAlchemy 2.0.20
sympy 1.12
tenacity 8.2.3
termcolor 2.3.0
tokenizers 0.13.3
torch 2.0.1
torch-tb-profiler 0.4.1
tqdm 4.65.0
tracerite 1.1.0
transformers 4.29.0
triton 2.0.0
typing_extensions 4.6.3
typing-inspect 0.9.0
ujson 5.8.0
urllib3 1.26.7
utils 1.0.1
uvloop 0.17.0
websockets 9.1
wheel 0.40.0
yarl 1.9.2
@ChiYeungLaw Can you tell me about the machine you are testing on? For example, the GPU, CUDA version, and Python version. Thanks!
torch==2.0.1
transformers==4.29.2
2xV100 32GiB
python==3.10
cuda==11.4
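To compare two environments like the ones listed in this thread, the installed versions can be printed in the same `name==version` form. This is a small sketch using only the standard library; `collect_versions` is a hypothetical helper name, and the package list is simply the inference-related subset of the packages mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


if __name__ == "__main__":
    names = ["torch", "transformers", "accelerate", "tokenizers"]
    for name, ver in collect_versions(names).items():
        print(f"{name}=={ver}")
```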
@ChiYeungLaw My guess is that the problem is caused by a compatibility issue between CUDA 11.8 and the H800 GPUs. The server I rented has expired, so I cannot verify the transformers question for now. Another question: the WizardCoder-15B-V1.0 model is roughly 32 GB. If we load it on a single machine with multiple GPUs (2 × V100), will the memory consumption on each card be around 32 GB / 2 = 16 GB? Could you verify this, and post the GPU memory usage for a single-GPU run and for a 2-GPU run? Thank you!
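The 32 GB / 2 = 16 GB estimate can be written out and checked against a real run with a sketch like the one below. Both function names are hypothetical. The even split is a lower bound under the assumption that only the weights count: activations, the KV cache, and the CUDA context add to it, and `device_map="auto"` is not guaranteed to split the layers exactly evenly. `nvidia-smi` will also typically report more than `torch.cuda.memory_allocated`, since it includes memory PyTorch has reserved but not assigned to tensors.

```python
def even_split_gb(total_weights_gb, n_gpus):
    """Weights per GPU if the model were split perfectly evenly."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return total_weights_gb / n_gpus


def allocated_per_gpu_gib():
    """Memory currently held by PyTorch tensors on each visible GPU, in GiB."""
    import torch  # imported lazily so the arithmetic above runs without PyTorch

    return [
        torch.cuda.memory_allocated(i) / 2**30
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    # The case from the question: roughly 32 GB of weights on 2 GPUs.
    print(even_split_gb(32, 2))  # 16.0
```

Calling `allocated_per_gpu_gib()` right after loading the model, once with one visible GPU and once with two, would give the comparison asked for here.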