How to change bfloat16 to float16 (can I only use GPUs with compute capability >= 8.0 to run this repo?)
When I run the pipeline:
python run_exp.py --method_name 'naive' \
--split 'test' \
--dataset_name 'nq' \
--gpu_id '0,1,2,3'
I get the error below. The last line says I need to append --dtype=half in the CLI, so I think somewhere the code is using bfloat16 rather than float16:
INFO 07-28 00:30:38 config.py:715] Defaulting to use mp for distributed inference
INFO 07-28 00:30:38 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/smalldog/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/home/smalldog/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/smalldog/Meta-Llama-3-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-28 00:30:38 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242034) INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=242034) INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242036) INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=242036) INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242035) INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=242035) INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242035) Process VllmWorkerProcess:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
I am using 4 T4 GPUs.
I am not sure where to add --dtype=half to the command, or which source file to modify, to fix this issue. Could you please help me? Thank you in advance!
In the implementation of the vLLM generator, the default setting for dtype is 'auto'. The following is the explanation of this parameter from vLLM:
The data type for the model weights and activations. Currently, we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
the `torch_dtype` attribute specified in the model config file. However, if the `torch_dtype` in the config is `float32`, we will use `float16` instead.
FlashRAG currently does not support setting the dtype for model loading via the config file. You can achieve this by modifying where the vLLM model is loaded in our source code:
- In lines 164-167 of flashrag/generator/generator.py, add dtype='float16' (see the sketch below).
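For reference, a minimal sketch of what that change looks like, assuming the vLLM model is built with vllm.LLM. The argument names other than dtype are placeholders, not the exact FlashRAG code:

```python
from vllm import LLM

# Sketch only: everything except dtype stands in for whatever
# FlashRAG already passes when constructing the vLLM model.
self.model = LLM(
    model=self.model_path,
    tensor_parallel_size=self.tensor_parallel_size,
    dtype='float16',  # force fp16 so GPUs below compute capability 8.0 (e.g. T4) work
)
```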
Hello,
Thanks for your answer above! It solved the bfloat16 problem.
However, I ran into another issue:
(VllmWorkerProcess pid=4500) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I am not quite sure why this error occurs. I tried modifying run_exp.py to call mp.set_start_method('spawn', force=True) before the main call, but that did not work.
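Roughly what I tried (a sketch; the actual run_exp.py entry point may look different):

```python
import multiprocessing as mp

if __name__ == '__main__':
    # Attempted fix: force the 'spawn' start method before anything touches CUDA.
    mp.set_start_method('spawn', force=True)
    main()  # the existing entry point in run_exp.py
```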
Can you help me with this? Thank you!
PS: a quick question: what GPU do you recommend for running the repo? Is one A100 40G enough, or do I need an A100 80G?
It seems to be a vLLM-related issue; you can try the following two solutions:
- Check for vLLM-related processes still running on the system and kill them (sometimes vLLM is not shut down correctly), then restart run_exp.py.
- Try setting export VLLM_WORKER_MULTIPROC_METHOD=spawn (from this vLLM issue: https://github.com/vllm-project/vllm/issues/6152); see the sketch below.
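If you prefer not to export the variable in the shell, the same effect should be achievable from Python, provided it is set before vLLM is imported (a sketch, not tested against FlashRAG):

```python
import os

# Must run before `import vllm` / before FlashRAG builds the vLLM generator,
# otherwise the worker processes may already be configured to fork.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```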
For a 7B/13B model, an A100 40G is enough.
Hi,
Thanks again for your quick reply. I tried your second suggestion above, and it works.
I am not 100% sure, but I think we need to run export VLLM_WORKER_MULTIPROC_METHOD=spawn whenever the number of GPUs is greater than 1.
Thank you for the feedback. We did not encounter this problem when testing vLLM before. I will take a closer look and add it to the code if necessary.