
How to change bfloat16 to float16 (can I only use a GPU with compute capability >= 8.0 to run this repo?)

Open codecodebear opened this issue 1 year ago • 5 comments

When I run the pipeline:

python run_exp.py --method_name 'naive' \
                  --split 'test' \
                  --dataset_name 'nq' \
                  --gpu_id '0,1,2,3'

I get the error below. The last line says I need to append --dtype=half on the CLI, so I think somewhere the code is using bfloat16 rather than float16.

INFO 07-28 00:30:38 config.py:715] Defaulting to use mp for distributed inference
INFO 07-28 00:30:38 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/smalldog/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/home/smalldog/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/smalldog/Meta-Llama-3-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-28 00:30:38 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242034) INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=242034) INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242036) INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=242036) INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242035) INFO 07-28 00:30:38 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=242035) INFO 07-28 00:30:38 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=242035) Process VllmWorkerProcess:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.

I am using 4 T4 GPUs.

I am not sure where to add --dtype=half to the command, or which source file to modify, to fix this issue. Could you please help me? Thank you in advance!

codecodebear avatar Jul 28 '24 02:07 codecodebear

In the implementation of the vLLM generator, the default setting for dtype is 'auto'. The following is the explanation of this parameter in vLLM:

The data type for the model weights and activations. Currently, we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
the `torch_dtype` attribute specified in the model config file. However, if the `torch_dtype` in the config is `float32`, we will use `float16` instead.

FlashRAG currently does not support setting the dtype for model loading in the config files. You can achieve this by modifying how the vLLM model is loaded in our source code:
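
For reference, a minimal sketch of what that change could look like. The exact file and class names in FlashRAG may differ from this illustration; the key point is passing dtype="half" (or "float16") to the vllm.LLM constructor instead of leaving it at the default "auto":

# Illustrative sketch only -- adjust to wherever FlashRAG constructs its vLLM model.
from vllm import LLM

llm = LLM(
    model="/home/smalldog/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    dtype="half",  # force float16; "auto" picks up bfloat16 from the model's torch_dtype
)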

ignorejjj avatar Jul 28 '24 02:07 ignorejjj

Hello,

Thanks for your answer above! It solved the bfloat16 problem.

However, I encountered another issue:

(VllmWorkerProcess pid=4500) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I am not quite sure why this error occurs. I tried modifying the code inside run_exp.py to call mp.set_start_method('spawn', force=True) before the main call, but that didn't work.

Can you help me with this? Thank you!

PS: a quick question: what GPU do you recommend for running the repo? Is one A100 40G enough, or do I need an A100 80G?

codecodebear avatar Jul 28 '24 04:07 codecodebear

This seems to be a vLLM-related issue; you can try the following two solutions:

  1. Check for vLLM-related processes still running on the system and kill them (sometimes vLLM does not shut down correctly), then restart run_exp.py.
  2. Try setting export VLLM_WORKER_MULTIPROC_METHOD=spawn (from this vLLM issue: https://github.com/vllm-project/vllm/issues/6152); see the sketch after this list for an in-script alternative.
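
If setting the shell variable is inconvenient, an equivalent alternative (an assumption on my part, not something FlashRAG does for you) is to set it from Python at the very top of run_exp.py, before vLLM is imported or the engine is created:

import os

# Must be set before vLLM spawns its worker processes; vLLM reads this
# environment variable when it starts the multiprocessing workers.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

# ... remaining run_exp.py imports and the main() call ...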

For a 7B/13B model, an A100 40G is enough.

ignorejjj avatar Jul 28 '24 05:07 ignorejjj

Hi,

Thanks again for your quick reply. I tried your second suggestion above, and it works.

I am not 100% sure, but I think we need to run export VLLM_WORKER_MULTIPROC_METHOD=spawn whenever the number of GPUs is greater than 1.

codecodebear avatar Jul 29 '24 01:07 codecodebear

Thanks for trying it out. We didn't encounter this problem when testing vLLM before. I will take a closer look and add it to the code if necessary.

ignorejjj avatar Jul 29 '24 02:07 ignorejjj