
Error when using a T4 GPU: half-precision support required

Open TimoZhou1024 opened this issue 7 months ago • 7 comments

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the dtype flag in CLI, for example: --dtype=half.

Appending --dtype=half to the cli command does not pass the parameter through.
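(Side note: the --dtype=half hint in the message comes from vLLM and refers to vLLM's own dtype setting, which weclone-cli does not forward, hence appending it here has no effect. The hardware limit itself can be confirmed with a minimal PyTorch sketch like the one below, run on the same machine.)

import torch

# The ValueError above comes from this hardware check: bf16 needs compute capability >= 8.0.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")  # 7.5 on a Tesla T4, hence the error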

TimoZhou1024 avatar May 16 '25 19:05 TimoZhou1024

Which library is raising this error?

xming521 avatar May 17 '25 00:05 xming521

[WeClone] I | 19:45:41 | 开始使用llm对数据打分
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,398 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-05-16 19:45:46,922 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-05-16 19:45:46,923 >> loading configuration file ./Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-05-16 19:45:46,924 >> Model config Qwen2Config { "_name_or_path": "./Qwen2.5-7B-Instruct", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 32768, "max_window_layers": 28, "model_type": "qwen2", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sliding_window": 131072, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.49.0", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }

[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:46,925 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-05-16 19:45:47,495 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-05-16 19:45:47,591 >> loading configuration file ./Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:697] 2025-05-16 19:45:47,591 >> loading configuration file ./Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-05-16 19:45:47,592 >> Model config Qwen2Config { "_name_or_path": "./Qwen2.5-7B-Instruct", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 32768, "max_window_layers": 28, "model_type": "qwen2", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sliding_window": 131072, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.49.0", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }

[INFO|image_processing_auto.py:301] 2025-05-16 19:45:47,596 >> Could not locate the image processor configuration file, will try to use the model config instead.
INFO 05-16 19:45:59 [config.py:585] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
WARNING 05-16 19:45:59 [arg_utils.py:1854] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 05-16 19:45:59 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='./Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='./Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=./Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-05-16 19:45:59,661 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-05-16 19:46:00,034 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:1093] 2025-05-16 19:46:00,147 >> loading configuration file ./Qwen2.5-7B-Instruct/generation_config.json
[INFO|configuration_utils.py:1140] 2025-05-16 19:46:00,147 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }

INFO 05-16 19:46:01 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-16 19:46:01 [cuda.py:288] Using XFormers backend.
Traceback (most recent call last):
  File "/usr/local/bin/weclone-cli", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/content/WeClone/weclone/cli.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/content/WeClone/weclone/cli.py", line 47, in qa_generator
    processor.main()
  File "/content/WeClone/weclone/data/qa_generator.py", line 98, in main
    self.clean_strategy.judge(qa_res)
  File "/content/WeClone/weclone/data/clean/strategies.py", line 46, in judge
    outputs = infer(
  File "/content/WeClone/weclone/core/inference/vllm_infer.py", line 128, in infer
    results = LLM(**engine_args).generate(inputs, sampling_params, lora_request=lora_request)
  File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 1037, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
    return engine_cls.from_vllm_config(
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
    return cls(
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 280, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
    self.collective_rpc("init_device")
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2255, in run_method
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 604, in init_device
    self.worker.init_device()  # type: ignore
  File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 157, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 526, in _check_if_gpu_supports_dtype
    raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the dtype flag in CLI, for example: --dtype=half.

TimoZhou1024 avatar May 17 '25 03:05 TimoZhou1024

The free T4 GPU on Tencent Cloud throws the same error.

Murphy-ZZH avatar May 18 '25 14:05 Murphy-ZZH

Try setting enable_clean to false in the config so the dataset is not cleaned.

xming521 avatar May 19 '25 02:05 xming521

The root cause is that bfloat16 (BF16) is a relatively new floating-point format that requires GPU compute capability >= 8.0 for native support; the T4, and I believe the V100 as well, don't have it. Try adding the parameter below in settings.jsonc, then rerun weclone-cli make-dataset. (A small sketch after the snippet shows how to pick the dtype automatically.)

"infer_args": { "repetition_penalty": 1.2, "temperature": 0.5, "max_length": 50, "top_p": 0.65, "infer_dtype": "float16" // 添加这一行 }

BAIKEMARK avatar May 23 '25 13:05 BAIKEMARK


If that still doesn't work, edit weclone/data/clean/strategies.py directly: find the code below and insert the change. I've tested this approach and it works. If it's convenient, please also help test whether the settings.jsonc parameter added above works. (A standalone vLLM sketch of the same override follows the snippet.)

outputs = vllm_infer(
    inputs,
    self.make_dataset_config["model_name_or_path"],
    template=self.make_dataset_config["template"],
    temperature=0,
    guided_decoding_class=QaPairScore,
    repetition_penalty=1.2,
    bad_words=[r"\n"],
    vllm_config=json.dumps({"dtype": "float16"})  # pass the result of this expression directly here
)
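For context, whichever path you take, the override ultimately has to reach vLLM's engine as dtype="float16". A minimal standalone sketch of that override using vLLM's own Python API (independent of WeClone's vllm_infer wrapper, and assuming the weights sit in ./Qwen2.5-7B-Instruct as in the logs above):

from vllm import LLM, SamplingParams

# Force float16 so the engine loads on pre-Ampere GPUs (T4, V100) without native bf16.
llm = LLM(
    model="./Qwen2.5-7B-Instruct",  # path taken from the logs above; adjust to your setup
    dtype="float16",                # the override the ValueError asks for (--dtype=half)
    trust_remote_code=True,
)
params = SamplingParams(temperature=0, repetition_penalty=1.2, max_tokens=64)
outputs = llm.generate(["你好,请介绍一下你自己。"], params)
print(outputs[0].outputs[0].text)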

BAIKEMARK avatar May 23 '25 13:05 BAIKEMARK

The root cause is that bfloat16 (BF16) is a relatively new floating-point format that requires GPU compute capability >= 8.0 for native support; the T4, and I believe the V100 as well, don't have it. Try adding the parameter below in settings.jsonc, then rerun weclone-cli make-dataset.

"infer_args": { "repetition_penalty": 1.2, "temperature": 0.5, "max_length": 50, "top_p": 0.65, "infer_dtype": "float16" // 添加这一行 }

I tested on Colab: after modifying settings.jsonc the same error still occurs, but editing strategies.py directly works fine.

duolanda avatar May 25 '25 15:05 duolanda