WARNING 05-19 23:02:54 [interface.py:303] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 05-19 23:02:54 [cuda.py:291] Using Flash Attention backend.
INFO 05-19 23:02:55 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-19 23:02:55 [model_runner.py:1110] Starting to load model ./Qwen2.5-7B-Instruct...
INFO 05-19 23:02:56 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [01:02<03:06, 62.09s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [02:03<02:03, 61.51s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [03:03<01:00, 60.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [04:02<00:00, 60.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [04:02<00:00, 60.63s/it]
INFO 05-19 23:06:58 [model_runner.py:1146] Model loading took 5.2045 GB and 243.222897 seconds
INFO 05-19 23:07:01 [worker.py:267] Memory profiling takes 2.10 seconds
INFO 05-19 23:07:01 [worker.py:267] the current vLLM instance can use total_gpu_memory (15.99GiB) x gpu_memory_utilization (0.90) = 14.39GiB
INFO 05-19 23:07:01 [worker.py:267] model weights take 5.20GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 7.73GiB.
INFO 05-19 23:07:01 [executor_base.py:111] # cuda blocks: 9048, # CPU blocks: 4681
INFO 05-19 23:07:01 [executor_base.py:116] Maximum concurrency for 3072 tokens per request: 47.12x
INFO 05-19 23:07:02 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:31<00:00, 1.13it/s]
INFO 05-19 23:07:33 [model_runner.py:1570] Graph capturing finished in 31 secs, took 0.64 GiB
INFO 05-19 23:07:33 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 34.62 seconds
INFO 05-19 23:07:34 [xgrammar_decoding.py:191] Qwen model detected, consider set guided_backend=xgrammar:disable-any-whitespace to prevent runaway generation of whitespaces.
Processed prompts: 0%| | 0/689 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/bin/weclone-cli", line 10, in
[rank0]: sys.exit(cli())
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in call
[rank0]: return self.main(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
[rank0]: rv = self.invoke(ctx)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
[rank0]: return _process_result(sub_ctx.command.invoke(sub_ctx))
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
[rank0]: return ctx.invoke(self.callback, **ctx.params)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
[rank0]: return callback(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/weclone/cli.py", line 26, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/weclone/cli.py", line 47, in qa_generator
[rank0]: processor.main()
[rank0]: File "/mnt/e/Documents/copyme/WeClone/weclone/data/qa_generator.py", line 98, in main
[rank0]: self.clean_strategy.judge(qa_res)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/weclone/data/clean/strategies.py", line 46, in judge
[rank0]: outputs = infer(
[rank0]: File "/mnt/e/Documents/copyme/WeClone/weclone/core/inference/vllm_infer.py", line 132, in infer
[rank0]: results = LLM(**engine_args).generate(inputs, sampling_params, lora_request=lora_request)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/utils.py", line 1072, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 465, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1375, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1434, in step
[rank0]: outputs = self.model_executor.execute_model(
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 139, in execute_model
[rank0]: output = self.collective_rpc("execute_model",
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/utils.py", line 2255, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
[rank0]: output = self.model_runner.execute_model(
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1798, in execute_model
[rank0]: output: SamplerOutput = self.model.sample(
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 480, in sample
[rank0]: next_tokens = self.sampler(logits, sampling_metadata)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 262, in forward
[rank0]: logits = apply_penalties(logits, sampling_tensors.prompt_tokens,
[rank0]: File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/utils.py", line 52, in apply_penalties
[rank0]: logits[logits <= 0] *= torch.where(prompt_mask | output_mask,
[rank0]: RuntimeError: CUDA error: unknown error
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Processed prompts: 0%| | 0/689 [02:14<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[rank0]:[W519 23:11:30.800708962 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
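The graph-capture message and the TORCH_CUDA_ARCH_LIST warning above each point at a setting I could try next. Below is a minimal sketch of how I would test them by calling vLLM directly instead of going through weclone-cli (the model path, quantization and memory settings are copied from the log above; the arch string and sampling values are placeholders, and this is not the actual engine_args that WeClone builds in weclone/core/inference/vllm_infer.py):

```python
# Hypothetical standalone test, not the WeClone code path.
import os

# The UserWarning in the log asks for this to be set explicitly;
# "8.6" is a placeholder -- use your own GPU's compute capability.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.6"

from vllm import LLM, SamplingParams

llm = LLM(
    model="./Qwen2.5-7B-Instruct",      # same local checkpoint as in the log
    quantization="bitsandbytes",        # log shows BitsAndBytes weight loading
    gpu_memory_utilization=0.9,         # matches the 0.90 reported by the profiler
    max_model_len=3072,                 # log reports 3072 tokens per request
    enforce_eager=True,                 # skip CUDA graph capture, as the log message suggests
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello"], params)
print(outputs[0].outputs[0].text)
```

If this still crashes in apply_penalties with enforce_eager=True, that would at least rule out CUDA graph capture as the cause.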
Has anyone encountered this error?