
logits[logits <= 0] *= torch.where(prompt_mask | output_mask,

[Open] Floral opened this issue 7 months ago • 0 comments

WARNING 05-19 23:02:54 [interface.py:303] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 05-19 23:02:54 [cuda.py:291] Using Flash Attention backend.
INFO 05-19 23:02:55 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-19 23:02:55 [model_runner.py:1110] Starting to load model ./Qwen2.5-7B-Instruct...
INFO 05-19 23:02:56 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [01:02<03:06, 62.09s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [02:03<02:03, 61.51s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [03:03<01:00, 60.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [04:02<00:00, 60.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [04:02<00:00, 60.63s/it]

INFO 05-19 23:06:58 [model_runner.py:1146] Model loading took 5.2045 GB and 243.222897 seconds
INFO 05-19 23:07:01 [worker.py:267] Memory profiling takes 2.10 seconds
INFO 05-19 23:07:01 [worker.py:267] the current vLLM instance can use total_gpu_memory (15.99GiB) x gpu_memory_utilization (0.90) = 14.39GiB
INFO 05-19 23:07:01 [worker.py:267] model weights take 5.20GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 7.73GiB.
INFO 05-19 23:07:01 [executor_base.py:111] # cuda blocks: 9048, # CPU blocks: 4681
INFO 05-19 23:07:01 [executor_base.py:116] Maximum concurrency for 3072 tokens per request: 47.12x
INFO 05-19 23:07:02 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:31<00:00, 1.13it/s]
INFO 05-19 23:07:33 [model_runner.py:1570] Graph capturing finished in 31 secs, took 0.64 GiB
INFO 05-19 23:07:33 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 34.62 seconds
INFO 05-19 23:07:34 [xgrammar_decoding.py:191] Qwen model detected, consider set guided_backend=xgrammar:disable-any-whitespace to prevent runaway generation of whitespaces.
Processed prompts:   0%| | 0/689 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/bin/weclone-cli", line 10, in <module>
[rank0]:     sys.exit(cli())
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
[rank0]:     return self.main(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
[rank0]:     rv = self.invoke(ctx)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
[rank0]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
[rank0]:     return ctx.invoke(self.callback, **ctx.params)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
[rank0]:     return callback(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/weclone/cli.py", line 26, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/weclone/cli.py", line 47, in qa_generator
[rank0]:     processor.main()
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/weclone/data/qa_generator.py", line 98, in main
[rank0]:     self.clean_strategy.judge(qa_res)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/weclone/data/clean/strategies.py", line 46, in judge
[rank0]:     outputs = infer(
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/weclone/core/inference/vllm_infer.py", line 132, in infer
[rank0]:     results = LLM(**engine_args).generate(inputs, sampling_params, lora_request=lora_request)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/utils.py", line 1072, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 465, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1375, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1434, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 139, in execute_model
[rank0]:     output = self.collective_rpc("execute_model",
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/utils.py", line 2255, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1798, in execute_model
[rank0]:     output: SamplerOutput = self.model.sample(
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 480, in sample
[rank0]:     next_tokens = self.sampler(logits, sampling_metadata)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 262, in forward
[rank0]:     logits = apply_penalties(logits, sampling_tensors.prompt_tokens,
[rank0]:   File "/mnt/e/Documents/copyme/WeClone/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/utils.py", line 52, in apply_penalties
[rank0]:     logits[logits <= 0] *= torch.where(prompt_mask | output_mask,
[rank0]: RuntimeError: CUDA error: unknown error
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Processed prompts:   0%| | 0/689 [02:14<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[rank0]:[W519 23:11:30.800708962 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
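Two knobs the log itself suggests are worth trying before digging further: eager mode (skips CUDA graph capture, a common trouble spot under WSL) and a bit more memory headroom. For reference, this run budgeted 15.99 GiB x 0.90 = 14.39 GiB and was left with 7.73 GiB of KV cache after weights (5.20 GiB), non-torch memory (0.05 GiB), and activation peak (1.41 GiB). A minimal stand-alone sketch, assuming a plain vLLM install; in WeClone the equivalent options would go into its own engine_args/settings rather than a direct LLM(...) call, and the values below are illustrative:

```python
from vllm import LLM, SamplingParams

# Stand-alone sketch of the knobs named in the log (enforce_eager,
# gpu_memory_utilization, max_num_seqs); not WeClone's actual config surface.
# (The real run also loads the model with BitsAndBytes quantization; omitted here.)
llm = LLM(
    model="./Qwen2.5-7B-Instruct",
    enforce_eager=True,            # skip CUDA graph capture, as the log advises
    gpu_memory_utilization=0.85,   # below the 0.90 used in the failing run
    max_model_len=3072,            # matches "3072 tokens per request" above
    max_num_seqs=32,               # smaller batch, also suggested by the log
)
print(llm.generate(["hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```

If the same line still fails with enforce_eager=True, that would point away from graph capture and toward the WSL/CUDA setup itself.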

Has anyone run into this error?

Floral, May 19 '25 16:05