
[Bug]: Another Bug while starting (Global)

Open Abdulhanan535 opened this issue 1 year ago • 2 comments

Your current environment

Kaggle, 2x Tesla T4

🐛 Describe the bug

```
WARNING: Casting torch.bfloat16 to torch.float16.
WARNING: Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by Aphrodite. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING: Casting torch.bfloat16 to torch.float16.
INFO: Defaulting to use mp for distributed inference.
INFO: Initializing Aphrodite Engine (v0.6.0 commit 75122b2) with the following config:
INFO: Model = 'lemon07r/Gemma-2-Ataraxy-9B'
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 2
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Context Length = 4096
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
tokenizer_config.json: 100%|███████████████| 40.6k/40.6k [00:00<00:00, 5.14MB/s]
tokenizer.model: 100%|█████████████████████| 4.24M/4.24M [00:00<00:00, 31.9MB/s]
tokenizer.json: 100%|██████████████████████| 17.5M/17.5M [00:00<00:00, 81.4MB/s]
special_tokens_map.json: 100%|█████████████████| 636/636 [00:00<00:00, 3.58MB/s]
WARNING: Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO: Using Flashinfer backend.
(AphroditeWorkerProcess pid=302) INFO: Using Flashinfer backend.
(AphroditeWorkerProcess pid=302) INFO: Worker ready; awaiting tasks
INFO: generating GPU P2P access cache in /root/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
INFO: Port 2242 is already in use, trying port 2243
INFO: Loading model lemon07r/Gemma-2-Ataraxy-9B...
INFO: Using Flashinfer backend.
(AphroditeWorkerProcess pid=302) INFO: Using Flashinfer backend.
INFO: Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100%|██████| 5/5 [01:05<00:00, 13.08s/it]
INFO: Model loaded in 138.08 seconds.
INFO: Weights memory usage: 8.81 GiB x 2 = 17.63 GiB
INFO: Profiling peak memory usage...
Process Process-1:
ERROR: Worker AphroditeWorkerProcess pid 302 died, exit code: -15
INFO: Killing local Aphrodite worker processes
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/endpoints/openai/rpc/server.py", line 204, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, port)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/endpoints/openai/rpc/server.py", line 23, in __init__
    self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 470, in from_engine_args
    engine = cls(
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 379, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 550, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 257, in __init__
    self._initialize_kv_caches()
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 313, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/executor/distributed_gpu_executor.py", line 35, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 189, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/task_handler/worker.py", line 189, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/task_handler/model_runner.py", line 969, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/aphrodite/task_handler/model_runner.py", line 1355, in execute_model
    BatchDecodeWithPagedKVCacheWrapper(
TypeError: 'NoneType' object is not callable
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

Abdulhanan535 · Sep 06 '24 03:09

Please follow the instructions in the issue template and run the env.py script so I can see what environment you're working with. I have no idea what the default Kaggle environment looks like, or how you've modified it.

AlpinDale · Sep 06 '24 03:09

okay..

```
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35
Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.154+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
Stepping: 3
BogoMIPS: 4000.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64 KiB (2 instances)
L1i cache: 64 KiB (2 instances)
L2 cache: 2 MiB (2 instances)
L3 cache: 38.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] flashinfer==0.1.6+cu118torch2.4
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.2
[pip3] optree==0.11.0
[pip3] pytorch-ignite==0.5.1
[pip3] pytorch-lightning==2.4.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.0
[pip3] torchinfo==1.8.0
[pip3] torchmetrics==1.4.1
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] flashinfer 0.1.6+cu118torch2.4 pypi_0 pypi
[conda] magma-cuda121 2.6.1 1 pytorch
[conda] mkl 2024.2.1 ha957f24_103 conda-forge
[conda] numpy 1.26.4 py310hb13e2d6_0 conda-forge
[conda] optree 0.11.0 pypi_0 pypi
[conda] pytorch-ignite 0.5.1 pypi_0 pypi
[conda] pytorch-lightning 2.4.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchaudio 2.4.0 pypi_0 pypi
[conda] torchinfo 1.8.0 pypi_0 pypi
[conda] torchmetrics 1.4.1 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Aphrodite Version: 0.6.0
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
```
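Two version tags in this dump disagree: flashinfer is a CUDA 11.8 build (`0.1.6+cu118torch2.4`) while PyTorch is a CUDA 12.1 build (`2.4.0+cu121`). A short sketch that checks exactly this, using only values already shown above; treating the mismatch as the cause of the failed import is an inference, not something confirmed in the thread:

```python
# Sketch: compare the CUDA toolkit torch was built with against the CUDA tag
# baked into the installed flashinfer wheel. Expected values in the comments
# are taken from the environment dump above.
from importlib.metadata import version

import torch

print("torch built for CUDA:", torch.version.cuda)  # '12.1'
print("flashinfer wheel:", version("flashinfer"))   # '0.1.6+cu118torch2.4'

# cu118 vs cu121: if these disagree, installing a flashinfer wheel that matches
# the torch CUDA build (and torch version) would be the first thing to try.
```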

Abdulhanan535 · Sep 06 '24 04:09