
[Usage]: Llama-3.1-405B Inference with vLLM TPU

Open ryanaoleary opened this issue 4 months ago • 0 comments

Your current environment

Collecting environment information...
INFO 10-03 20:20:36 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
PyTorch version: 2.5.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.10.14 (main, Aug 13 2024, 02:16:06) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-6.1.100+-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 0
CPU MHz: 2199.998
BogoMIPS: 4399.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 2 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.0
[pip3] torch-xla==2.5.0+git17a4ef5
[pip3] torchvision==0.19.0a0+d23a6e1
[pip3] transformers==4.45.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev15+g3b00b9c2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect

How would you like to use vllm

I've tried running Llama-3.1-405B on TPU slices up to 4x4x8 v4 and 8x16 v5e and ran into a few issues:

  1. As the slice size grows, the time needed for vLLM initialization and memory profiling increases dramatically
  2. Attempting to run inference with topologies smaller than the slice sizes above fails with errors like RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 20.44G of 15.75G hbm. Exceeded hbm capacity by 4.70G., since v5e and v4 chips only have 16 GB and 32 GB of HBM respectively. The relatively small HBM capacity (compared to GPUs) means much larger slice sizes are needed just to fit the sharded weights.
  3. Even when loading a model that will be sharded (i.e. with tensor parallelism > 1), vLLM still downloads the entire model on each worker and only afterwards writes each worker's shard of the weights to new files. This means larger slice sizes require an extremely large amount of total disk space when loading large models.
  4. Larger multi-host slice sizes lead to ValueError: Too large swap space. errors, where vLLM attempts to allocate more CPU memory to the swap space than is actually available. I've worked around this by setting swap_space=0 in the EngineArgs (see the launch sketch after this list), but I'm worried this slows down model loading.
  5. vLLM lacks support for running across multiple multi-host TPU slices (e.g. by specifying pipeline-parallelism > 1)
  6. The vLLM TPU backend lacks support for loading quantized models (e.g. https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4)

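For reference, this is roughly how I'm launching the engine on the 8x16 v5e slice. The checkpoint id, context length, and sampling settings below are illustrative rather than exactly what I ran, and swap_space=0 is the workaround from item 4. (For scale, 405B parameters in bf16 are already about 810 GB of weights before any KV cache, which is why the smaller topologies in item 2 run out of HBM.)

```python
# Rough launch sketch; model id, max_model_len, and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # illustrative checkpoint id
    tensor_parallel_size=128,                    # one rank per chip on an 8x16 v5e slice
    distributed_executor_backend="ray",          # multi-host execution through Ray
    swap_space=0,                                # workaround for "Too large swap space."
    max_model_len=4096,                          # kept small to limit KV-cache pressure
)

outputs = llm.generate(
    ["San Francisco is a"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
for output in outputs:
    print(output.outputs[0].text)
```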
I want to serve and run inference for Llama-3.1-405B on TPUs. My specific ask is whether I'm missing anything obvious that would ameliorate the problems above. It would also be really helpful to know whether it's possible to load only the relevant weights/shards on each Ray worker so that the total required disk space could be reduced (I've sketched roughly what I mean below).
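To make that last ask concrete, here is a rough illustration of the kind of per-worker loading I have in mind; it is not an existing vLLM code path. The checkpoint id is just an example, and partition_for_rank is a hypothetical stand-in; the real assignment would have to mirror how vLLM actually shards each tensor across tensor-parallel ranks.

```python
# Rough sketch (not an existing vLLM code path): each rank reads the safetensors
# index to find which shard files hold "its" parameters and downloads only those,
# instead of pulling the full checkpoint on every worker.
import json
from huggingface_hub import hf_hub_download

REPO = "meta-llama/Llama-3.1-405B-Instruct"  # example checkpoint id


def partition_for_rank(param_names, rank, world_size):
    # Hypothetical placeholder: round-robin over parameter names. The real mapping
    # would have to follow vLLM's tensor-parallel sharding, where most tensors are
    # split (not assigned whole) across ranks.
    return {name for i, name in enumerate(sorted(param_names)) if i % world_size == rank}


def shard_files_for_rank(rank, world_size):
    # model.safetensors.index.json maps every parameter name to the file storing it.
    index_path = hf_hub_download(REPO, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    wanted = partition_for_rank(weight_map.keys(), rank, world_size)
    return sorted({weight_map[name] for name in wanted})


# Each Ray worker would then fetch only its own subset of shard files:
for filename in shard_files_for_rank(rank=0, world_size=16):
    hf_hub_download(REPO, filename)
```

Even a coarse version of this would drop per-worker disk usage to roughly that worker's share of the shard files rather than the full checkpoint.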

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

ryanaoleary · Oct 03 '24 20:10