
[Bug]: clamp broken with HIP

Open cb88 opened this issue 1 year ago • 1 comments

Your current environment

The output of `python env.py`

```text
[cb88@M31-AR0 aphrodite]$ python env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 14.2.1 20240910
Clang version: 18.1.8
CMake version: version 3.31.3
Libc version: glibc-2.40

Python version: 3.10.15 (main, Dec 26 2024, 14:29:02) [GCC 14.2.1 20240910] (64-bit runtime)
Python platform: Linux-6.12.8-arch1-1-x86_64-with-glibc2.40
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7352 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 55%
CPU max MHz: 3202.9290
CPU min MHz: 1500.0000
BogoMIPS: 4601.58
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization: AMD-V
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-triton-rocm==3.1.0
[conda] Could not collect
ROCM Version: 6.2.41134-0
Neuron SDK Version: N/A
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
```


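I had to edit two lines in `./kernels/quantization/compressed_tensors/int8_quant_kernels.cu`, commenting out the `std::clamp` call and saturating with explicit min/max instead:

//dst = std::clamp(dst, i8_min, i8_max);
  // saturate dst to [i8_min, i8_max] without std::clamp
  dst = std::min<int64_t>(std::max<int64_t>(dst, i8_min), i8_max);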

With this applied I can run single-threaded inference with HIP on an MI60 (v0.6.5, commit 69519285). `-tp 2` and `-pp 2` do not work, though.

cb88 avatar Jan 08 '25 14:01 cb88

An alternative patch is also listed here with a bit more info.

dst = (dst < i8_min) ? i8_min : (i8_max < dst) ? i8_max : dst;

https://github.com/lamikr/rocm_sdk_builder/commit/c337b2f5da1ebe9c5dcfc799a52e00020ffcf1c0#diff-6c8c5d41df041f13cdbb55ed8e7669bbf686e7093b8d7f3282001c7dada88dbaL39
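For reference, the ternary form is a saturating clamp: values below `i8_min` map to `i8_min`, values above `i8_max` map to `i8_max`, and everything else passes through unchanged. Below is a minimal host-side sketch that checks it against `std::clamp`. It assumes `int32_t` inputs, and the helper names `clamp_to_i8` and `matches_std_clamp` are hypothetical; none of this is taken from the kernel source, and it does not exercise the HIP device path.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical helper for illustration; not the kernel's actual function.
// Saturates x to the int8_t range using only comparisons, avoiding the
// std::clamp call that this issue reports as broken under HIP.
inline int8_t clamp_to_i8(int32_t x) {
  constexpr int32_t i8_min = std::numeric_limits<int8_t>::min();  // -128
  constexpr int32_t i8_max = std::numeric_limits<int8_t>::max();  //  127
  int32_t dst = x;
  dst = (dst < i8_min) ? i8_min : (i8_max < dst) ? i8_max : dst;
  return static_cast<int8_t>(dst);
}

// Returns true if the ternary form agrees with std::clamp on [-1000, 1000].
inline bool matches_std_clamp() {
  for (int32_t x = -1000; x <= 1000; ++x) {
    const int32_t expected = std::clamp<int32_t>(x, -128, 127);
    if (clamp_to_i8(x) != expected) return false;
  }
  return true;
}
```

Note that any replacement has to take `dst` as an input: an expression built only from `i8_min` and `i8_max` compiles but yields a constant instead of a clamped value.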

cb88 avatar Jan 14 '25 17:01 cb88