b1tg
Works on tinyr5 and my Mac.
comgr enables cumode by passing -mcumode; I verified this by running llvm-dis on the temp .bc file, and there is a +cumode flag in the IR as well. Enabling or disabling cumode didn't affect the AMD backend speed...
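The check above can be sketched as a tiny helper (hypothetical, not from the codebase; it assumes you have already dumped the temp .bc with `llvm-dis`):

```python
# Hypothetical helper: scan a disassembled .ll dump for the cumode target feature.
# The temp .bc comes from comgr; run e.g. `llvm-dis temp.bc -o temp.ll` first.
def has_cumode(ll_text: str) -> bool:
    # The feature shows up inside the "target-features" attribute string.
    return "+cumode" in ll_text

# Example attribute fragment like the one seen in the dumped IR:
sample = '"target-features"="+16-bit-insts,+cumode,+wavefrontsize32"'
print(has_cumode(sample))  # → True
```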
comgr uses the following flags (basically matching [this](https://github.com/ROCm/llvm-project/blob/b154bfa63698a4c35756c839a22108962b708165/clang/test/CodeGenOpenCL/amdgpu-features.cl#L100)): ```f'attributes #0 = {{ convergent mustprogress norecurse nounwind "amdgpu-flat-work-group-size"="1,{requiredMaxThreadsPerBlock}" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+cumode,+dl-insts,+dot10-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="true" }}'``` AMD_LLVM got 3h36m using this, a speed-up, but...
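For readability, the inline f-string above can be unpacked into a small sketch; `required_max_threads_per_block` is the only variable part, everything else is fixed for gfx1100 (the helper name is mine, not from the codebase):

```python
# Sketch of the attribute line quoted above; only the upper bound of
# amdgpu-flat-work-group-size varies per kernel.
FEATURES = ("+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+cumode,+dl-insts,"
            "+dot10-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,"
            "+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,"
            "+wavefrontsize32")

def attr_line(required_max_threads_per_block: int) -> str:
    return (f'attributes #0 = {{ convergent mustprogress norecurse nounwind '
            f'"amdgpu-flat-work-group-size"="1,{required_max_threads_per_block}" '
            f'"no-trapping-math"="true" "stack-protector-buffer-size"="8" '
            f'"target-cpu"="gfx1100" "target-features"="{FEATURES}" '
            f'"uniform-work-group-size"="true" }}')

print(attr_line(256))
```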
AMD_LLVM got 3h18m after https://github.com/tinygrad/tinygrad/pull/9680/commits/de93902b57f7dfb97348816cc676266e0e0a6a78, 4.2% slower than the AMD backend (3h10m).
After the nsw flag, AMD_LLVM takes 3h06m (BERT_LAYERS=2) / 12h23m (BERT_LAYERS=24); the AMD backend takes 3h10m / 12h08m.
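As a sanity check on the percentages quoted above (the helper names here are mine, not from the codebase):

```python
# Quick arithmetic check of the timings quoted above (all times in minutes).
def pct_diff(a_min: int, b_min: int) -> float:
    """Percent by which a is slower than b (negative means a is faster)."""
    return (a_min - b_min) / b_min * 100

h = lambda hours, mins: hours * 60 + mins

# 3h18m vs the AMD backend's 3h10m: matches the 4.2% figure above
print(f"{pct_diff(h(3, 18), h(3, 10)):.1f}%")  # → 4.2%
# After the nsw flag, BERT_LAYERS=2: AMD_LLVM is slightly faster
print(f"{pct_diff(h(3, 6), h(3, 10)):.1f}%")   # → -2.1%
```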
Based on my experience, no. When I was debugging the slower cases, I made a "HIP_LLVM" backend: use hipcc to compile HIP to LLVM IR, then compile that to amdgpu with AMDLLVMCompiler; those...
https://github.com/tinygrad/tinygrad/pull/10267 fixes the speed_v_theoretical.py timeout in CI.
Not sure if it's relevant, but I encountered `nccl.h: No such file or directory` today when installing from source; fixed by `sudo apt install libnccl-dev`.
This depends on https://github.com/tinygrad/tinygrad/pull/13438 to get the speedup. After that, run `BEAM=3 CNT=1 AMD_LLVM=0 DEBUG=2 FP8E4M3=1 SHOULD_USE_TC=1 python extra/gemm/simple_matmul.py` on mi350x; it should get > 500 TFLOPS (if not, run with IGNORE_BEAM_CACHE=1).
~I have fixed these two issues, will open a PR~ not the same bug...