exllama
Fix half2 with HIP
The current version of CUDA allows you to access the component halves of a half2 through half2.x and half2.y, but in HIP, x and y are unsigned shorts rather than half or _Float16. This replaces direct access to those members with intrinsic functions.
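For illustration, the replacement pattern might look like the sketch below (helper names are hypothetical; `__low2half`, `__high2half`, and `__halves2half2` are the standard fp16 intrinsics available under both CUDA and HIP):

```cuda
#include <cuda_fp16.h>  // <hip/hip_fp16.h> when building with HIP

// Portable accessors: under HIP, half2.x / half2.y are unsigned shorts,
// so reading them directly reinterprets the bit pattern instead of
// yielding a half. These intrinsics return proper halves on both platforms.
__device__ __forceinline__ half lo_half(half2 v)          { return __low2half(v); }
__device__ __forceinline__ half hi_half(half2 v)          { return __high2half(v); }
__device__ __forceinline__ half2 make_h2(half l, half h)  { return __halves2half2(l, h); }
```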
With the changes I can use --force_half2 and still get sensible text with my AMD GPU.
This doesn't work for me. Still spouting gibberish with -fh2. I also don't understand your change to __compat_h2rcp; it's just a backport from ROCm 5.6 for people on an outdated version of ROCm, and it can simply be removed when using ROCm >= 5.6.
That sucks. I thought I had figured it out. I'm using a 5700 XT with rocBLAS and PyTorch 2.0.1 recompiled on ROCm 5.4.3, so it's possible the compiler is working around something else related to feature differences, like the dot product opcodes that GPU doesn't have.
With __compat_h2rcp as taken from ROCm 5.6, I got gibberish from silu_mul_cuda_kernel with half2. A basic test kernel that calls __compat_h2rcp gives the wrong result, and its disassembly contains a v_cvt_f16_u16_e32, so it's converting from a short to a 16-bit float. Compiling it in a ROCm 5.6 Docker container has the same result, including for the h2rcp() that's built into ROCm. Targeting gfx1030 produces the same object code as gfx1010, so I suspect there's a bug in AMD's h2rcp.
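One way to sidestep the suspect vector intrinsic is to compute the reciprocal per component. A minimal sketch (the function name is hypothetical; round-tripping through float avoids the bad u16-to-f16 conversion seen in the disassembly):

```cuda
#include <cuda_fp16.h>  // <hip/hip_fp16.h> when building with HIP

// Hypothetical fallback for h2rcp(): take the reciprocal of each
// half separately via float, then repack into a half2. Slower than
// a native vector reciprocal, but correct on both CUDA and HIP.
__device__ __forceinline__ half2 h2rcp_compat(half2 x)
{
    half lo = __float2half(1.0f / __half2float(__low2half(x)));
    half hi = __float2half(1.0f / __half2float(__high2half(x)));
    return __halves2half2(lo, hi);
}
```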
I spent a bit more time testing your patch. The output seems a bit more coherent, or at least different from the output without it.
Without patch:
-- Testing 8 chunks.
** Perplexity (reconstruct): 32000.1813
-- Testing 8 chunks.
** Perplexity (quant, token): 31321.9178
** Generation: 'To be or not to be, that is theMDbrnлия Sud Beauoin Burg stick BourbourgrourutobreFD Stock ReserveMDbMDbMDbMDb'
With patch:
-- Testing 8 chunks.
** Perplexity (reconstruct): 5.4959
-- Testing 8 chunks.
** Perplexity (quant, token): 292330.3358
** Generation: 'To be or not to be, that is the............\n – – – –– – noiseokal drag next kv Ron Ep Brook future'
As you can see, Perplexity (reconstruct) seems correct with your patch. I looked a bit further, and while using your patch, disabling fused MLP with --fused_mlp_thd 0 makes it work.
So, there is still some half2 problem in the fused MLP.
Do they actually matter, though? Maybe you could benchmark it with and without --silu_no_half2 to see if it's even worthwhile to keep fighting with h2exp and h2rcp. The activation kernel is very memory-bound anyway.
@ardfork can you try clearing the contents of the exllama_ext cache? In my case it was in ~/.cache/torch_extensions/py311_cpu/exllama_ext
I tried first running exllama without the patch and with the cache deleted, and then ran with the patch without deleting the cache and got gibberish. After clearing the cache it worked.
With intrinsic functions used to access the component halves of half2, undefining HIP_NO_HALF_CONVERSIONS doesn't seem to be necessary anymore.
That works now. It looks like the changes in hip_compat.cuh weren't taken into account when rebuilding. I wonder if adding a trivial change to q4_mlp to force it to be rebuilt would be useful (or finding another way to force a rebuild), since people are going to run into the same problem I had.
As for speed, on my machine it's the same with or without half2, but it's nice to be able to use everything.
It might be a good idea to raise an issue or a PR upstream to get that h2rcp fixed, as they already tried to fix it in 5.6.
If there is no performance drop on most HIP-supported platforms, I think we should revert #afc8b7cd. And also revert https://github.com/oobabooga/text-generation-webui/commit/3c076c3c8096fa83440d701ba4d7d49606aaf61f once it uses a version of exllama with your patch merged.
If there is no problem with using __low2half and __high2half on CUDA, I think it would be nice to merge this PR; having all the paths working on HIP is better.
Changing extra_cuda_cflags caused ninja to compile everything again too, so I don't think there's a need to change q4_mlp this time.
I raised an issue with hipamd for h2rcp(). https://github.com/ROCm-Developer-Tools/clr/issues/8
I went and added a newline to q4_mlp.cu anyway, in case someone using exllama downstream has their own code for loading the extension. I removed HIP defaulting to no_half2 like you suggested.
I would assume that performance should be fine for anyone with an RDNA, CDNA, or Vega GPU. Anything from the RX 580 era lacks the packed math instructions, so half2 might be slower there depending on what the compiler does.
> I raised an issue with hipamd for h2rcp()
They have moved hipamd to https://github.com/ROCm-Developer-Tools/clr. It might be a good idea to also file your issue on that new repo; I fear it might get forgotten on the old one.