llama2.c
Running on GPU with OpenMP offloading
I have tried to make minimal changes to the code so that it runs on the GPU using OpenMP offloading. Currently only the matmul happens on the GPU; the other operations stay on the CPU.
A few tmp variables are introduced, because offloading the weight pointers directly (i.e. w->k) is not possible: they point to a virtual address that is only resolved to the actual address at runtime.
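For illustration, a minimal sketch of the tmp-pointer workaround described above (the variable names and the outer directive here are my assumptions, not the exact code in this patch; w, s, dim and l are the usual run.c names):
// Mapping the struct member w->wk directly in a map clause did not work,
// so copy it into a plain local pointer first and map that instead.
float *tmp_w   = w->wk + (size_t)l * dim * dim;  // this layer's key projection
float *tmp_in  = s->xb;                          // normalized input activation
float *tmp_out = s->k;                           // output: key vector
#pragma omp target teams distribute map(to: tmp_w[0:dim*dim], tmp_in[0:dim]) map(from: tmp_out[0:dim])
for (int i = 0; i < dim; i++) {
    float val = 0.0f;
    for (int j = 0; j < dim; j++) {
        val += tmp_w[i * dim + j] * tmp_in[j];
    }
    tmp_out[i] = val;
}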
I don't think this qualifies for a merge, but still open to suggestions.
The new run.c can be compiled with gcc -fopenmp -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -no-pie -O3 run.c -lm. You might need to install gcc-offload-nvptx if this doesn't compile.
The goal of OpenMP is to make the port simple, but if it does not work with the memory pointers, I don't know if it helps that much.
Do you have some benchmarks?
Hello,
I compiled with icx to use the Intel GPU in my Fedora 38 laptop. I do see my GPU being used, but I didn't see any improvement in performance.
COMPILE
cd $HOME/intel/oneapi
. ./setvars.sh
cd -
(set -x; icx -fiopenmp -fopenmp-targets=spir64 -fcf-protection=none -fno-stack-protector -no-pie -O3 $1 -lm -o run)
RUN
./draft/run llama2_7b.bin
I'm getting better performance with the base code without GPU offloading, running it like this:
OMP_NUM_THREADS=4 ./run llama2_7b.bin
Maybe I'm doing something wrong.
Awesome project btw
I am getting this error:
~/llama2.c-omp-cuda$ gcc -fopenmp -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -no-pie -O3 run.c -lm
ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'
nvptx-as: ptxas returned 255 exit status
mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-9 returned 1 exit status
compilation terminated.
Using this docker image: nvidia/cuda:12.2.0-devel-ubuntu20.04
The same error with RTX 3090 and RTX 4090
OK, I was able to build using image nvidia/cuda:11.8.0-devel-ubuntu22.04
And adding -foffload=-misa=sm_35 to the command line:
git clone https://github.com/RahulSChand/llama2.c-omp-cuda
cd llama2.c-omp-cuda
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
sudo apt update
sudo apt install -y gcc-offload-nvptx
gcc -fopenmp -foffload=nvptx-none -fcf-protection=none -foffload=-misa=sm_35 -fno-stack-protector -no-pie -O3 run.c -lm -o llama2
./llama2 stories110M.bin
But it is so slow that I had to Ctrl+C it.
@alexeadem I'm not very familiar with Intel GPUs. What GPU does your laptop have? Also, there is a little bug in this code: it is missing a keyword in a directive. So what I think is happening is that the GPU is not being utilized to its full potential right now.
@kroggen The code has a little bug on line 202 :]. It reads #pragma parallel for reduction(+:val), which is missing the omp keyword between pragma and parallel, so the second loop was essentially not being parallelized at all, I think.
Edit: But the performance issues will probably remain even after the fix, because we may still be moving data back and forth too much. It would be nice to run a profiler and see where the bottlenecks are.
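For reference, a hedged sketch of what the corrected matmul kernel could look like (the outer directive and map clauses are my assumptions; the only point taken from the comment above is the restored omp keyword on the inner loop):
void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    #pragma omp target teams distribute map(to: x[0:n], w[0:n*d]) map(from: xout[0:d])
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        #pragma omp parallel for reduction(+:val)  // was "#pragma parallel for reduction(+:val)"
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}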
I commented out that line with // and the result was the same.
Have you tried it? Is it also slow for you? In my test it is way slower than just CPU.
And I used an RTX 3090 with PCIe 4.0 with 24 GB/s bandwidth
I tried to map the items from the state struct to GPU memory, but it is also slow and the output is gibberish:
size_t size_qkvo = (size_t) config.dim * config.dim * config.n_layers;
size_t size_ffn = (size_t) config.hidden_dim * config.dim * config.n_layers;
size_t size_cls = (size_t) config.dim * config.vocab_size;
size_t size_dim = (size_t) config.dim;
size_t size_hidden_dim = (size_t) config.hidden_dim;
size_t size_att = (size_t) config.n_heads * config.seq_len;
size_t size_logits = (size_t) config.vocab_size;
size_t size_cache = (size_t) config.n_layers * config.seq_len * config.dim;
#pragma omp target data map(to:weights.wq[0:size_qkvo], weights.wk[0:size_qkvo], weights.wv[0:size_qkvo], weights.wo[0:size_qkvo], weights.w1[0:size_ffn], weights.w2[0:size_ffn], weights.w3[0:size_ffn], weights.wcls[0:size_cls])
#pragma omp target data map(to:state.x[0:size_dim], state.xb[0:size_dim], state.xb2[0:size_dim], state.hb[0:size_hidden_dim], state.hb2[0:size_hidden_dim], state.q[0:size_dim], state.k[0:size_dim], state.v[0:size_dim], state.att[0:size_att], state.key_cache[0:size_cache], state.value_cache[0:size_cache]) map(tofrom:state.logits[0:size_logits])
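For what it's worth, each target data directive only applies to the structured block that follows it, so the two lines above would typically wrap the main generation loop; a sketch, assuming the usual run.c main loop and abbreviating the map lists:
#pragma omp target data map(to: weights.wq[0:size_qkvo], weights.wk[0:size_qkvo] /* ... rest of the weights ... */)
#pragma omp target data map(to: state.x[0:size_dim], state.xb[0:size_dim] /* ... rest of the state ... */) map(tofrom: state.logits[0:size_logits])
{
    while (pos < steps) {
        transformer(token, pos, &config, &state, &weights);
        // ... sample the next token from state.logits ...
        pos++;
    }
}
// Note: with map(to: ...), anything the CPU writes into those state buffers inside this
// region (e.g. the rmsnorm results still computed on the host) is NOT re-copied to the
// device automatically, which could explain the gibberish output.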
@Ea0011 intel-gpu-top: Intel Alderlake_p (Gen12) @ /dev/dri/card1 on a IBM X1
@kroggen Yes, it was really slow for me too: about 14 tok/s on the 110M model, which is not great. I replaced that line with #pragma omp simd reduction(+:val) and it dramatically improved. Now, at 60 tok/s, it is faster than my laptop's 6-core CPU. I am running it on the free Colab with a T4 GPU by the way, which isn't the fastest thing. I really think the bottleneck is synchronization between the GPU and CPU state. I am thinking of ways to do more compute on the GPU between data transfers.
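In context, the replacement looks roughly like this (only the inner-loop directive changes; the surrounding loop body is the usual run.c matmul):
float val = 0.0f;
#pragma omp simd reduction(+:val)  // instead of "#pragma omp parallel for reduction(+:val)"
for (int j = 0; j < n; j++) {
    val += w[i * n + j] * x[j];
}
xout[i] = val;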
I think we should use something like nvprof to profile the bottleneck though. I'll try to set it up on the cloud, but I'm not sure if that will work.
About moving the state to the GPU: I think you get gibberish because we need to properly synchronize with the CPU copies and such.
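One way to get that synchronization inside the target data region would be explicit target update directives around the host-side ops; a hedged sketch using the buffer names from the mapping above:
// a host-side op (e.g. rmsnorm) has just written state.xb: push it to the device
#pragma omp target update to(state.xb[0:size_dim])
// ... run the offloaded matmul that reads state.xb and writes state.q on the device ...
// then pull the result back for the host-side ops that follow
#pragma omp target update from(state.q[0:size_dim])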
@alexeadem you could also try changing line 202 to #pragma omp simd reduction(+:val) and see if that speeds things up for you.
Hello @Ea0011, I tried that and I get a warning:
:: initializing oneAPI environment ...
build.sh: BASH_VERSION = 5.2.15(1)-release
args: Using "$@" for setvars.sh arguments: run.c
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
/home/alex/git/llama2.c/draft
+ icx -fiopenmp -fopenmp-targets=spir64 -fcf-protection=none -fno-stack-protector -no-pie -O3 run.c -lm -o run
warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
And there is no performance improvement. Maybe Intel doesn't support that...
@alexeadem Interesting. I am really not familiar with Intel stuff. On CUDA it did help, but it is still not optimal. I have a question though: can I offload the same way you did for an Intel HD 600 series GPU, for instance, or is it only for newer GPUs?
In my case it is like 1 token/second. I can't even wait for it to finish printing.
@Ea0011 It should work for an Intel HD 600 series GPU.
A quick test in Linux:
glmark2
sudo intel_gpu_top
And see if your GPU kicks in.
If it does, you should be able to use icx with
icx -fiopenmp -fopenmp-targets=spir64
or icpx with
icpx -fiopenmp -fopenmp-targets=spir64
to enable OpenMP offloading.
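To check whether the toolchain and runtime actually see an offload device (independent of intel_gpu_top), a small standalone test using the standard OpenMP API might help; a sketch:
#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("offload devices visible: %d\n", omp_get_num_devices());
    int on_host = 1;
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();  // 1 if the region fell back to the CPU
    }
    printf("target region ran on the %s\n", on_host ? "host (fallback)" : "device");
    return 0;
}
Compile it with the same offload flags as run.c (icx -fiopenmp -fopenmp-targets=spir64 or gcc -fopenmp -foffload=nvptx-none); if it reports 0 devices or falls back to the host, the offloading setup itself is the problem rather than the kernel code.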
Offloading the weights to VRAM works really well, but the run-state struct might need some refactoring to fit OpenMP's data mapping requirements?
Even when limiting the offloading to q, k, v I could not get valid results.
I suspect two issues:
1. The call to matmul() with varying dimensions doesn't map correctly to the state in VRAM when the state is offloaded early?
2. The OpenMP memory boundaries have to be clearly defined with target enter data and target exit data in the transformer function. Or all computations on the state variables need to be offloaded, not just matmul().
Any ideas on how to improve the state struct for easier offloading?
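One option, sketched here under the assumption that the RunState buffers stay as separate malloc'd arrays as in the current run.c, is to switch from the structured target data region to unstructured mappings, so the device copies live for the whole run and are not tied to one lexical block:
// right after malloc_run_state(&state, &config):
#pragma omp target enter data map(to: state.x[0:size_dim], state.xb[0:size_dim], state.hb[0:size_hidden_dim])
#pragma omp target enter data map(alloc: state.key_cache[0:size_cache], state.value_cache[0:size_cache])

// ... inference loop, using target update to/from for whatever is still computed on the host ...

// right before free_run_state(&state):
#pragma omp target exit data map(delete: state.key_cache[0:size_cache], state.value_cache[0:size_cache])
#pragma omp target exit data map(delete: state.x[0:size_dim], state.xb[0:size_dim], state.hb[0:size_hidden_dim])
The other direction mentioned above, offloading all of the per-token compute so the state never round-trips, would avoid most of the update traffic, but it needs the attention and rmsnorm loops ported as well.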
Is there still a chance this can be merged?