
Running on GPU with OpenMP offloading

RahulSChand opened this issue · 16 comments

I have tried to make minimal changes to the code so it runs on the GPU using OpenMP offloading. Currently only the matmul happens on the GPU; all other operations stay on the CPU.

A few temporary variables are introduced, because offloading the weight pointers (e.g. w->k) directly is not possible: they hold virtual addresses that are only resolved to actual device addresses at runtime.

I don't think this qualifies for a merge, but still open to suggestions.

The new run.c can be compiled with gcc -fopenmp -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -no-pie -O3 run.c -lm. You might need to install gcc-offload-nvptx if it doesn't compile.
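For readers skimming the thread, here is a minimal sketch of what an OpenMP-offloaded matmul can look like (hypothetical map clauses and structure, not the exact code from this branch); the plain local pointer w stands in for the tmp-variable workaround described above:

    // Sketch only: offload the d x n matrix-vector product with OpenMP target.
    // Mapping x, w and xout on every call is simple but costs a host<->device
    // transfer per matmul, which matters for the timings discussed below.
    void matmul(float* xout, float* x, float* w, int n, int d) {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n], w[0:(size_t)n * d]) map(from: xout[0:d])
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[(size_t)i * n + j] * x[j];
            }
            xout[i] = val;
        }
    }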

RahulSChand · Jul 28 '23

The goal of OpenMP is to make the port simple, but if it does not work with the memory pointers, I don't know how much it helps

Do you have some benchmarks?

kroggen · Jul 28 '23

Hello,

I compiled with icx to use the Intel GPU in my Fedora 38 laptop. I do see my GPU being used, but I didn't see any improvement in performance

COMPILE

    cd $HOME/intel/oneapi
    . ./setvars.sh
    cd -
    (set -x; icx -fiopenmp -fopenmp-targets=spir64 -fcf-protection=none -fno-stack-protector -no-pie -O3 $1 -lm -o run)

RUN

./draft/run llama2_7b.bin

I'm getting better performance with the base code without GPU offloading, running it like this:

 OMP_NUM_THREADS=4 ./run llama2_7b.bin

Maybe I'm doing something wrong.

Awesome project btw

alexeadem · Jul 29 '23

I am getting this error:

~/llama2.c-omp-cuda$ gcc -fopenmp -foffload=nvptx-none -fcf-protection=none  -fno-stack-protector -no-pie -O3 run.c -lm
ptxas fatal   : Value 'sm_30' is not defined for option 'gpu-name'
nvptx-as: ptxas returned 255 exit status
mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-9 returned 1 exit status
compilation terminated.

Using this docker image: nvidia/cuda:12.2.0-devel-ubuntu20.04

The same error with RTX 3090 and RTX 4090

kroggen · Jul 30 '23

OK, I was able to build using image nvidia/cuda:11.8.0-devel-ubuntu22.04

And adding -foffload=-misa=sm_35 to the command line:

git clone https://github.com/RahulSChand/llama2.c-omp-cuda
cd llama2.c-omp-cuda
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
sudo apt update
sudo apt install -y gcc-offload-nvptx
gcc -fopenmp -foffload=nvptx-none -fcf-protection=none -foffload=-misa=sm_35 -fno-stack-protector -no-pie -O3 run.c -lm -o llama2
./llama2 stories110M.bin

But it is soooo slow, that I had to Ctrl+C it

kroggen · Jul 30 '23

@alexeadem I'm not very familiar with Intel GPUs. Which GPU does your laptop have? Also, this code has a little bug: it is missing a keyword in a directive, so I think the GPU is not being used to its full potential right now.

Ea0011 · Jul 30 '23

@kroggen The code has a little bug on line 202 :]. It says #pragma parallel for reduction(+:val), which is missing the omp keyword between pragma and parallel. So the second loop was essentially not being parallelized at all, I think.

Edit: But the performance issues will probably remain even after the fix, because we may still be moving data back and forth too much. It would be nice to run a profiler and see where the bottlenecks are.
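For reference, the fix being described is just the missing keyword:

    #pragma parallel for reduction(+:val)       // as written on line 202: not an OpenMP directive, so it is ignored
    #pragma omp parallel for reduction(+:val)   // intended form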

Ea0011 · Jul 30 '23

I commented that line with // and the result was the same

Have you tried it? Is it also slow for you? In my test it is way slower than just the CPU

And I used an RTX 3090 with PCIe 4.0 with 24 GB/s bandwidth

I tried to map the items from the state struct to GPU memory, but it is also slow and the output is gibberish:

    size_t size_qkvo = (size_t) config.dim * config.dim * config.n_layers;
    size_t size_ffn = (size_t) config.hidden_dim * config.dim * config.n_layers;
    size_t size_cls = (size_t) config.dim * config.vocab_size;

    size_t size_dim = (size_t) config.dim;
    size_t size_hidden_dim = (size_t) config.hidden_dim;
    size_t size_att = (size_t) config.n_heads * config.seq_len;
    size_t size_logits = (size_t) config.vocab_size;
    size_t size_cache = (size_t) config.n_layers * config.seq_len * config.dim;

    #pragma omp target data map(to:weights.wq[0:size_qkvo], weights.wk[0:size_qkvo], weights.wv[0:size_qkvo], weights.wo[0:size_qkvo], weights.w1[0:size_ffn], weights.w2[0:size_ffn], weights.w3[0:size_ffn], weights.wcls[0:size_cls])
    #pragma omp target data map(to:state.x[0:size_dim], state.xb[0:size_dim], state.xb2[0:size_dim], state.hb[0:size_hidden_dim], state.hb2[0:size_hidden_dim], state.q[0:size_dim], state.k[0:size_dim], state.v[0:size_dim], state.att[0:size_att], state.key_cache[0:size_cache], state.value_cache[0:size_cache]) map(tofrom:state.logits[0:size_logits])
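One thing to double-check with this approach: each target data directive scopes the mapping to the structured block that follows it, so the maps typically need to wrap the whole token loop. Roughly (only a couple of the map entries repeated, names as in the snippet above):

    #pragma omp target data map(to: weights.wq[0:size_qkvo], weights.wk[0:size_qkvo])
    #pragma omp target data map(to: state.x[0:size_dim]) map(tofrom: state.logits[0:size_logits])
    {
        // the whole generation loop / transformer() calls go here,
        // so the mapped weights and state stay resident on the device
    }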

kroggen · Jul 31 '23

> @alexeadem I'm not very familiar with Intel GPUs. Which GPU does your laptop have? Also, this code has a little bug: it is missing a keyword in a directive, so I think the GPU is not being used to its full potential right now.

@Ea0011 intel-gpu-top: Intel Alderlake_p (Gen12) @ /dev/dri/card1 on an IBM X1

alexeadem · Jul 31 '23

@kroggen Yes, it was really slow for me, about 14 tok/s on the 110M model, which is not great. I replaced that line with #pragma omp simd reduction(+:val) and it improved dramatically: at about 60 tok/s it is now faster than my laptop's 6-core CPU. I am running it on the free Colab with a T4 GPU by the way, which isn't the fastest thing. I really think the bottleneck is synchronization between the GPU and CPU states. I am thinking of ways to do more compute on the GPU between data transfers.
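To show where that directive sits, the loop nest roughly becomes (structure assumed, map clauses omitted, not the exact branch code):

    #pragma omp target teams distribute parallel for   // outer loop over output rows on the GPU
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        #pragma omp simd reduction(+:val)              // inner dot product vectorized per thread
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }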

I think we should use something like nvprof to profile the bottleneck though. I'll try to set it up on the cloud but am not sure if this will work.

About moving the state to the GPU: I think you get gibberish because we need to properly synchronize with the CPU copies and such.
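For illustration, the kind of synchronization needed might look like this (names follow llama2.c's RunState, while w_q_layer and dim are assumed placeholders; this is not code from the branch):

    // The host just wrote s->xb (e.g. the CPU-side rmsnorm); refresh the device
    // copy before the offloaded matmul, then pull the result back for the rest
    // of the CPU-side work.
    float *xb = s->xb;
    float *q  = s->q;
    #pragma omp target update to(xb[0:dim])      // host -> device
    matmul(q, xb, w_q_layer, dim, dim);          // offloaded matmul reads device data
    #pragma omp target update from(q[0:dim])     // device -> host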

Ea0011 · Jul 31 '23

@alexeadem you could also try changing line 202 with #pragma omp simd reduction(+:val) and see if that speeds things up for you.

Ea0011 · Jul 31 '23

> @alexeadem you could also try changing line 202 with #pragma omp simd reduction(+:val) and see if that speeds things up for you.

Hello @Ea0011, I tried that and I get a warning:

:: initializing oneAPI environment ...
   build.sh: BASH_VERSION = 5.2.15(1)-release
   args: Using "$@" for setvars.sh arguments: run.c
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

/home/alex/git/llama2.c/draft
+ icx -fiopenmp -fopenmp-targets=spir64 -fcf-protection=none -fno-stack-protector -no-pie -O3 run.c -lm -o run
warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

And there is no performance improvement. Maybe Intel doesn't support that...

alexeadem · Jul 31 '23

@alexeadem Interesting. I am really not familiar with the Intel stuff. On CUDA it did help, but it is still not optimal. I have a question though: can I offload the same way you did for an Intel HD 600 series GPU, for instance, or is it only for newer GPUs?

Ea0011 · Aug 01 '23

In my case it is like 1 token/second. I can't even wait for it to finish printing

kroggen · Aug 01 '23

> intel hd 600 series GPU

@Ea0011 it should

A quick test in Linux:

glmark2
sudo intel_gpu_top

And see if your GPU kicks in.

If it does, you should be able to use icx with

    icx -fiopenmp -fopenmp-targets=spir64
    icpx -fiopenmp -fopenmp-targets=spir64

to enable OpenMP offloading

alexeadem · Aug 02 '23

Offloading the weights to VRAM works really well, but the run-state struct might need some refactoring to fit OpenMP's data mapping requirements?

Even when limiting the offloading to q, k, v I could not get valid results.

I suspect two issues.

The call to matmul() with varying dimensions doesn't map correctly to the state in VRAM when the state is offloaded early?

The OpenMP memory boundaries have to be clearly defined with target enter data and target exit data in the transformer function. Or all computations on the state variables need to be offloaded, not just matmul().
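One possible shape for the enter/exit data idea, sketched with the existing llama2.c names (an assumption of mine, not a tested refactor): allocate the buffers on the device once before the generation loop and release them at the end, so the matmul() calls in between can assume the data is already resident.

    float *key_cache = state.key_cache;
    size_t cache_len = (size_t)config.n_layers * config.seq_len * config.dim;

    #pragma omp target enter data map(to: key_cache[0:cache_len])   // device copy lives across calls

    // ... generation loop: transformer(token, pos, &config, &state, &weights); ...

    #pragma omp target exit data map(delete: key_cache[0:cache_len])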

Any ideas on how to improve the state struct for easier offloading?

leuc · Oct 01 '23

Is there still a chance this can be merged?

ziliangpeng · Nov 13 '23