Better Support for AMD and ROCm via Docker containers.

Open jamiemoller opened this issue 1 year ago • 33 comments

Presently it is very hard to get a Docker container to build with the ROCm backend; some elements seem to fail independently during the build process. Other related projects have functional Docker implementations that work with ROCm out of the box (e.g. llama.cpp). I would like to work on this myself, but between the speed at which things change in this project and the amount of free time I have to work on it, I am left only to ask for this.

If there are already good, 'stable' methods for building a Docker implementation with ROCm underneath, it would be very much appreciated if they could be better documented. 'Arch' helps nobody who wants to run on a more enterprisey OS like RHEL or SLES.

Presently I have defaulted back to using textgen, as it has a mostly functional API, but its feature set is kinda woeful. (Still better than running llama.cpp directly, IMO.)

jamiemoller avatar Jan 15 '24 01:01 jamiemoller

ps. love the work @mudler

jamiemoller avatar Jan 15 '24 01:01 jamiemoller

It should be noted that:

1 - the documentation for ROCm for some reason indicates make BUILD_TYPE=hipblas GPU_TARGETS=gfx1030 ..., but there is no such build arg
2 - stablediffusion is the hardest thing to get working in any environment I've tested; I have yet to actually get it to build on Arch, Debian, or openSUSE
3 - the following Dockerfile is the smoothest build I've had so far

FROM archlinux

# Install deps
# ncnn not required as stablediffusion build is broken
RUN pacman -Syu --noconfirm
RUN pacman -S --noconfirm base-devel git rocm-hip-sdk rocm-opencl-sdk opencv clblast grpc go ffmpeg ncnn

# Configure Lib links
ENV CGO_CFLAGS="-I/usr/include/opencv4" \
    CGO_CXXFLAGS="-I/usr/include/opencv4" \
    CGO_LDFLAGS="-L/opt/rocm/hip/lib -lamdhip64 -L/opt/rocm/lib -lOpenCL -L/usr/lib -lclblast -lrocblas -lhipblas -lrocrand -lomp -O3 --rtlib=compiler-rt -unwindlib=libgcc -lhipblas -lrocblas --hip-link"

# Configure Build settings
ARG BUILD_TYPE="hipblas"
ARG GPU_TARGETS="gfx906" # selected for RadeonVII
ARG GO_TAGS="tts" # stablediffusion is broken

# Build
RUN git clone https://github.com/go-skynet/LocalAI
WORKDIR /LocalAI
RUN make BUILD_TYPE=${BUILD_TYPE} GPU_TARGETS=${GPU_TARGETS} GO_TAGS=${GO_TAGS} build

# Clean up
RUN pacman -Scc --noconfirm
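
For reference, this is roughly how I build and run an image from the Dockerfile above (a sketch, not a tested recipe: the image tag, model directory and entrypoint flags are placeholders, and ROCm needs /dev/kfd and /dev/dri passed through plus the video group):

# Build the image from the Dockerfile above (the tag is arbitrary)
docker build -t localai-rocm-arch .

# Run it with the devices ROCm needs; 8080 is LocalAI's default port
docker run -d --name localai \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  -p 8080:8080 \
  -v "$PWD/models:/models" \
  localai-rocm-arch ./local-ai --models-path /models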

jamiemoller avatar Jan 15 '24 12:01 jamiemoller

It should be noted that, while I do see models load onto the card whenever there is an API call, and computations are performed that push the card to 200 W of power draw, the API call never returns and the apparent inference never terminates.

jamiemoller avatar Jan 16 '24 02:01 jamiemoller

Presently it is very hard to get a Docker container to build with the ROCm backend; some elements seem to fail independently during the build process. Other related projects have functional Docker implementations that work with ROCm out of the box (e.g. llama.cpp). I would like to work on this myself, but between the speed at which things change in this project and the amount of free time I have to work on it, I am left only to ask for this.

I don't have an AMD card to test with, so this card is up for grabs.

Things are moving fast, right, but building-wise this is a good time window; there are no plans to make changes in that code area in the short term.

If there are already good, 'stable' methods for building a Docker implementation with ROCm underneath, it would be very much appreciated if they could be better documented. 'Arch' helps nobody who wants to run on a more enterprisey OS like RHEL or SLES.

A good starting point would be this section: https://github.com/mudler/LocalAI/blob/9c2d2649796907006568925d96916437f5845aac/Dockerfile#L159 - we can pull ROCm dependencies in there if the appropriate flag is passed.
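
For example, something along these lines could be gated behind BUILD_TYPE=hipblas in that section (an untested sketch; the repository URL and package choice follow AMD's Ubuntu install docs for ROCm 6.0 and may need adjusting):

# Assumption: Ubuntu-based image; add AMD's apt repo and install the HIP SDK (hipBLAS/rocBLAS etc.)
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/trusted.gpg.d/rocm.gpg
echo "deb [arch=amd64] https://repo.radeon.com/rocm/apt/6.0 jammy main" > /etc/apt/sources.list.d/rocm.list
apt-get update && apt-get install -y rocm-hip-sdk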

mudler avatar Jan 16 '24 08:01 mudler

@jamiemoller you could use https://github.com/wuxxin/aur-packages/blob/main/localai-git/PKGBUILD as a starting point; it's a (feature-limited) Arch Linux package of LocalAI for CPU, CUDA and ROCm. There are binaries available via arch4edu. See https://github.com/mudler/LocalAI/issues/1437

wuxxin avatar Jan 16 '24 13:01 wuxxin

Please do work on that. I've been trying to put any load on my AMD GPU for a week now. Building from source on Ubuntu for CLBlast fails in so many ways it's not even funny.

Expro avatar Jan 31 '24 22:01 Expro

I have a feeling it will be better to start from here (or something similar) for AMD builds now that 2.8 is on Ubuntu 22.04.

jamiemoller avatar Feb 14 '24 00:02 jamiemoller

Made some progress on https://github.com/mudler/LocalAI/pull/1595 (thanks to @fenfir for having started this up), but I don't have an AMD video card; however, CI seems to pass and container images are being built just fine.

I will merge as soon as the v2.8.2 images are out - @jamiemoller @Expro could you give the images a shot as soon as they are on master?

mudler avatar Feb 15 '24 22:02 mudler

Sure, I will take them for a spin. Thanks for working on that.

Expro avatar Feb 16 '24 12:02 Expro

The hipblas images are pushed now:

quay.io/go-skynet/local-ai:master-hipblas-ffmpeg-core 
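
For anyone testing, starting it should look roughly like this (a sketch; mount your own models directory and add whatever env you need):

docker run -d --name local-ai \
  --device=/dev/kfd --device=/dev/dri \
  -p 8080:8080 \
  -e DEBUG=true \
  -v "$PWD/models:/build/models" \
  quay.io/go-skynet/local-ai:master-hipblas-ffmpeg-core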

mudler avatar Feb 17 '24 09:02 mudler

Unfortunately, not working as intended. The GPU was detected, but nothing was offloaded:

4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ggml_init_cublas: found 1 ROCm devices:
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /build/models/c0c3c83d0ec33ffe925657a56b06771b (version GGUF V3 (latest))
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 0: general.architecture str = phi2
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 1: general.name str = Phi2
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 2: phi2.context_length u32 = 2048
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 5: phi2.block_count u32 = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 10: general.file_type u32 = 7
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 19: general.quantization_version u32 = 2
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - type f32: 195 tensors
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - type q8_0: 130 tensors
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: format = GGUF V3 (latest)
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: arch = phi2
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: vocab type = BPE
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_vocab = 51200
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_merges = 50000
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_ctx_train = 2048
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd = 2560
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_head = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_head_kv = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_layer = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_rot = 32
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_head_k = 80
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_head_v = 80
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_gqa = 1
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_k_gqa = 2560
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_v_gqa = 2560
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_norm_eps = 1.0e-05
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_norm_rms_eps = 0.0e+00
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_ff = 10240
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_expert = 0
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_expert_used = 0
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: rope scaling = linear
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: freq_base_train = 10000.0
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: freq_scale_train = 1
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_yarn_orig_ctx = 2048
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: rope_finetuned = unknown
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model type = 3B
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model ftype = Q8_0
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model params = 2.78 B
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model size = 2.75 GiB (8.51 BPW)
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: general.name = Phi2
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: LF token = 128 'Ä'
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: ggml ctx size = 0.12 MiB
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: offloading 0 repeating layers to GPU
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: offloaded 0/33 layers to GPU
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: ROCm_Host buffer size = 2819.28 MiB
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr .............................................................................................
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: n_ctx = 512
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: freq_base = 10000.0
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: freq_scale = 1
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_kv_cache_init: ROCm_Host KV buffer size = 160.00 MiB
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: ROCm_Host input buffer size = 6.01 MiB
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: ROCm_Host compute buffer size = 115.50 MiB
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: graph splits (measure): 1
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr Available slots:
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr -> Slot 0 - max context: 512
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr all slots are idle and system prompt is empty, clear the KV cache
4:14PM INF [llama-cpp] Loads OK
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr slot 0 is processing [task id: 0]
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr slot 0 : kv cache rm - [0, end)
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr CUDA error: shared object initialization failed
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr current device: 0, in function ggml_cuda_op_mul_mat at /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:9462
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr hipGetLastError()
4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr GGML_ASSERT: /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:241: !"CUDA error"

Tested with the integrated phi-2 model, with gpu_layers specified:

name: phi-2
context_size: 2048
f16: true
gpu_layers: 90
mmap: true
trimsuffix:
  - "\n"
parameters:
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  seed: -1
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  completion: *template
usage: |
  To use this model, interact with the API (in another terminal) with curl for instance:
  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "phi-2",
    "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
  }'

Expro avatar Feb 20 '24 16:02 Expro

The ROCm Docker image does appear to load the model; however, there is a gRPC error I have encountered that causes the call to terminate before inference. I am moving to 22.04 with ROCm 6.0.0 on the host to make sure there are no version compatibility issues.

Note: the new Vulkan implementation of llama.cpp seems to work flawlessly.

jtwolfe avatar Mar 02 '24 06:03 jtwolfe

I'm trying to work on the hipblas version, but I am confused about where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas". One thing I noticed is that the latest hipblas images are still using ROCm v6.0.0 while v6.0.3 is now out, but I have been unable to locate a Dockerfile in the git repo that installs any version of ROCm. So it would appear the Dockerfile being used is hosted elsewhere?

I would appreciate it if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you.

derzahla avatar Apr 02 '24 16:04 derzahla

I'm trying to work on the hipblas version, but I am confused about where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas". One thing I noticed is that the latest hipblas images are still using ROCm v6.0.0 while v6.0.3 is now out, but I have been unable to locate a Dockerfile in the git repo that installs any version of ROCm. So it would appear the Dockerfile being used is hosted elsewhere?

I would appreciate it if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you.

Newer does not equal better. This said, x.x.Y releases (patch versions) are usually hotfixes and usually only apply to some very specific edge cases; can you clarify any issues you have with 6.0.0 that are resolved with 6.0.3?

jtwolfe avatar Apr 07 '24 05:04 jtwolfe

The ROCm Docker image does appear to load the model; however, there is a gRPC error I have encountered that causes the call to terminate before inference. I am moving to 22.04 with ROCm 6.0.0 on the host to make sure there are no version compatibility issues.

Note: the new Vulkan implementation of llama.cpp seems to work flawlessly.

I think I just discovered the cause of my issue... I am running my Radeon VII for this workload, which is a gfx906 device. Presently I find only GPU_TARGETS ?= gfx900,gfx90a,gfx1030,gfx1031,gfx1100 in the Makefile (no gfx906); also, regarding that list, gfx900 is not supported for ROCm v5.x or v6.0.0.

I have yet to test whether a tailored build including gfx906 will work, but this may be a good candidate for inclusion in the next hipblas build.
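
For reference, the tailored build I have in mind is just the same make invocation as in my Dockerfile above, with gfx906 as the target (untested sketch):

# GPU_TARGETS is a make/environment variable, not a Docker build arg
make BUILD_TYPE=hipblas GPU_TARGETS=gfx906 GO_TAGS=tts build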

For reference, currently under 6.0.0 the following LLVM targets are supported: gfx942, gfx90a, gfx908, gfx906, gfx1100, gfx1030. I would note for clarity that the gfx906 target is deprecated for the Instinct MI50 but not for the Radeon Pro VII or the Radeon VII. Add to this that the Instinct MI25 is the only gfx900 card and is noted as no longer supported; while I do think we should keep gfx900 in place for as long as possible, it may impact future builds.

I may not have time to test an amendment to the GPU_TARGETS for the next few weeks (I only have like 2 hrs free today, and after building my GPU into a single-node k8s cluster I need to configure a local container registry before I can test any custom builds :( )

@fenfir might you be able to test this?

jtwolfe avatar Apr 07 '24 05:04 jtwolfe

OK, so FYI: the current master-hipblas-ffmpeg-core image with GPU_TARGETS=gfx906 does not build:

[  0%] Building C object CMakeFiles/ggml.dir/ggml.c.o
[  1%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[  1%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[  2%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
[  2%] Building CXX object CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
clang++: error: invalid target ID 'gfx903'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')
gmake[4]: *** [CMakeFiles/ggml.dir/build.make:132: CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o] Error 1
2024-04-07T15:31:29.842216496+10:00 gmake[4]: Leaving directory '/build/backend/cpp/llama/llama.cpp/build'
gmake[3]: *** [CMakeFiles/Makefile2:842: CMakeFiles/ggml.dir/all] Error 2
gmake[3]: Leaving directory '/build/backend/cpp/llama/llama.cpp/build'
2024-04-07T15:31:29.842808442+10:00 gmake[2]: *** [Makefile:146: all] Error 2
2024-04-07T15:31:29.842836792+10:00 gmake[2]: Leaving directory '/build/backend/cpp/llama/llama.cpp/build'
make[1]: *** [Makefile:75: grpc-server] Error 2
make[1]: Leaving directory '/build/backend/cpp/llama'
make: *** [Makefile:517: backend/cpp/llama/grpc-server] Error 2

EDIT: 'waaaaaaiiiiit a second', I think I messed up... EDIT 2: yep, I definitely messed up; setting the environment var GPU_TARGETS=gfx906 worked fine, now I just need to get my model and context right <3 @mudler @fenfir <3 Can we please get gfx906 added to the default targets?

jtwolfe avatar Apr 07 '24 05:04 jtwolfe

@Expro take a look at my previous posts, maybe they will help you solve this; ping me if you like, maybe I can help.

jtwolfe avatar Apr 07 '24 05:04 jtwolfe

@mudler before I spend the time, are there any immediate plans for expanded k8s docs or AMD-specific docs?

jtwolfe avatar Apr 07 '24 06:04 jtwolfe

@mudler before I spend the time, are there any immediate plans for expanded k8s docs or AMD-specific docs?

Hey @jtwolfe, thanks for deep-diving into this. I don't have an AMD card to test things out, so I refrained from writing documentation that I couldn't test. Any help in that area is greatly appreciated.

mudler avatar Apr 07 '24 08:04 mudler

Ack. I'll do my best to get some of our AMD brethren to test some more edge cases so we can give more details on modern cards, and I will send up a PR for docs when I get time.

jtwolfe avatar Apr 07 '24 08:04 jtwolfe

I'm trying to work on the hipblas version, but I am confused about where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas". One thing I noticed is that the latest hipblas images are still using ROCm v6.0.0 while v6.0.3 is now out, but I have been unable to locate a Dockerfile in the git repo that installs any version of ROCm. So it would appear the Dockerfile being used is hosted elsewhere? I would appreciate it if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you.

Newer does not equal better. This said, x.x.Y releases (patch versions) are usually hotfixes and usually only apply to some very specific edge cases; can you clarify any issues you have with 6.0.0 that are resolved with 6.0.3?

I hope you're using containers \winkyface

It appears that the AMD advice regarding 'downwards compatibility' is correct; i.e. I am currently running 6.0.2 on my server while the container runs 6.0.0, and I have yet to have any issues.

If you wish to keep your server driver up to date: as long as the major version is the same between the host and the container, and the host's minor version is greater than (or equal to) the container's, you should not have any problems.

e.g. (yes, I know 6.1.0 does not exist)

host  | container | result
5.4.0 | 6.0.0     | fail
6.0.0 | 5.4.0     | fail
6.0.0 | 6.0.0     | success
6.1.0 | 6.0.1     | success
6.0.1 | 6.1.0     | fail

Really, there should not be an issue in either direction with minor version updates; however, there is the potential for lower-level operations to be broken accidentally by the implementation of whatever program makes the calls. This said, I would still recommend keeping to the AMD guidance.

I would recommend, for compatibility's sake, that we keep the container ROCm version at 6.0.0 until such time as there is a breaking change that stops this backwards compatibility.

jamiemoller avatar Apr 09 '24 02:04 jamiemoller

I'm trying to work on the hipblas version, but I am confused about where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas". One thing I noticed is that the latest hipblas images are still using ROCm v6.0.0 while v6.0.3 is now out, but I have been unable to locate a Dockerfile in the git repo that installs any version of ROCm. So it would appear the Dockerfile being used is hosted elsewhere?

I would appreciate it if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you.

@derzahla I would not recommend building it from scratch. Grab the hipblas image and pass it the REBUILD=true var; also, if you have issues after the rebuild, check the LLVM target for your card and pass in GPU_TARGETS=gfx$WHATEVER.

Find the LLVM target for your GPU at https://llvm.org/docs/AMDGPUUsage.html#processors, then check its compatibility with ROCm at https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.0/reference/system-requirements.html

This should work. I'm lucky to have a card that's directly referenced on the ROCm supported-GPU list, but I expect that any chip associated with the LLVM target should work (i.e. gfx1030 includes the RX 6800, RX 6800 XT and RX 6900 XT, but according to AMD, "If a GPU is not listed on this table, it's not officially supported by AMD.")
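
Putting that together, the run ends up looking something like this (a sketch; replace the target with whatever the LLVM table gives you for your card, and add your usual mounts):

# REBUILD=true recompiles the backends inside the container for the given target
docker run -d --name local-ai \
  --device=/dev/kfd --device=/dev/dri \
  -p 8080:8080 \
  -e REBUILD=true \
  -e BUILD_TYPE=hipblas \
  -e GPU_TARGETS=gfx1030 \
  quay.io/go-skynet/local-ai:master-hipblas-ffmpeg-core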

jamiemoller avatar Apr 09 '24 03:04 jamiemoller

Newer does not equal better. This said, x.x.Y releases (patch versions) are usually hotfixes and usually only apply to some very specific edge cases; can you clarify any issues you have with 6.0.0 that are resolved with 6.0.3?

There still does not seem to be any release notes out for 6.0.3, but since I have a gfx1103 which isn't officially supported up through 6.0.2, I was hoping maybe it was added in 6.0.3.

However, I have had success with ollama by setting "HSA_OVERRIDE_GFX_VERSION=11.0.2" ( on rocm 6.0.2 & 6.0.3, at least)

I initially tried setting REBUILD=true and it didn't help. That's why I was trying to find the actual Dockerfile used to generate the hipblas registry containers. I can try running with REBUILD=true again and post details of the results

derzahla avatar Apr 09 '24 14:04 derzahla

hmmmm

https://www.reddit.com/r/ROCm/comments/1b36sjj/support_for_gfx1103/ - there is a note here indicating that, if compiled for gfx1100, there may be a path, but from what I see the gfx1103 is an integrated graphics solution/mGPU (is that the case for you?).

If it is, I'm inclined to think that this may be a harder problem than you'd like. As I understand it, there are architectural changes regarding memory management for AMD APUs that may preclude it from being easily compilable with ROCm.

Have you had a look at vLLM with ROCm? https://docs.vllm.ai/en/latest/getting_started/amd-installation.html - you may have some success with a single inference tool. (Beware: I have had it eat >70 GB of memory during the Docker build for the ROCm-supporting image.)

Personally I would love to see an implementation of LocalAI with Vulkan; however, this is all dependent on upstream project support, and I expect there may be a considerable amount of 'hackery' and 'overhead'-related losses that could make it a considerable time sink for developers :(

PS. If this is a mobile GPU, I would ask what the cost/benefit for this looks like. While it would be good for people without access to performant machines, I expect a better solution would be to find an eGPU chassis on eBay and fill it with a cheap RX 6600/RX 7600 or the like.

PPS. I have used LM Studio on my Legion Go with its Z1, and while it did work 'sometimes' (memory allocation, I think), I did not get any better performance than doing straight CPU inference on one of my 7950X systems (~12 ± 5 tokens/s).

jamiemoller avatar Apr 10 '24 01:04 jamiemoller

@jamiemoller Interestingly, the LLM function seems to work if I recompile for gfx1100 as you mentioned and change HSA_OVERRIDE_GFX_VERSION to 11.0.0. I wonder if gfx1102 and HSA_OVERRIDE_GFX_VERSION=11.0.2 would work with ROCm upgraded to >= 6.0.2.

Yes, my gfx1103 is an iGPU, but it's not mobile. I have a Radeon 8600G in an ATX case, so I can upgrade to a more powerful GPU easily enough, but I wanted to push the limits of this iGPU first and see if it would be sufficient.
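
Concretely, what worked for the LLM part was along these lines (a sketch of my setup, not a general recipe; the image tag is whichever hipblas build you are on):

# Rebuild for gfx1100 and override the reported gfx version so ROCm treats the 8600G iGPU as gfx1100
docker run -d --name local-ai \
  --device=/dev/kfd --device=/dev/dri \
  -p 8080:8080 \
  -e REBUILD=true \
  -e BUILD_TYPE=hipblas \
  -e GPU_TARGETS=gfx1100 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  quay.io/go-skynet/local-ai:master-hipblas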

I have not tried vLLM, but thanks for making me aware of it. ollama works very nicely for LLM functionality. One of the things I was looking forward to with LocalAI is AI art integration with stablediffusion and tinydream. Stablediffusion still pukes on the rebuilt container with:

7:46PM DBG GRPC(stablediffusion_assets-127.0.0.1:35289): stderr /tmp/localai/backend_data/backend-assets/grpc/stablediffusion: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory

So again, it would be nice if someone could point me to the Dockerfiles used to build the hipblas images so I could modify them for my needs.

derzahla avatar Apr 12 '24 19:04 derzahla

@derzahla

last question first:

  • I'm pretty sure all of the image configurations are set in the GitHub Actions workflows in the repo. I'm more of a GitLab CI guy myself, but it looks pretty simple; just check out https://github.com/mudler/LocalAI/tree/master/.github/workflows - if you look at image.yml, image_build.yml, image-pr.yml and release.yaml you will find all the details regarding overrides for the build process.

Good to know that there is a workaround for 'hotfixed' target versions. Strange though that you're having the SD issue. I'm currently looking into image generation myself but haven't had any luck so far; from memory, most image-gen implementations use ROCm 5.x and a custom build of a Python library (PyTorch) that provides the CUDA-style interface.

I'm working my way through the feature list now to test for the docs.

So far I've tested as working:

  • textgen (GPU)
  • TTS (GPU) - I think Piper hit like 5% of my GPU for about 2.5 s to generate the first 20% of the turbo encabulator talk
  • STT (CPU) - Whisper is fast on anything
  • vision (GPU)
  • embeddings - was doing something funny because of transformers
  • diffusion - \shrug - still investigating

EDIT: for some reason diffusers-rocm.yml does not note the --extra-index-url as per the PyTorch docs https://pytorch.org/get-started/locally/ - unsure if this has any impact, as /rocm6.0/* forwards to /* in the same index URL.

EDIT 2: I have found and replicated your libomp.so issue. I'm having a hard time figuring out what's calling it, though; also no, the easy 'just install the library' solution doesn't seem to work at the moment. I think there's another dependency somewhere that expects it as a prerequisite.

2024-04-13T15:55:25.964632811+10:00 5:55AM DBG GRPC(stablediffusion_assets-127.0.0.1:41555): stderr /tmp/localai/backend_data/backend-assets/grpc/stablediffusion: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory

EDIT 3: so it appears that the libomp.so library issue only occurs with SD in CPU mode (i.e. the aio/cpu/image-gen.yaml); when using the aio/gpu-8g/image-gen.yaml another error appears, which results in a connection error from gRPC:

7:13AM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:40605): stderr /build/backend/python/diffusers/run.sh: line 13: activate: No such file or directory
7:13AM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:40605): stderr /build/backend/python/diffusers/run.sh: line 19: python: command not found

This specifically refers to:

ln 13 source activate diffusers

ln 19 python $DIR/backend_diffusers.py $@

I have found that /opt/conda is straight-up not available - bingo. So it looks like there's some Python stuff missing; so, what next? I have switched to the non-'core' image, since, if memory serves, 'core' removes some Python-related things to slim down the image.
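
For anyone else hitting this, a quick way to check whether the image you are running actually ships the Python backend environments (a sketch; the container name is whatever you called yours):

# The diffusers backend's run.sh expects a conda env it can 'activate'
docker exec -it local-ai ls /opt/conda/envs

# And the backend script it tries to launch lives here
docker exec -it local-ai ls /build/backend/python/diffusers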

Now my problem is downloading a 20 GB image at 'quay speed'.

edit4: ... yep

  • the non-core image fixes the GPU requirements for diffusion; CPU still didn't work for some reason (I expect it's the model)
  • some other minor build weirdness has disappeared too, so I expect things should be smooth-ish for most of the Python implementations
  • still haven't tested embeddings
  • can we re-add the Python elements post hoc, rather than having a huge image?

The image is so large I have to move my models to another disk :|

jtwolfe avatar Apr 13 '24 05:04 jtwolfe

@derzahla I think there might be some reduced feature set for iGPUs (my bet is on something memory-adjacent) that is a bit of a sticking point in the drivers at the moment. The news was that ROCm 5.7(?) was dropping support for a bunch of cards soon, so I'm really not sure how much compatibility we're going to get with older chips without an "AI-specific architecture".

Cheeky solution: if you can make vGPU work on your host system, just find some good tools and run them independently; automatic1111 and oobabooga come to mind ;) split the API with a proxy.

jtwolfe avatar Apr 13 '24 10:04 jtwolfe

@derzahla I apologize, but I was incorrect; I see in my testing that there is still an issue with SD. It appears that, as I was testing the AIO models, I did not realise that the CPU and GPU examples actually use different backends: the functional GPU model makes use of diffusers, while the CPU model makes use of stablediffusion. Presently I trust the diffusers backend more than the stablediffusion one, as SD seems to be just a prebuilt repo that executes entirely separately from the diffusers backend; as such, the bug is probably in the upstream repo from @EdVince.

I am inclined to ask @mudler whether he is aware of any reason why this may not be working. (Also, if you're listening @mudler: I seem to recall that in my testing around v2.0 on CPU, LocalAI would jettison unused models if there was not enough memory and then complete loading the model; this is not working for GPU at the moment :| any ideas? Like @derzahla noted about SD, could it possibly be the rebuild?)

But either way, the GPU-accelerated model using the diffusers backend seems to be working without issue.

It's also worth noting that the Intel solution has a different configuration again, so I'm unsure whether that will work either.

I swear my headstone will read 'still testing'.

jtwolfe avatar Apr 16 '24 10:04 jtwolfe

Hi, I have a Radeon VII and was able to get it working with LocalAI. I did have to make some tweaks to get it to build and use gfx906, however...

# docker-compose.yaml
    image: quay.io/go-skynet/local-ai:v2.12.4-aio-gpu-hipblas
    environment:
      - DEBUG=true
      - REBUILD=true
      - BUILD_TYPE=hipblas
      - GPU_TARGETS=gfx906
    devices:
      - /dev/dri
      - /dev/kfd
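
To confirm the offload is actually happening, I watch the card on the host and grep the container logs (a sketch; rocm-smi comes with the host ROCm install, and the container name depends on your compose file):

# GPU utilisation and VRAM on the host while a completion is running
watch -n 1 rocm-smi

# llama.cpp should report layers going to the GPU, e.g. "offloaded 33/33 layers to GPU"
docker logs local-ai 2>&1 | grep offloaded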

Cheers

bunder2015 avatar Apr 18 '24 11:04 bunder2015

@bunder2015 when you say 'it', do you mean the container or the 'stablediffusion' backend? Also, would you mind listing any of the AIO-defined models and whether they offload to the GPU? Any details you can confirm with testing would be appreciated.

Also: I have had issues with using the 'cloned-voice' backend. It is currently giving me an error due to a missing OpenCL library, in the same fashion as the missing libomp.so issue for SD.

Any detail would be appreciated.

Also, FYI: I am using GO_TAGS="stablediffusion tinydream tts" and DEBUG="true" for my rebuild of the 'non-core', 'non-aio' 'latest' master image.

jamiemoller avatar Apr 18 '24 23:04 jamiemoller