SYCL: "No kernel named _ZTSZZL17rms_norm_f32_ ... was found" on Intel Arc A770
LocalAI version:
v2.22.1
Environment, CPU architecture, OS, and Version:
Ubuntu 22.04 Linux gpubench 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 2 16:16:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Intel Arc A770 (requires newer drivers/etc. to be correctly identified, both in the containers and locally)
Have tried the SYCL containers and building locally on the machine.
Describe the bug
I'm trying to get a new Arc A770 working with SYCL through LocalAI; the card does require quite new drivers/etc. to be fully recognized. I have tried both the containers and building locally on the machine.
I hit the error below (see Logs) when trying to run any model.
However, running llama.cpp directly on the same machine/setup/model works perfectly fine.
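For reference, this is roughly how I build and run llama.cpp on the same box (a sketch of my setup, not an exact invocation; it assumes oneAPI is installed at the default /opt/intel/oneapi path):

```bash
# Rough sketch of the working llama.cpp-from-source path on the same machine.
source /opt/intel/oneapi/setvars.sh

# Build llama.cpp with the SYCL backend using the oneAPI compilers.
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# Run the same Mistral GGUF fully offloaded to the A770 (33 layers, as in the logs below).
./build/bin/llama-cli -m /var/localai/models/Mistral-7B-Instruct-v0.3.Q3_K_M.gguf \
  -ngl 33 -p "Hello"
```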
To Reproduce
I can reproduce this by running the SYCL-based containers and running an 'apt-get update && apt-get upgrade' to get the drivers/Intel oneAPI/etc. up to date so the card is recognized correctly (rough commands sketched below); otherwise it shows up with 256 MB of memory and won't run a model.
Or locally, by having the same updated drivers installed, again so the card is recognized correctly.
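Roughly, the update and sanity-check steps look like this (a sketch; it assumes the Intel graphics/oneAPI apt repositories are already configured, as they are in the images I tried):

```bash
# Inside the SYCL container (or on the host): pull in newer Intel GPU
# driver / oneAPI packages so the A770 is detected with its full 16GB.
apt-get update && apt-get upgrade -y

# Sanity checks (clinfo may need to be installed separately):
sycl-ls                                 # should list a [level_zero:gpu] Arc A770 device
clinfo | grep -i "global memory size"   # should report ~16GiB, not 256MB
```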
Expected behavior
To be able to load and run models correctly on an Intel Arc A770, since the same setup does run fine with llama.cpp on its own.
Logs
9:25PM INF Loading model 'Mistral-v0.3-7B-Q3_K_M' with backend llama-cpp-grpc 9:25PM DBG Loading model in memory from file: /var/localai/models/Mistral-7B-Instruct-v0.3.Q3_K_M.gguf 9:25PM DBG Loading Model Mistral-v0.3-7B-Q3_K_M with gRPC (file: /var/localai/models/Mistral-7B-Instruct-v0.3.Q3_K_M.gguf) (backend: llama-cpp-grpc): {backendString:llama-cpp-grpc model:Mistral-7B-Instruct-v0.3.Q3_K_M.gguf modelID:Mistral-v0.3-7B-Q3_K_M assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000c8c008 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false} 9:25PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-grpc 9:25PM DBG GRPC Service for Mistral-v0.3-7B-Q3_K_M will be running at: '127.0.0.1:33321' 9:25PM DBG GRPC Service state dir: /tmp/go-processmanager561768969 9:25PM DBG GRPC Service Started 9:25PM DBG Wait for the service to start up 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stdout Server listening on 127.0.0.1:33321 9:25PM DBG GRPC Service Ready 9:25PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:Mistral-7B-Instruct-v0.3.Q3_K_M.gguf ContextSize:12288 Seed:1763834203 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:33 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/var/localai/models/Mistral-7B-Instruct-v0.3.Q3_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false} 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr ggml_sycl_init: SYCL_USE_XMX: yes 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr ggml_sycl_init: found 1 SYCL devices: 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /var/localai/models/Mistral-7B-Instruct-v0.3.Q3_K_M.gguf (version GGUF V3 (latest)) 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 0: general.architecture str = llama 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 1: general.name str = models--mistralai--Mistral-7B-Instruc... 
9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 2: llama.block_count u32 = 32 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 3: llama.context_length u32 = 32768 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000000.000000 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 10: general.file_type u32 = 12 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 11: llama.vocab_size u32 = 32768 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 13: tokenizer.ggml.model str = llama 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 14: tokenizer.ggml.pre str = default 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[... 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32768] = [0.000000, 0.000000, 0.000000, 0.0000... 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32768] = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 23: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... 
9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 24: general.quantization_version u32 = 2 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 25: quantize.imatrix.file str = ./imatrix.dat 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 26: quantize.imatrix.dataset str = group_40.txt 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 27: quantize.imatrix.entries_count i32 = 224 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - kv 28: quantize.imatrix.chunks_count i32 = 74 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - type f32: 65 tensors 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - type q3_K: 129 tensors 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - type q4_K: 92 tensors 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - type q5_K: 4 tensors 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_model_loader: - type q6_K: 1 tensors 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_vocab: special tokens cache size = 771 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_vocab: token to piece cache size = 0.1731 MB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: format = GGUF V3 (latest) 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: arch = llama 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: vocab type = SPM 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_vocab = 32768 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_merges = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: vocab_only = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_ctx_train = 32768 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_embd = 4096 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_layer = 32 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_head = 32 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_head_kv = 8 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_rot = 128 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_swa = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_embd_head_k = 128 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_embd_head_v = 128 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_gqa = 4 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_embd_k_gqa = 1024 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_embd_v_gqa = 1024 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: f_norm_eps = 0.0e+00 9:25PM DBG 
GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: f_logit_scale = 0.0e+00 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_ff = 14336 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_expert = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_expert_used = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: causal attn = 1 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: pooling type = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: rope type = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: rope scaling = linear 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: freq_base_train = 1000000.0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: freq_scale_train = 1 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: n_ctx_orig_yarn = 32768 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: rope_finetuned = unknown 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: ssm_d_conv = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: ssm_d_inner = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: ssm_d_state = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: ssm_dt_rank = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: ssm_dt_b_c_rms = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: model type = 7B 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: model ftype = Q3_K - Medium 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: model params = 7.25 B 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: model size = 3.28 GiB (3.89 BPW) 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: general.name = models--mistralai--Mistral-7B-Instruct-v0.3 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: BOS token = 1 '<s>' 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: EOS token = 2 '</s>' 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: UNK token = 0 '<unk>' 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: LF token = 781 '<0x0A>' 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: EOG token = 2 '</s>' 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_print_meta: max token length = 48 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_tensors: ggml ctx size = 0.27 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_tensors: offloading 32 repeating layers to GPU 9:25PM DBG 
GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_tensors: offloading non-repeating layers to GPU 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_tensors: offloaded 33/33 layers to GPU 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_tensors: SYCL0 buffer size = 3304.02 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llm_load_tensors: CPU buffer size = 55.00 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr ................................................................................................. 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: n_ctx = 12288 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: n_batch = 512 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: n_ubatch = 512 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: flash_attn = 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: freq_base = 1000000.0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: freq_scale = 1 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr [SYCL] call ggml_check_sycl 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr ggml_check_sycl: GGML_SYCL_DEBUG: 0 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr ggml_check_sycl: GGML_SYCL_F16: no 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr found 1 SYCL devices: 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr | | | | |Max | |Max |Global | | 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr | | | | |compute|Max work|sub |mem | | 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr |ID| Device Type| Name|Version|units |group |group|size | Driver version| 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------| 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr | 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.3.29735+27| 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_kv_cache_init: SYCL0 KV buffer size = 1536.00 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: KV self size = 1536.00 MiB, K (f16): 768.00 MiB, V (f16): 768.00 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: SYCL_Host output buffer size = 0.12 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: SYCL0 compute buffer size = 824.00 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: SYCL_Host compute buffer size = 32.01 MiB 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: graph nodes = 1030 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr llama_new_context_with_model: graph splits = 2 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) 9:25PM DBG GRPC(Mistral-v0.3-7B-Q3_K_M-127.0.0.1:33321): stderr No kernel named _ZTSZZL17rms_norm_f32_syclPKfPfiifPN4sycl3_V15queueEiENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_ was foundException caught at file:/home/ubuntu/Github/LocalAI/backend/cpp/llama-grpc/llama.cpp/ggml/src/ggml-sycl.cpp, line:3546
Additional context
Hi!
Thanks for bringing this up - what's not really clear to me is: is it working for you after running apt-get upgrade in the LocalAI container?
No, neither through the containers (even after doing updates/upgrades) nor by building locally from source; it ends up with that same "no kernel found" error. llama.cpp built directly from source DOES work, though. I figure I am doing something wrong, or this specific A770 is just too new, since it requires updated drivers/oneAPI to even be recognized as a 16GB card. Makes me love my Nvidia cards just that much more 🤣
No, neither through the containers (even after doing updates/upgrades) nor by building locally from source; it ends up with that same "no kernel found" error.
Ok that's a good data point!
llama.cpp built directly from source DOES work, though.
Did you try the llama.cpp containers? It could be a mismatch between the drivers in the container images and the ones on the host. I remember I had the same issue here until I used the correct version of the drivers on my host.
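Something along these lines should be enough for the comparison (a rough example; image tag, mounts, and flags are just a typical setup, adjust for yours):

```bash
# Example comparison run with the llama.cpp Intel/SYCL container.
docker run --rm -it \
  --device /dev/dri \
  -v /var/localai/models:/models \
  ghcr.io/ggerganov/llama.cpp:light-intel \
  -m /models/Mistral-7B-Instruct-v0.3.Q3_K_M.gguf -ngl 33 -p "Hello"
```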
I figure I am doing something wrong, or this specific A770 is just too new, since it requires updated drivers/oneAPI to even be recognized as a 16GB card.
I have a couple of those as well and they used to work, but since my cluster is now in maintenance mode I can't test it. I will be able to test in a week or two.
I'll try the llama.cpp container directly. llama.cpp from source did work on the machine with the same model and ran fairly well.
I stopped trying to get LocalAI to run in containers and instead ran it directly on the machine, to eliminate any issues with Docker/Kubernetes. Taking that out of the equation and building LocalAI from source still produces the same kernel error.
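For reference, this is roughly how I built it from source (a sketch; the BUILD_TYPE value is the SYCL one from the LocalAI docs, and it assumes oneAPI is at the default path):

```bash
# Rough sketch of the from-source LocalAI build I used (same machine as llama.cpp).
source /opt/intel/oneapi/setvars.sh
# SYCL build type per the LocalAI docs (sycl_f32 is the other option).
BUILD_TYPE=sycl_f16 make build
./local-ai --models-path /var/localai/models
```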
I had tried to match the (updated) drivers/Intel oneAPI between host and containers, as this specific card does require the newer drivers/oneAPI kit to be recognized as 16GB (otherwise it only shows up as 256 MB 🤦, and it was fairly frustrating to get it recognized correctly).
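The way I compared the two sides was roughly this, run once on the host and once inside the container (a sketch; the package names are the usual Intel compute-runtime ones on Ubuntu):

```bash
# Compare the compute stack on the host vs. inside the container.
dpkg -l | grep -E 'intel-opencl-icd|intel-level-zero-gpu|level-zero'
sycl-ls   # the device listing also shows the Level Zero driver version
```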
No rush at all, I was just trying to put this ASRock Intel Arc A770 16GB through some testing. Let me know if I can provide any other information for it. I have several machines/GPUs, and if I can help support the project in other ways I'm happy to do so.
To follow up: I could never get LocalAI to work correctly with this card, even when compiling from source. llama.cpp I can, by updating the Intel oneAPI kit/drivers and recompiling. The Docker containers have the same issue since they use the older Intel oneAPI image as a base. So basically, I think it all comes down to the Intel oneAPI kit and drivers needing to be updated across the board for this specific card to really function, which I'm sure will introduce its own problems.
https://github.com/ggerganov/llama.cpp/issues/10113 - for reference, this seems to be the same issue on the surface, so I can understand if you want to just close this issue for now.
@mudler I'm getting a very similar error trying to run any model on my Arc A380 GPU using LocalAI 2.26.0 (quay.io/go-skynet/local-ai:latest-gpu-intel-f16), and as with the OP, running the same model via llama.cpp (ghcr.io/ggerganov/llama.cpp:light-intel) works fine.
11:25PM DBG GRPC(deepseek-r1-distill-qwen-1.5b-127.0.0.1:44899): stderr No kernel named _ZTSZZL17rms_norm_f32_syclPKfPfiifPN4sycl3_V15queueEiENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_ was foundException caught at file:/build/backend/cpp/llama-fallback/llama.cpp/ggml/src/ggml-sycl/common.cpp, line:99
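For reference, I start the LocalAI container roughly like this (flags and paths are just my setup, nothing canonical):

```bash
# Rough sketch of how I run the LocalAI Intel image on the A380 host.
docker run --rm -it \
  --device /dev/dri \
  -p 8080:8080 \
  -v $PWD/models:/build/models \
  quay.io/go-skynet/local-ai:latest-gpu-intel-f16
```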
I've got the 2025 Intel oneAPI Base Toolkit installed on the machine, which should match what's in both Docker images.
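This is how I checked the host side (assuming the default /opt/intel/oneapi install location):

```bash
# Verify which oneAPI toolkit/runtime the host actually exposes.
source /opt/intel/oneapi/setvars.sh
icpx --version   # DPC++/C++ compiler version from the installed Base Toolkit
sycl-ls          # devices and Level Zero driver version visible to SYCL
```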
Any idea what could be causing the difference between the two approaches?