openvino.genai
[Good First Issue]: Create a GGUF reader
The idea is to have functionality that reads the GGUF format and creates an OpenVINO GenAI compatible representation that can be used to instantiate LLMPipeline() from it. This task includes:
- Parsing GGUF with the gguf-tools library: https://github.com/antirez/gguf-tools/
- Using code similar to MLX's to get the weights and config. Here is the MLX Python API part, but we need its C++ functionality: https://github.com/ml-explore/mlx-examples/blob/c117af83b8cbec15523bd0d69e7a57f01237ca89/llms/gguf_llm/models.py#L275
- Creating the IR model, OpenVINO tokenizer/detokenizer, and config.json on the fly. Here is the POC repository that does this in Python: https://github.com/AlexKoff88/gguf-to-openvino
The initial scope can include support for Llama-based LLMs (e.g. Llama-3.2 and SmolLMs) and FP16, Q8_0, Q4_0, and Q4_1 models. All the code should be written in C++.
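For context, the target usage is that a .gguf file can be fed straight to the existing LLMPipeline API. A rough sketch of what that should look like once the reader lands (the file name is only an example, and the .gguf path support is exactly what this issue adds):

```cpp
#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Hypothetical: point LLMPipeline directly at a GGUF file instead of a
    // converted IR directory. Making this work is the scope of this issue.
    ov::genai::LLMPipeline pipe("SmolLM2-360M.Q8_0.gguf", "CPU");

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;

    std::string result = pipe.generate("The Sun is yellow because", config);
    std::cout << result << std::endl;
    return 0;
}
```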
Can this be broken down into smaller, more exact tasks? This would allow us to pick off tasks one by one and help contributors build something up gradually instead of all at once.
It can for sure, but the way I see it, these tasks should be executed sequentially. For example:
- [ ] One can start by enabling llama-3.2-1b in FP16.
- [ ] Parsing and converting tokenizer from GGUF format to OpenVINO (tokenizer/detokenizer models). After that, we will have core functionality in place.
- Then, a few tasks can be executed in parallel:
- [ ] Enable Q8_0 llama
- [ ] Enable Q4_0 and Q4_1 llama
- [ ] Enable and verify other llama-based models such as Llama-3.1-8B, SmolLMs
- [ ] Enable the most popular quantization schemes such as Q4_K_M
- [ ] Enable Qwen model family
...
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
Hello @AlexKoff88, I would like to work on this. Can this be assigned to me?
Hi @AlexKoff88, @ilya-lavrenov, I am able to parse the GGUF model using the gguf-tools you provided above; however, they are written in C (I had to do some linking on my end). In the POC you provided, in the load_gguf_model function,
the weights are a fairly complex map containing vector<T> and vector<vector<T>> (layers). Should I continue with a similar implementation on my end? Please let me know if I am headed in the right direction. Also, I had a doubt: as you mentioned earlier,
> Parsing and converting tokenizer from GGUF format to OpenVINO (tokenizer/detokenizer models). After that, we will have core functionality in place.
do we only need to parse the tokenizer, not the whole model?
Thank you
This is the update on my end.
main.cpp:
extern "C" {
#include "gguf-tools/gguflib.h"
}
#include "openvino/op/concat.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/gather_elements.hpp"
#include "openvino/op/unsqueeze.hpp"
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <assert.h>
#include <errno.h>
#include <math.h>
#include <inttypes.h>
#include <iostream>
#include <bits/stdc++.h>
struct {
int verbose; // --verbose option
int diffable; // --diffable option
} Opt = {0};
std::pair<std::map<std::string,int>,std::map<std::string,std::vector<double>>> load_gguf_model(const char *model_path){
std::map<std::string,int> config,meta_data;
std::map<std::string,std::vector<double>> weights;
gguf_ctx *ctx = gguf_open(model_path);
if (ctx == NULL) {
perror(model_path);
exit(1);
}
gguf_key key;
while (gguf_get_key(ctx,&key)) {
meta_data[std::string(key.name,key.namelen)] = key.val->uint32;
gguf_next(ctx,key.type,key.val,Opt.verbose);
}
config.emplace("layer_num",meta_data["llama.block_count"]);
config.emplace("head_num",meta_data["llama.attention.head_count"]);
config.emplace("head_size",meta_data["llama.embedding_length"]/meta_data["llama.attention.head_count"]);
config.emplace("head_num_kv",(meta_data.count("llama.attention.head_count_kv")?meta_data["llama.attention.head_count_kv"]:meta_data["llama.attention.head_count"]));
config.emplace("max_position_embeddings",((meta_data.count("llama.context_length")?meta_data["llama.context_length"]:2048)));
config.emplace("rotary_dims",meta_data["llama.rope.dimension_count"]);
config.emplace("rms_norm_eps",meta_data["llama.attention.layer_norm_rms_epsilon"]);
config.emplace("rope_freq_base",((meta_data.count("llama.rope.freq_base")?meta_data["llama.rope.freq_base"]:10000.0)));
for(auto x : config){
std::cout<<x.first<<" : "<<x.second<<std::endl;
}
return{config,weights};
}
int main(int argc, char* argv[]){
std::cout<<"helloworld]n\n";
std::string filename = argv[1];
std::pair<std::map<std::string,int>,std::map<std::string,std::vector<double>>> model = load_gguf_model(argv[1]);
return 0;
}
Makefile:
```makefile
CC = gcc
CXX = g++
CFLAGS = -march=native -ffast-math -g -ggdb -Wall -W -pedantic -O3
INCLUDES = -I./gguf-tools
OBJECTS = gguf-tools/gguflib.o gguf-tools/sds.o gguf-tools/fp16.o

main: $(OBJECTS) main.cpp
	$(CXX) $(CFLAGS) $(INCLUDES) main.cpp $(OBJECTS) -o main

%.o: %.c
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

clean:
	rm -f main $(OBJECTS)
```
I am also implementing the model conversion logic as outlined in the POC and will update as I finish it. Thank you.
@Captain-MUDIT, I would love to collaborate with you. Let's wait for Alex's response and then we can decide how to proceed.
Sure @11happy
@AlexKoff88 .take
Thanks for being interested in this issue. It looks like this ticket is already assigned to a contributor. Please communicate with the assigned contributor to confirm the status of the issue.
Hi @11happy and @Captain-MUDIT, thank you for your interest.
@11happy, regarding your question about how to load the GGUF the right way: you can follow how MLX does it, i.e. add "gguf-tools" as a submodule and borrow the code from MLX that parses GGUF with it. Details are here: https://github.com/ml-explore/mlx/blob/main/mlx/io/gguf.cpp
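To illustrate the direction, here is a minimal sketch of collecting the weights into a simple name-to-buffer map, similar to what MLX's gguf.cpp builds. It assumes gguflib exposes gguf_get_tensor() and gguf_tensor_to_float() with the gguf_tensor fields (name, namelen, num_weights) as in the gguf-tools repo, and that the metadata section has already been consumed with gguf_get_key():

```cpp
extern "C" {
#include "gguf-tools/gguflib.h"
}
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Sketch only: expand every tensor to FP32 and store it by name.
// Assumes the key/value (metadata) section of the file has already been
// consumed with gguf_get_key(), so the context is positioned at the tensors.
std::map<std::string, std::vector<float>> load_gguf_weights(gguf_ctx* ctx) {
    std::map<std::string, std::vector<float>> weights;
    gguf_tensor tensor;
    while (gguf_get_tensor(ctx, &tensor)) {
        // gguf_tensor_to_float() dequantizes supported tensor types into a
        // newly malloc()ed float buffer, or returns NULL for unsupported ones.
        float* data = gguf_tensor_to_float(&tensor);
        if (data == nullptr) continue;

        std::string name(tensor.name, tensor.namelen);
        weights[name] = std::vector<float>(data, data + tensor.num_weights);
        free(data);
    }
    return weights;
}
```

For the real implementation you would keep the quantized data and scales rather than expanding everything to FP32, and create OpenVINO constants with matching element types.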
@Captain-MUDIT, you can take the tokenizer conversion part. The task is to transform the GGUF tokenizer data into an OpenVINO tokenizer. OpenVINO has a dedicated project for converting tokenizers from HF Transformers to OpenVINO format: https://github.com/openvinotoolkit/openvino_tokenizers. The idea is to take the tokenizer config, vocab, and metadata and use part of the openvino_tokenizers lib to do the conversion. Adding @apaniukov for consultations.
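To make that scope concrete, the tokenizer data lives in GGUF metadata under the tokenizer.* keys defined by the GGUF spec. A hypothetical container for what needs to be extracted could look like this (struct and field names are illustrative, not from openvino_tokenizers):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative only: which GGUF metadata keys feed the tokenizer conversion.
struct GGUFTokenizerData {
    std::string model;                    // "tokenizer.ggml.model": e.g. "gpt2" (BPE) or "llama" (SentencePiece)
    std::vector<std::string> tokens;      // "tokenizer.ggml.tokens": the vocabulary
    std::vector<float> scores;            // "tokenizer.ggml.scores": per-token scores (Unigram/SentencePiece)
    std::vector<std::int32_t> token_type; // "tokenizer.ggml.token_type": normal / unknown / control / ...
    std::vector<std::string> merges;      // "tokenizer.ggml.merges": BPE merge rules
    std::uint32_t bos_token_id = 0;       // "tokenizer.ggml.bos_token_id"
    std::uint32_t eos_token_id = 0;       // "tokenizer.ggml.eos_token_id"
    std::string chat_template;            // "tokenizer.chat_template" (Jinja string, if present)
};
```

These fields are roughly what a TokenizerPipeline-style conversion needs in order to rebuild the tokenizer and detokenizer graphs.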
@AlexKoff88 can I also work on this issue?
@janviisonii23, you can, as there are a few subtasks in it, but you will have to wait a bit until the core part is implemented.
@11happy @Captain-MUDIT
There is a TokenizerPipeline class for building tokenizer/detokenizer models. The easiest way is to parse the tokenizer data from the .gguf file, build such a pipeline, and get the models from it; see the HF-tiktoken tokenizer example.
You can get an example of which steps are created by checking the steps attribute of the pipeline object that is created here by converting the GGUF tokenizer using the HF AutoTokenizer class. Note that the resulting tokenizer might not accurately represent the GGUF tokenizer because each conversion step (GGUF → HF → OV) might introduce some errors.
The other way is to build the tokenizer by creating the model graph directly, like in this RWKV example.
You might also have to create several base pipelines for different tokenizer types.
I started my own implementation as there was no progress in the past month: https://github.com/openvinotoolkit/openvino.genai/pull/1885
The plan is to design basic functionality to enable Llama-based models in FP16 and INT8.
@11happy @Captain-MUDIT @apaniukov Any updates or questions from your side? @p-wysocki In case we don't hear an update, should we re-mark this issue to seek additional contributors? Thanks!
@wenjiew I have been able to load the GGUF model on my side by referring to https://github.com/ml-explore/mlx/blob/main/mlx/io/gguf.cpp and am also improving my previous implementation. Should I make a PR for loading models? Thank you.
@wenjiew The "policy" is that GFIs are under the management of GFI creators - if you wish to seek more contributors or unassign inactive ones - it's your decision. My role is just to browse the issues every now and then to try to bump people/contributors if they're inactive.
@wenjiew @p-wysocki if additional contributors are needed, I am happy to help
I am almost done with a working prototype in C++ that creates an inferable OpenVINO IR. It would be great if someone could take a look at how to make the OpenVINO tokenizer from the information available in GGUF.
@AlexKoff88 Can I take it?
Got accurate results for the FP16 SmolLM (Llama-based) model in https://github.com/openvinotoolkit/openvino.genai/pull/1885
Got accurate results for Q8_0 as well.
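For anyone joining the quantization work: Q8_0 is the simplest quantized layout in GGUF. Per the ggml/llama.cpp block format, weights come in blocks of 32 with one FP16 scale followed by 32 int8 values, and dequantization is weight = scale * int8. A minimal sketch under that assumption (illustrative only, not the code from the PR):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal IEEE-754 half -> float conversion (GGUF stores block scales as FP16).
static float fp16_to_fp32(uint16_t h) {
    int sign = (h >> 15) & 1, exp = (h >> 10) & 0x1F, mant = h & 0x3FF;
    float v = (exp == 0)  ? std::ldexp(float(mant), -24)               // zero / subnormal
            : (exp == 31) ? (mant ? NAN : INFINITY)                    // NaN / inf
                          : std::ldexp(float(mant | 0x400), exp - 25); // normal
    return sign ? -v : v;
}

constexpr size_t QK8_0 = 32; // weights per Q8_0 block

// Q8_0 block = 2-byte FP16 scale + 32 int8 values (34 bytes per 32 weights).
std::vector<float> dequantize_q8_0(const uint8_t* data, size_t n_elements) {
    std::vector<float> out(n_elements);
    const size_t block_bytes = sizeof(uint16_t) + QK8_0;
    for (size_t b = 0; b < n_elements / QK8_0; ++b) {
        const uint8_t* block = data + b * block_bytes;
        uint16_t d_half;
        std::memcpy(&d_half, block, sizeof(d_half));
        const float d = fp16_to_fp32(d_half);
        const int8_t* q = reinterpret_cast<const int8_t*>(block + sizeof(d_half));
        for (size_t i = 0; i < QK8_0; ++i)
            out[b * QK8_0 + i] = d * float(q[i]);
    }
    return out;
}
```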
Hi @AlexKoff88,
I see that you've already implemented the conversion of the Hugging Face tokenizer to OpenVINO format in the save_tokenizer function in the POC repo. It looks like the first approach (GGUF → HF → OpenVINO) is already prototyped in Python.
Would you like me to proceed with prototyping the second approach (directly converting GGUF → OpenVINO tokenizer), or should we stick with the easier approach for now? Let me know how you'd like to move forward (this is my understanding of the two approaches, let me know if I missed something).
Also, just to let you know, I’m currently unable to execute the POC because I’m working on a Windows machine, and my Mac is Intel-based, so the MLX library is incompatible with it (only works with Apple Silicon chips).
Hi @Wassim-Hamra,
> Also, just to let you know, I’m currently unable to execute the POC because I’m working on a Windows machine, and my Mac is Intel-based, so the MLX library is incompatible with it (only works with Apple Silicon chips).
Have you tried to compile OpenVINO GenAI from the PR that I mentioned? It should work on Mac x86. I personally use Linux x64. There is no need to install MLX for the PR's implementation. To experiment with Python you can use Google Colab, for example.
Regarding the second workflow: yes, the idea is to use just the GGUF file, which contains the tokenizer information, model config, and weights. Here we need to parse the tokenizer information and create the OpenVINO versions of the tokenizer and detokenizer models. In the PR, I already implemented the GGUF parser, and you need to add the tokenizer-based part. The entire code should be in C++.
Hey folks, since there’s been a lot of development on this issue, I’d like to unassign myself to make room for other interested contributors to jump in and take it forward. @Wassim-Hamra, feel free to take this issue. Thank you.
You can take the conversion of Q4_K_M models, as it is the most popular quantization format for GGUF. I added support for Q4_0 and Q4_1 in my PR https://github.com/openvinotoolkit/openvino.genai/pull/1885. My proposal is to upconvert unsupported types to higher precision, for example, Q6 and Q5 to INT8.
We also agreed that @apaniukov will prepare a Python-based POC for direct read and conversion of the tokenizer from GGUF to OV Tokenizer library format.
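For reference on the 4-bit side: Q4_0 (per the ggml/llama.cpp layout) packs 32 weights per block as one FP16 scale plus 16 bytes of nibbles, with weight = (nibble - 8) * scale; Q4_1 adds a per-block FP16 offset on top. A minimal Q4_0 dequantization sketch (illustrative only; it reuses the same fp16_to_fp32 helper as the Q8_0 sketch above):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same minimal IEEE-754 half -> float helper as in the Q8_0 sketch above.
static float fp16_to_fp32(uint16_t h) {
    int sign = (h >> 15) & 1, exp = (h >> 10) & 0x1F, mant = h & 0x3FF;
    float v = (exp == 0)  ? std::ldexp(float(mant), -24)
            : (exp == 31) ? (mant ? NAN : INFINITY)
                          : std::ldexp(float(mant | 0x400), exp - 25);
    return sign ? -v : v;
}

constexpr size_t QK4_0 = 32; // weights per Q4_0 block

// Q4_0 block = 2-byte FP16 scale + 16 bytes of packed nibbles (18 bytes per 32 weights).
std::vector<float> dequantize_q4_0(const uint8_t* data, size_t n_elements) {
    std::vector<float> out(n_elements);
    const size_t block_bytes = sizeof(uint16_t) + QK4_0 / 2;
    for (size_t b = 0; b < n_elements / QK4_0; ++b) {
        const uint8_t* block = data + b * block_bytes;
        uint16_t d_half;
        std::memcpy(&d_half, block, sizeof(d_half));
        const float d = fp16_to_fp32(d_half);
        const uint8_t* q = block + sizeof(d_half);
        for (size_t j = 0; j < QK4_0 / 2; ++j) {
            // ggml layout: low nibble -> element j, high nibble -> element j + 16.
            out[b * QK4_0 + j]             = (float(q[j] & 0x0F) - 8.0f) * d;
            out[b * QK4_0 + j + QK4_0 / 2] = (float(q[j] >> 4)   - 8.0f) * d;
        }
    }
    return out;
}
```

Q4_K_M builds on this with super-blocks and quantized scales/mins, so it is noticeably more involved than Q4_0/Q4_1, which is why upconverting the less common types to INT8 is a reasonable first step.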
Thank you, I will look into Q4_K_M.