
[Good First Issue]: Create a GGUF reader

Open AlexKoff88 opened this issue 10 months ago • 40 comments

The idea is to have functionality that allows reading the GGUF format and creating an OpenVINO GenAI compatible representation that can be used to instantiate LLMPipeline(). This task includes:

The initial scope can include support for llama-based LLMs (e.g. llama-3.2 and SmolLMs) and the FP16, Q8_0, Q4_0 and Q4_1 formats. All the code should be written in C++.
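For illustration, here is a minimal sketch (in C++) of the intended end state, assuming the GGUF reader produces an OpenVINO GenAI compatible model that LLMPipeline() can consume; feeding a GGUF-derived model this way is the goal of the task, and the path handling below is hypothetical. Only the LLMPipeline/GenerationConfig API shown here already exists.

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    // Hypothetical: argv[1] points to a GGUF file, or to the OpenVINO GenAI
    // compatible representation produced from it by the reader this issue asks for.
    const std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 64;
    std::string result = pipe.generate("What is OpenVINO?", config);
    std::cout << result << std::endl;
    return 0;
}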

AlexKoff88 avatar Feb 03 '25 11:02 AlexKoff88

Can this be broken down into smaller, more exact tasks? That would allow us to pick off tasks one by one and help contributors slowly build something instead of doing it all at once.

Geeks-Sid avatar Feb 03 '25 22:02 Geeks-Sid

It certainly can, but the way I see it, these tasks should be executed sequentially. For example:

  • [ ] One can start by enabling llama-3.2-1b in FP16.
  • [ ] Parsing and converting tokenizer from GGUF format to OpenVINO (tokenizer/detokenizer models). After that, we will have core functionality in place.
  • Then, a few tasks can be executed in parallel:
    • [ ] Enable Q8_0 llama
    • [ ] Enable Q4_0 and Q4_1 llama
    • [ ] Enable and verify other llama-based models such as Llama-3.1-8B, SmolLMs
    • [ ] Enable the most popular quantization schemes such as Q4_K_M
    • [ ] Enable Qwen model family

...

AlexKoff88 avatar Feb 04 '25 06:02 AlexKoff88

.take

11happy avatar Feb 12 '25 06:02 11happy

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

github-actions[bot] avatar Feb 12 '25 06:02 github-actions[bot]

Hello @AlexKoff88, I would like to work on this. Can it be assigned to me?

Captain-MUDIT avatar Feb 15 '25 05:02 Captain-MUDIT

Hi @AlexKoff88, @ilya-lavrenov, I am able to parse a GGUF model using the gguf-tools you provided above; however, they are written in C (I had to do some linking on my end). In the POC you provided, the weights returned by the load_gguf_model function is a rather complex map containing vector<T> and vector<vector<T>> (layers). Should I continue with a similar implementation on my end? Please point me in the right direction if not. Also, I had a doubt about something you mentioned earlier:

Parsing and converting tokenizer from GGUF format to OpenVINO (tokenizer/detokenizer models). After that, we will have core functionality in place.

Do we only need to parse the tokenizer, not the whole model?

Thank you

This is the update on my end:

main.cpp


extern "C" {
    #include "gguf-tools/gguflib.h"
}
#include "openvino/op/concat.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/gather_elements.hpp"
#include "openvino/op/unsqueeze.hpp"
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <assert.h>
#include <errno.h>
#include <math.h>
#include <inttypes.h>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct {
    int verbose;        // --verbose option
    int diffable;       // --diffable option
} Opt = {0};


std::pair<std::map<std::string,int>,std::map<std::string,std::vector<double>>> load_gguf_model(const char *model_path){
    
    std::map<std::string,int> config,meta_data;
    std::map<std::string,std::vector<double>> weights;
    gguf_ctx *ctx = gguf_open(model_path);
    if (ctx == NULL) {
        perror(model_path);
        exit(1);
    }
    gguf_key key;
    while (gguf_get_key(ctx,&key)) {
        // NOTE: this stores every metadata value as uint32; float-valued keys
        // such as llama.attention.layer_norm_rms_epsilon and llama.rope.freq_base
        // would need to be read via their float accessors instead.
        meta_data[std::string(key.name,key.namelen)] = key.val->uint32;
        gguf_next(ctx,key.type,key.val,Opt.verbose);
    }
    config.emplace("layer_num",meta_data["llama.block_count"]);
    config.emplace("head_num",meta_data["llama.attention.head_count"]);
    config.emplace("head_size",meta_data["llama.embedding_length"]/meta_data["llama.attention.head_count"]);
    config.emplace("head_num_kv",(meta_data.count("llama.attention.head_count_kv")?meta_data["llama.attention.head_count_kv"]:meta_data["llama.attention.head_count"]));
    config.emplace("max_position_embeddings",((meta_data.count("llama.context_length")?meta_data["llama.context_length"]:2048)));
    config.emplace("rotary_dims",meta_data["llama.rope.dimension_count"]);
    config.emplace("rms_norm_eps",meta_data["llama.attention.layer_norm_rms_epsilon"]);
    config.emplace("rope_freq_base",((meta_data.count("llama.rope.freq_base")?meta_data["llama.rope.freq_base"]:10000.0)));
    
    for(auto x : config){
        std::cout<<x.first<<" : "<<x.second<<std::endl;
    }
    return{config,weights};
}

int main(int argc, char* argv[]){
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <model.gguf>" << std::endl;
        return 1;
    }
    std::string filename = argv[1];
    std::pair<std::map<std::string,int>,std::map<std::string,std::vector<double>>> model = load_gguf_model(filename.c_str());

    return 0;
}

Makefile:

CC=gcc
CXX=g++
CFLAGS=-march=native -ffast-math -g -ggdb -Wall -W -pedantic -O3
INCLUDES=-I./gguf-tools

OBJECTS=gguf-tools/gguflib.o gguf-tools/sds.o gguf-tools/fp16.o

main: $(OBJECTS) main.cpp
	$(CXX) $(CFLAGS) $(INCLUDES) main.cpp $(OBJECTS) -o main

%.o: %.c
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

clean:
	rm -f main $(OBJECTS)

I am also implementing the model conversion logic as outlined in the POC; I will update once I finish it. Thank you

11happy avatar Feb 15 '25 14:02 11happy

@Captain-MUDIT I would love to collaborate with you; let's wait for Alex's response and then we can decide how to proceed.

11happy avatar Feb 15 '25 14:02 11happy

Sure @11happy

Captain-MUDIT avatar Feb 16 '25 02:02 Captain-MUDIT

@AlexKoff88 .take

Captain-MUDIT avatar Feb 16 '25 07:02 Captain-MUDIT

Thanks for being interested in this issue. It looks like this ticket is already assigned to a contributor. Please communicate with the assigned contributor to confirm the status of the issue.

github-actions[bot] avatar Feb 16 '25 07:02 github-actions[bot]

Hi @11happy and @Captain-MUDIT, thank you for your interest.

@11happy, regarding your question about how to load the GGUF file in the right way: you can follow how MLX does it, i.e. add "gguf-tools" as a submodule and borrow the code from MLX that parses GGUF with it. Details are here: https://github.com/ml-explore/mlx/blob/main/mlx/io/gguf.cpp
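For reference, a rough sketch of what tensor iteration with gguf-tools could look like on top of the snippet above; it assumes gguflib exposes a gguf_get_tensor() helper and a gguf_tensor struct with name, namelen, num_weights and weights_data members (as used by MLX), so the exact names should be verified against gguf-tools/gguflib.h.

extern "C" {
    #include "gguf-tools/gguflib.h"
}
#include <cstdio>
#include <string>

// Assumption: called after the key/value section has been fully consumed with
// gguf_get_key()/gguf_next(); gguflib reads the file sequentially.
void dump_tensors(gguf_ctx* ctx) {
    gguf_tensor tensor;  // assumed struct layout, check gguflib.h
    while (gguf_get_tensor(ctx, &tensor)) {
        std::string name(tensor.name, tensor.namelen);
        // weights_data points into the mapped file; tensor.type says which
        // (de)quantization scheme applies to the raw bytes.
        std::printf("%s: %llu elements\n", name.c_str(),
                    (unsigned long long)tensor.num_weights);
    }
}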

@Captain-MUDIT, you can take the tokenizer conversion part. The task is to transform the GGUF tokenizer data into an OpenVINO tokenizer. OpenVINO has a dedicated project for converting tokenizers from HF Transformers to OpenVINO format: https://github.com/openvinotoolkit/openvino_tokenizers. The idea is to take the tokenizer config, vocab and metadata and use a part of the openvino_tokenizers lib to do the conversion. Adding @apaniukov for consultations.

AlexKoff88 avatar Feb 18 '25 06:02 AlexKoff88

@AlexKoff88 can I also work on this issue?

janviisonii23 avatar Feb 18 '25 07:02 janviisonii23

@janviisonii23, you can, as there are a few subtasks here, but you will have to wait a bit until the core part is implemented.

AlexKoff88 avatar Feb 18 '25 07:02 AlexKoff88

@11happy @Captain-MUDIT

There is a TokenizerPipeline class for building tokenizer/detokenizer models. The easiest way is to parse the tokenizer data from the .gguf file, build such a pipeline and get the models from it; see the HF-tiktoken tokenizer example. You can see which steps are created by checking the steps attribute of the pipeline object that is created here by converting the GGUF tokenizer using the HF AutoTokenizer class. Note that the resulting tokenizer might not accurately represent the GGUF tokenizer, because each conversion step (GGUF → HF → OV) might introduce some errors.

The other way is to build the tokenizer by creating the model graph directly, as in this RWKV example.
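As a rough illustration of what "creating the model graph directly" looks like on the OpenVINO C++ side, here is a minimal sketch that assembles a trivial id-to-value lookup graph out of core opset nodes (Parameter -> Gather over a Constant -> Result). A real GGUF-derived tokenizer/detokenizer would additionally need the string operations provided by openvino_tokenizers, which are not shown here; this only demonstrates graph assembly.

#include <openvino/openvino.hpp>
#include <openvino/op/constant.hpp>
#include <openvino/op/gather.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/op/result.hpp>
#include <memory>
#include <vector>

// Builds a graph that maps token ids to entries of a lookup table, e.g. the
// per-token scores read from a GGUF file (illustrative only).
std::shared_ptr<ov::Model> make_lookup_model(const std::vector<float>& table) {
    auto ids = std::make_shared<ov::op::v0::Parameter>(ov::element::i64, ov::PartialShape{-1});
    auto vocab = ov::op::v0::Constant::create(ov::element::f32, ov::Shape{table.size()}, table);
    auto axis = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{}, {0});
    auto gather = std::make_shared<ov::op::v8::Gather>(vocab, ids, axis);
    auto result = std::make_shared<ov::op::v0::Result>(gather);
    return std::make_shared<ov::Model>(ov::ResultVector{result}, ov::ParameterVector{ids});
}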

You might also have to create several base pipelines for different tokenizer types (image attached).

apaniukov avatar Feb 18 '25 12:02 apaniukov

I started my own implementation since there had been no progress in the past month: https://github.com/openvinotoolkit/openvino.genai/pull/1885

The plan is to design basic functionality that enables Llama-based models in FP16 and INT8.

AlexKoff88 avatar Mar 11 '25 07:03 AlexKoff88

@11happy @Captain-MUDIT @apaniukov Any updates or questions from your side? @p-wysocki In case we don't hear an update, should we re-mark this issue to seek additional contributors? Thanks!

wenjiew avatar Mar 12 '25 05:03 wenjiew

@wenjiew I have been able to load the GGUF model on my side, referring to https://github.com/ml-explore/mlx/blob/main/mlx/io/gguf.cpp, and I am also improving my previous implementation. Should I make a PR for loading models? Thank you

11happy avatar Mar 12 '25 05:03 11happy

@wenjiew The "policy" is that GFIs are under the management of their creators - if you wish to seek more contributors or unassign inactive ones, it's your decision. My role is just to browse the issues every now and then and try to bump people/contributors if they're inactive.

p-wysocki avatar Mar 12 '25 12:03 p-wysocki

@wenjiew @p-wysocki if additional contributors are needed, I am happy to help

Wassim-Hamra avatar Mar 12 '25 22:03 Wassim-Hamra

I am almost done with a working prototype in C++ that creates an inferable OpenVINO IR. It would be great if someone could take a look at how to make an OpenVINO tokenizer from the information available in GGUF.

AlexKoff88 avatar Mar 13 '25 13:03 AlexKoff88

@AlexKoff88 Can I take it?

Wassim-Hamra avatar Mar 13 '25 22:03 Wassim-Hamra

@AlexKoff88 Can I take it?

Sure, very welcome. You can prototype it in Python before going to C++.

AlexKoff88 avatar Mar 14 '25 05:03 AlexKoff88

Got accurate results for the FP16 SmolLM (llama-based) model in https://github.com/openvinotoolkit/openvino.genai/pull/1885

AlexKoff88 avatar Mar 14 '25 10:03 AlexKoff88

Got accurate results for Q8_0 as well.

AlexKoff88 avatar Mar 14 '25 12:03 AlexKoff88

Hi @AlexKoff88,

I see that you've already implemented the conversion of the Hugging Face tokenizer to OpenVINO format in the save_tokenizer function in the POC repo. It looks like the first approach (GGUF → HF → OpenVINO) is already prototyped in Python.

Would you like me to proceed with prototyping the second approach (directly converting GGUF → OpenVINO tokenizer), or should we stick with the easier approach for now? Let me know how you'd like to move forward (this is my understanding of the two approaches; let me know if I missed something).

Also, just to let you know, I’m currently unable to execute the POC because I’m working on a Windows machine, and my Mac is Intel-based, so the MLX library is incompatible with it (only works with Apple Silicon chips).

Wassim-Hamra avatar Mar 14 '25 16:03 Wassim-Hamra

Hi @Wassim-Hamra,

Also, just to let you know, I’m currently unable to execute the POC because I’m working on a Windows machine, and my Mac is Intel-based, so the MLX library is incompatible with it (only works with Apple Silicon chips).

Have you tried to compile OpenVINO GenAI from the PR that I mentioned? It should work on Mac x86; I personally use Linux x64. There is no need to install MLX for the PR's implementation. To experiment with Python, you can use Google Colab, for example.

Regarding the second workflow: yes, the idea is to use just the GGUF file, which contains the tokenizer information, model config and weights. Here we need to parse the tokenizer information and create OpenVINO versions of the two models, the tokenizer and the detokenizer. In the PR, I already implemented the GGUF parser, and you need to add the tokenizer-based part. The entire code should be in C++.
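For orientation, a hedged sketch of the tokenizer-related GGUF metadata a C++ parser would likely need to collect; the key names follow the public GGUF spec, while the struct itself is purely illustrative and not the layout used in the PR.

#include <cstdint>
#include <string>
#include <vector>

// Illustrative container for tokenizer data read from a GGUF file.
struct GgufTokenizerData {
    std::string model;                 // "tokenizer.ggml.model", e.g. "llama" or "gpt2"
    std::vector<std::string> tokens;   // "tokenizer.ggml.tokens" (vocabulary)
    std::vector<float> scores;         // "tokenizer.ggml.scores" (SentencePiece-style models)
    std::vector<int32_t> token_types;  // "tokenizer.ggml.token_type"
    std::vector<std::string> merges;   // "tokenizer.ggml.merges" (BPE models)
    uint32_t bos_token_id = 0;         // "tokenizer.ggml.bos_token_id"
    uint32_t eos_token_id = 0;         // "tokenizer.ggml.eos_token_id"
};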

AlexKoff88 avatar Mar 15 '25 05:03 AlexKoff88

Hey folks, since there's been a lot of development on this issue, I'd like to unassign myself to make room for other interested contributors to jump in and take it forward. @Wassim-Hamra, you are welcome to take this issue. Thank you

11happy avatar Mar 16 '25 06:03 11happy

Hey folks, since there's been a lot of development on this issue, I'd like to unassign myself to make room for other interested contributors to jump in and take it forward. @Wassim-Hamra, you are welcome to take this issue. Thank you

You can take the conversion of Q4_K_M models, as it is the most popular quantization format for GGUF. I added support for Q4_0 and Q4_1 in my PR https://github.com/openvinotoolkit/openvino.genai/pull/1885. My proposal is to up-convert unsupported types to a higher precision, for example Q6 and Q5 to INT8.
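For context, here is a minimal sketch of what Q4_0 dequantization amounts to, assuming the standard llama.cpp block layout (32 weights per block: one fp16 scale followed by 16 bytes of packed 4-bit values, with weight = (nibble - 8) * scale). It is an illustration only, not the code from the PR, and the fp16_to_fp32 helper is assumed to be provided elsewhere (e.g. the fp16 helpers already linked in via gguf-tools).

#include <cstdint>
#include <cstring>
#include <vector>

// Assumed to exist elsewhere (e.g. gguf-tools' fp16 code or ov::float16).
float fp16_to_fp32(uint16_t h);

constexpr size_t QK4_0 = 32;                        // weights per block
constexpr size_t Q4_0_BLOCK_BYTES = 2 + QK4_0 / 2;  // fp16 scale + packed nibbles

std::vector<float> dequantize_q4_0(const uint8_t* data, size_t n_blocks) {
    std::vector<float> out(n_blocks * QK4_0);
    for (size_t b = 0; b < n_blocks; ++b) {
        const uint8_t* block = data + b * Q4_0_BLOCK_BYTES;
        uint16_t raw;
        std::memcpy(&raw, block, sizeof(raw));
        const float d = fp16_to_fp32(raw);
        const uint8_t* qs = block + 2;
        for (size_t i = 0; i < QK4_0 / 2; ++i) {
            // Low nibble holds weight i, high nibble holds weight i + 16.
            out[b * QK4_0 + i]             = ((qs[i] & 0x0F) - 8) * d;
            out[b * QK4_0 + i + QK4_0 / 2] = (((qs[i] >> 4) & 0x0F) - 8) * d;
        }
    }
    return out;
}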

AlexKoff88 avatar Mar 17 '25 11:03 AlexKoff88

We also agreed that @apaniukov will prepare a Python-based POC for directly reading and converting the tokenizer from GGUF to the OpenVINO Tokenizers library format.

AlexKoff88 avatar Mar 18 '25 09:03 AlexKoff88

Hey folks, since there's been a lot of development on this issue, I'd like to unassign myself to make room for other interested contributors to jump in and take it forward. @Wassim-Hamra, you are welcome to take this issue. Thank you

You can take the conversion of Q4_K_M models, as it is the most popular quantization format for GGUF. I added support for Q4_0 and Q4_1 in my PR #1885. My proposal is to up-convert unsupported types to a higher precision, for example Q6 and Q5 to INT8.

Thank you, I will look into Q4_K_M.

11happy avatar Mar 19 '25 08:03 11happy