
Embed yolo files

Open katsu560 opened this issue 1 year ago • 14 comments

Some apps, such as yolov3-tiny, need additional files at run time, for example the label file (coco.names) and the alphabet label images (100_0.png, ...). If these files are embedded in the model (gguf) file and the app reads them from there, the app becomes more portable.

I added the following:

  • a new GGUF_TYPE_NAMEDOBJECT type with a name (file path) and a value (file body), for adding files to gguf
  • NAMEDOBJECT support in gguf-py: constants.py, gguf_reader.py, gguf_writer.py
    • please see the pull request to llama.cpp
  • a gguf-addfile.py script that adds files to a gguf file
    • files are added as NAMEDOBJECT (general.namedobject.N), or as a NAMEDOBJECT array (general.namedobject[N]) with the --array option
  • NAMEDOBJECT support in ggml: ggml.h, ggml.c
  • yolov3-tiny now reads coco.names and the alphabet labels from the gguf file
    • it first tries the gguf file, then falls back to reading from disk if that fails
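The gguf-first, file-second lookup in the last bullet can be sketched as follows. This is only an illustration: `load_labels`, `read_labels`, and the `embedded` parameter are hypothetical names, and the embedded bytes are passed in directly instead of being fetched through any ggml call.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one label per line from any input stream.
static std::vector<std::string> read_labels(std::istream & in) {
    std::vector<std::string> labels;
    std::string line;
    while (std::getline(in, line)) {
        labels.push_back(line);
    }
    return labels;
}

// Hypothetical helper: `embedded` points to the bytes found in the gguf file,
// or is nullptr when the gguf file has no such entry.
std::vector<std::string> load_labels(const std::string * embedded, const std::string & path) {
    if (embedded != nullptr) {
        std::istringstream in(*embedded);  // 1) prefer the embedded copy
        return read_labels(in);
    }
    std::ifstream in(path);                // 2) fall back to the file on disk
    if (!in) {
        return {};                         // neither source is available
    }
    return read_labels(in);
}
```

With this shape, a gguf file without embedded data behaves as before, because the reader falls back to the file on disk.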

A NAMEDOBJECT consists of a name (file path) and a value (file body):

    struct gguf_nobj {
        uint64_t nname;  // length of name
        char   * name;   // name in utf8
        uint64_t n;      // length of data in bytes
        char   * data;   // data body (file body)
    };

Function usage:

    struct gguf_nobj gguf_find_name_nobj(const struct gguf_context * ctx, const char * name)

Call gguf_find_name_nobj() with a gguf_context pointer (ctx) and a UTF-8 encoded name, such as a file name. It searches for the NAMEDOBJECT called 'name' and returns a struct gguf_nobj. If nothing is found, it returns gguf_nobj(0, NULL, 0, NULL), so nobj.n == 0 means 'not found'. If found, nobj.name holds the name, nobj.n holds the length of nobj.data, and nobj.data holds the data bytes.

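    // look up the embedded file by name; n == 0 means 'not found'
    struct gguf_nobj nobj = gguf_find_name_nobj(ctx, filename);
    if (nobj.n == 0) {
        return false;
    }
    // membuf (not shown here) must derive from std::streambuf, since std::istream takes a streambuf pointer
    membuf buf(nobj.data, nobj.data + nobj.n);
    std::istream file_in(&buf);
    if (!file_in) {
        return false;
    }
    // read one label per line
    std::string line;
    while (std::getline(file_in, line)) {
        labels.push_back(line);
    }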
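The snippet above relies on a membuf type that is not shown in this post. A minimal sketch, assuming it is a read-only std::streambuf over a byte range (the actual definition in the pull request may differ; `read_lines` is a hypothetical wrapper added only for illustration):

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Assumed shape of membuf: exposes the bytes in [begin, end) as a get area.
struct membuf : std::streambuf {
    membuf(const char * begin, const char * end) {
        // setg() takes non-const pointers; the buffer is only ever read.
        char * b = const_cast<char *>(begin);
        char * e = const_cast<char *>(end);
        setg(b, b, e);
    }
};

// Hypothetical wrapper: split an in-memory byte range into lines.
std::vector<std::string> read_lines(const char * data, std::size_t n) {
    membuf buf(data, data + n);
    std::istream in(&buf);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Because the stream reads directly from the supplied range, the data is not copied, but the range must outlive the stream.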

Script usage:

    python3 gguf-addfile.py [--array] input-gguf-file output-gguf-file files ...

  • without options, files are added as NAMEDOBJECT (general.namedobject.N)
  • with the --array option, files are added as a NAMEDOBJECT array (general.namedobject[N])

katsu560 avatar May 19 '24 11:05 katsu560

Is it really necessary to add a new type of object to the GGUF format to do this? The file data could be stored either as array metadata or as a tensor.

slaren avatar May 19 '24 15:05 slaren

I agree with @slaren - I don't think it's necessary to introduce a named object. But the rest of the idea, embedding the data in the GGUF file, is nice.

ggerganov avatar May 19 '24 15:05 ggerganov

Thanks for the prompt review, @slaren and @ggerganov. If the current data structures can handle embedded files, I agree that NAMEDOBJECT should not be added.

But I think embedding a file needs three elements: the path name string (a GGUF string has two parts, a length and a byte stream), the length of the data, and the data stream. A key string plus a GGUF_TYPE_STRING value does provide a string, a length, and a byte stream. However, if we use the key string as the path name, we cannot embed a file whose name matches an existing key, such as general.name, general.version, or tokenizer.chat_template. Also, some code may expect a string body to contain no NUL byte, whereas a file body can contain NUL bytes (\0). That is why I added NAMEDOBJECT as a new type.
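The NUL-byte concern can be illustrated with a short sketch. It assumes nothing about ggml; `from_bytes` and `c_length` are hypothetical helpers that contrast a string carried with an explicit length against the same bytes seen through a NUL-terminated C-string API.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Build a length-carrying string from raw bytes, keeping embedded NUL bytes.
std::string from_bytes(const char * data, std::size_t n) {
    return std::string(data, n);
}

// Length as a NUL-terminated C-string API would see it.
std::size_t c_length(const std::string & s) {
    return std::strlen(s.c_str());
}
```

For a byte sequence such as 'P','N','G','\0','I','H','D','R', the explicit length is 8 while the C-string view stops at 3. File contents stored in a string value are therefore safe only if every reader uses the stored length.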

katsu560 avatar May 19 '24 17:05 katsu560

You can store the data in a UINT8 array, ~~if you need to store the path too you can store it as an array of arrays, i.e. [[path, data], [path, data]], though I'm unsure if you would then have to store the path as UINT8 too, or if it's allowed to have mixed data?~~ Probably best to store path and data in separate entries.

CISC avatar May 19 '24 18:05 CISC

You can, for example, store a KV string array of filenames and, for each filename, a U8 tensor containing that file's binary data:

  • "embedded_files": ["my-file.dat", "another-file.bin"]
  • tensors:
    • "my-file.dat"
    • "another-file.bin"
    • ...
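The layout suggested above can be modeled in a few lines. This is a hypothetical in-memory stand-in, not the ggml API: `embedded_store`, `add_file`, and `find_file` are illustrative names, with a vector playing the role of the "embedded_files" KV string array and a map playing the role of the U8 tensors keyed by name.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a gguf file that carries embedded files.
struct embedded_store {
    std::vector<std::string> embedded_files;                   // KV string array of file names
    std::map<std::string, std::vector<std::uint8_t>> tensors;  // one U8 tensor per file, keyed by name
};

// Register a file: list its name and store its bytes under the same name.
void add_file(embedded_store & s, const std::string & name, const std::vector<std::uint8_t> & bytes) {
    s.embedded_files.push_back(name);
    s.tensors[name] = bytes;
}

// Returns nullptr unless the name is listed AND a tensor with that name exists.
const std::vector<std::uint8_t> * find_file(const embedded_store & s, const std::string & name) {
    auto listed = std::find(s.embedded_files.begin(), s.embedded_files.end(), name);
    if (listed == s.embedded_files.end()) {
        return nullptr;
    }
    auto it = s.tensors.find(name);
    return it == s.tensors.end() ? nullptr : &it->second;
}
```

In this model a missing file (nullptr) is distinguishable from an empty file (a zero-length vector), and binary data never passes through a string type.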

ggerganov avatar May 20 '24 06:05 ggerganov

Thanks for the comments, and sorry for my late response; work has been busy. I am looking into other approaches this weekend, such as an array of arrays or tensor data.

katsu560 avatar May 23 '24 23:05 katsu560

Finally, I added the file data as follows:

  • store the file path as the key, prefixed with '/' to avoid conflicts with other key names. For example, the file 'data/coco.names' is stored as '/data/coco.names', and the absolute path '/a/b/c' is stored as '//a/b/c'.
  • store the file contents as a GGUF_TYPE_STRING value.

So I deleted all of the NAMEDOBJECT code.
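The key naming rule can be written down as a small sketch. The function names are hypothetical and only restate the rule from this comment: one leading '/' is added when storing and removed when reading back.

```cpp
#include <string>

// Map a file path to its gguf key by prefixing one '/'.
std::string path_to_key(const std::string & path) {
    return "/" + path;
}

// Keys that start with '/' are treated as embedded files; ordinary keys
// such as general.name never start with '/'.
bool is_embedded_file_key(const std::string & key) {
    return !key.empty() && key[0] == '/';
}

// Recover the original path; returns an empty string for non-file keys.
std::string key_to_path(const std::string & key) {
    return is_embedded_file_key(key) ? key.substr(1) : std::string();
}
```

Since exactly one character is added and removed, relative and absolute paths both round-trip unchanged.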

katsu560 avatar May 31 '24 21:05 katsu560

I also removed the dump code from the gguf-addfile.py script.

Script usage example: python3 gguf-addfile.py path/to/yolov3-tiny.gguf yolov3-tiny-addfiles.gguf data/coco.names data/labels/*

katsu560 avatar May 31 '24 21:05 katsu560

I revised the code to add files as tensor data. I also applied your suggestions.

I will try to update ci/run.sh later.

katsu560 avatar Jun 15 '24 16:06 katsu560

I added two functions to ggml.c: gguf_get_tensor_size and gguf_find_key_array. I think this is the minimal addition.

katsu560 avatar Jun 22 '24 04:06 katsu560

I also revised ci/run.sh. I added test code that creates a gguf file and runs yolov3-tiny to test reading files from it.

katsu560 avatar Jun 22 '24 19:06 katsu560

I fixed the gguf-addfile.py script:

  • fix copying of key-values other than embedded_files
  • refactor code
  • remove unused code
  • check for overwriting the output file
  • add --force option

katsu560 avatar Jun 23 '24 14:06 katsu560

Deleted gguf_find_key_array() and related code from examples/yolo/yolov3-tiny.cpp. Please confirm.

katsu560 avatar Jun 25 '24 18:06 katsu560

Thank you for checking the code. I applied some minor changes. Please check.

katsu560 avatar Jul 13 '24 20:07 katsu560