ggml
Embed yolo files
Some apps, such as yolov3-tiny, need additional files to run, for example the label file (coco.names) and the alphabet label images (100_0.png, ...). If these files are embedded in the model (gguf) file and the app reads them from there, the app becomes more portable.
I added the following:
- added a new GGUF_TYPE_NAMEDOBJECT type with a name (file path) and a value (file body) for adding files to gguf
- extended gguf-py to support NAMEDOBJECT: constants.py, gguf_reader.py, gguf_writer.py
  - please see the pull request to llama.cpp
- added the gguf-addfile.py script to add files to a gguf file
  - it adds files as NAMEDOBJECT (general.namedobject.N), or as a NAMEDOBJECT array (general.namedobject[N]) with the --array option
- extended ggml to support NAMEDOBJECT: ggml.h, ggml.c
- extended yolov3-tiny to read coco.names and the alphabet labels from the gguf file
  - it reads from gguf first, then falls back to reading from a file if that fails
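The gguf-first, file-second lookup described in the last bullet could be sketched roughly as follows. This is only an illustration under stated assumptions: `embedded_lookup`, `read_lines`, and `load_labels` are hypothetical names that do not come from the PR, and the callback stands in for whatever gguf lookup the app actually uses.

```cpp
#include <fstream>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one entry per line from any input stream.
static std::vector<std::string> read_lines(std::istream & in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical callback: fills `out` with the embedded bytes for `name`
// and returns true, or returns false if the model file has no such entry.
using embedded_lookup = std::function<bool(const std::string & name, std::string & out)>;

// Try the embedded copy first; fall back to the file system on a miss.
static bool load_labels(const embedded_lookup & from_gguf,
                        const std::string & path,
                        std::vector<std::string> & labels) {
    std::string bytes;
    if (from_gguf(path, bytes)) {
        std::istringstream in(bytes);
        labels = read_lines(in);
        return true;
    }
    std::ifstream file(path);
    if (!file) {
        return false; // neither embedded nor present on disk
    }
    labels = read_lines(file);
    return true;
}
```

Keeping the file-system fallback means a gguf file without embedded data still works as before.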
A NAMEDOBJECT consists of a name (file path) and a value (file body):
struct gguf_nobj {
    uint64_t nname; // length of name
    char * name;    // name in utf8
    uint64_t n;     // length of data in bytes
    char * data;    // data body (file body)
};
function usage:
struct gguf_nobj gguf_find_name_nobj(const struct gguf_context * ctx, const char * name)
Call gguf_find_name_nobj() with a const struct gguf_context * ctx and a const char * name. ctx is the gguf_context pointer; name is a UTF-8 encoded string such as a file name. The function searches for the NAMEDOBJECT called name and returns a struct gguf_nobj. If it is not found, it returns gguf_nobj(0, NULL, 0, NULL), so nobj.n == 0 means 'not found'. If it is found, nobj.name holds the name, nobj.n holds the length of nobj.data, and nobj.data holds the data bytes.
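// Look up the embedded file by name; n == 0 means it was not found.
struct gguf_nobj nobj = gguf_find_name_nobj(ctx, filename);
if (nobj.n == 0) {
    return false;
}
// Wrap the raw bytes in a stream buffer so they can be read like a file.
membuf buf(nobj.data, nobj.data + nobj.n);
std::istream file_in(&buf);
if (!file_in) {
    return false;
}
// Each line of the embedded file is one label.
std::string line;
while (std::getline(file_in, line)) {
    labels.push_back(line);
}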
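The snippet above relies on a `membuf` type that is not shown in this thread. One plausible shape for it, assuming it is a thin read-only `std::streambuf` over an existing byte range, is sketched below. The definition and the `lines_from_memory` helper are illustrative assumptions, not the code from the PR.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Minimal read-only stream buffer over an existing byte range.
// The bytes are not copied, so they must outlive the membuf.
struct membuf : std::streambuf {
    membuf(const char * begin, const char * end) {
        // setg() takes non-const pointers; the get area is only read from.
        char * b = const_cast<char *>(begin);
        char * e = const_cast<char *>(end);
        setg(b, b, e);
    }
};

// Hypothetical helper: split an in-memory text blob into lines.
static std::vector<std::string> lines_from_memory(const char * data, size_t n) {
    membuf buf(data, data + n);
    std::istream in(&buf);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Because the buffer only points at the data, reading labels this way avoids copying the file body into a temporary string first.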
script usage:
python3 gguf-addfile.py [--array] input-gguf-file output-gguf-file files ...
- add files as NAMEDOBJECT (general.namedobject.N)
- add files as NAMEDOBJECT array (general.namedobject[N]) with --array option
Is it really necessary to add a new type of object to the GGUF format to do this? The file data could be stored either as array metadata or as a tensor.
I agree with @slaren - I don't think it's necessary to introduce a named object. But the rest of the idea, embedding the data in the GGUF file, is nice.
Thanks for the prompt review, @slaren and @ggerganov. If the current data structures are sufficient for embedding files, I agree that NAMEDOBJECT does not need to be added.
However, I think embedding a file needs three elements: the path name (a GGUF string, which has two parts, a length and a byte stream), the length of the data, and the data stream. A key string plus a GGUF_TYPE_STRING value provides a string, a length, and a byte stream. But if we use the key string as the path name, we can't embed a file whose name matches an existing key, such as general.name, general.version, or tokenizer.chat_template. Also, readers may expect a string body to contain no NUL bytes, whereas a file body can contain NUL bytes (\0). That is why I added the new NAMEDOBJECT type.
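The NUL-byte concern can be illustrated without any GGUF code. The sketch below (both function names are hypothetical) shows how a consumer that treats a buffer as a C string sees only the bytes before the first `\0`, while one that carries an explicit length sees the whole file body.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length seen by code that treats the bytes as a NUL-terminated C string:
// counting stops at the first '\0'.
static size_t length_as_c_string(const char * data) {
    return std::strlen(data);
}

// Copy made by code that is told the real length: embedded '\0' bytes survive.
static std::string copy_with_length(const char * data, size_t n) {
    return std::string(data, n);
}
```

For a binary body such as `{'P', 'N', 'G', '\0', 1, 2}`, the first function reports 3 bytes while the second keeps all 6, which is why the data length has to travel with the data.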
You can store the data in a UINT8 array, ~~if you need to store path too you can store it as an array of arrays, ie: [[path, data], [path, data]], though I'm unsure if you would then have to store path as UINT8 too, or if it's allowed to have mixed data?~~ Probably best to store path and data in separate entries.
You can, for example, store a KV string array with the filenames and, for each filename, a U8 tensor containing that file's binary data:
- "embedded_files": ["my-file.dat", "another-file.bin"]
- tensors:
  - "my-file.dat"
  - "another-file.bin"
  - ...
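To make the suggested layout concrete, here is a small self-contained model of it. `model_file` and `find_embedded` are hypothetical stand-ins for a loaded GGUF file and are not part of the ggml API; the point is only that the "embedded_files" list decides which tensors count as files.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a loaded GGUF file (not the ggml API).
struct model_file {
    std::vector<std::string> embedded_files;             // KV string array
    std::map<std::string, std::vector<uint8_t>> tensors; // tensor name -> U8 data
};

// Return the bytes of an embedded file, or nullptr if `name` is not listed
// in "embedded_files" or has no matching tensor.
static const std::vector<uint8_t> * find_embedded(const model_file & m,
                                                  const std::string & name) {
    for (const std::string & listed : m.embedded_files) {
        if (listed == name) {
            auto it = m.tensors.find(name);
            return it == m.tensors.end() ? nullptr : &it->second;
        }
    }
    return nullptr;
}
```

Checking the list first keeps ordinary model tensors from being mistaken for embedded files, and the tensor's own size supplies the data length, so NUL bytes in the file body are not a problem.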
Thanks for the comments, and sorry for my late response; I have been busy with work. This weekend I looked into other approaches, such as an array of arrays and tensor data.
Finally, I added the file data as follows:
- store the file path as the key, prefixed with '/' to avoid conflicts with other key names. For example, the file 'data/coco.names' is stored as '/data/coco.names', and the absolute path '/a/b/c' is stored as '//a/b/c'.
- store the file contents as the value of a GGUF_TYPE_STRING.
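The key-naming rule from the first bullet is simple enough to write down directly. In this sketch the function names are hypothetical; only the '/' prefix convention and the two example paths come from the description above.

```cpp
#include <string>

// Build the metadata key for an embedded file: always prepend one '/'.
static std::string embedded_key(const std::string & path) {
    return "/" + path;
}

// Keys that start with '/' are treated as embedded files; regular keys
// such as "general.name" never start with '/'.
static bool is_embedded_key(const std::string & key) {
    return !key.empty() && key[0] == '/';
}

// Recover the original path by dropping the single leading '/'.
static std::string path_from_key(const std::string & key) {
    return is_embedded_key(key) ? key.substr(1) : key;
}
```

Because exactly one '/' is added, an absolute path keeps its own leading slash and round-trips unchanged.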
So I deleted all of the NAMEDOBJECT parts.
I also removed the dump code from the gguf-addfile.py script.
Script usage example: python3 gguf-addfile.py path/to/yolov3-tiny.gguf yolov3-tiny-addfiles.gguf data/coco.names data/labels/*
I revised the code to add files as tensor data. I also applied your suggestions.
I will try to update ci/run.sh later.
I added two functions to ggml.c, gguf_get_tensor_size and gguf_find_key_array. I think this is the minimum addition needed.
I also revised ci/run.sh: I added test code that creates a gguf file and uses yolov3-tiny to test reading files from it.
I fixed the gguf-addfile.py script:
- fix copying of key-values other than embedded_files
- refactor code
- remove unused code
- check for overwriting the output file
- add --force option
I deleted gguf_find_key_array() and the related code from examples/yolo/yolov3-tiny.cpp. Please confirm.
Thank you for checking the code. I applied some minor changes. Please check.