llama.cpp
convert-*.py: autogenerate general.uuid if missing
This PR was split from https://github.com/ggerganov/llama.cpp/pull/7499 as this required more thought before being included.
The basic idea is that if a UUID is not provided, we may want to automatically generate one deterministically from the tensor data (so that regenerating the file gives the same UUID).
At the moment, when generating a new GGUF model file, it will add this extra console line:
INFO:hf-to-gguf:generating general.uuid 86b1ebff-d754-50fc-9245-d23fe329817c
Then when you dump the GGUF you can see it's in the KV store:
13: STRING | 1 | general.uuid = '86b1ebff-d754-50fc-9245-d23fe329817c'
Just note that this method won't detect models that are semantically the same but quantized differently.
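For reference, here is a minimal sketch of the deterministic approach described above (the helper name and namespace are illustrative only, not the exact code in this PR): hash the tensor data in a fixed order and derive a UUIDv5 from the digest, so the same input always yields the same UUID.

```python
# Illustrative sketch only -- not the exact implementation in this PR.
import hashlib
import uuid

import numpy as np

# Hypothetical namespace for model UUIDs (any fixed UUID works for the example).
EXAMPLE_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")


def deterministic_model_uuid(tensors: dict[str, np.ndarray]) -> str:
    """Derive a UUIDv5 from a digest of all tensor bytes, so regenerating the
    file from the same source data always yields the same UUID."""
    digest = hashlib.sha256()
    for name in sorted(tensors):  # fixed iteration order keeps the hash reproducible
        digest.update(name.encode("utf-8"))
        digest.update(tensors[name].tobytes())
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, digest.hexdigest()))
```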
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [X] Low
- [ ] Medium
- [ ] High
@compilade I recall you had an observation about potential issues with autogenerating UUIDs. (Also, add me to any PR relating to authorship metadata that you said you may want to adjust.)
> @compilade I recall you had an observation about potential issues with autogenerating UUIDs
@mofosyne Yes, there are possible problems.
- Should the UUID of a model be the same if it's converted to `f32`, `f16`, `bf16`, or `q8_0`?
  - Why or why not?
- Should `llama-quantize` keep the UUID intact? (I think that yes)
  - Is this consistent with the previous point?
I would like to keep the equivalence of `convert_hf_to_gguf.py --outtype q8_0` with `llama-quantize model-f32.gguf model-q8_0.gguf q8_0`.
@compilade Well with the current technique of just hashing all the tensors... that's not quite possible at the moment.
Also, source models can now be tracked via `general.source.uuid`, so this might not be much of an issue anymore.
I think if people here want the 'uuid' to refer to a semantic model ID... then maybe we could copy the ID when converting? But if generating a new model from scratch... fine-tuning... or merging models, then generate a new UUID.
Anyway, one argument against this is that the process of 'down conversion' will lead to a difference in response, so while it is derived from the same single parent as a converted model... it's still a distinct model in itself.
Ultimately, my conception with this... is that eventually we will be able to autogenerate a tree of life. And you could argue that the 'down converted' models are the leaves of each model branch? (It'd be nice if someone could create a visualization too.)
I think you've got a point about the lazy-loading nature of this script and how it will cause problems @compilade
Perhaps this is more of an argument to close this PR and figure out a different approach to uuid generation.
E.g. maybe we could mandate that, on generation of any new model, a UUIDv7 or UUIDv4 is generated for it. But for conversion of models, we would only copy the UUID (or, if we deem a quantized version to be a new model, derive a UUIDv5 from the source model's UUID). Unsure what to do if the source model lacks an ID; maybe don't generate one? A sketch of this scheme is below.
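A minimal sketch of that scheme (the function names and the "quantized" label are hypothetical, purely to illustrate the copy-vs-derive distinction):

```python
# Hypothetical sketch of the scheme above: new models get a random UUID, pure
# conversions copy the source UUID, and "new" derived models get a deterministic
# UUIDv5 keyed on the source UUID. Names and the "quantized" label are illustrative.
import uuid
from typing import Optional


def uuid_for_converted_model(source_uuid: Optional[str], is_new_model: bool) -> Optional[str]:
    if source_uuid is None:
        # Source model lacks an ID: per the open question above, don't invent one.
        return None
    if not is_new_model:
        # A plain conversion keeps the source model's identity.
        return source_uuid
    # Deterministic derivation: the same source UUID always yields the same result.
    return str(uuid.uuid5(uuid.UUID(source_uuid), "quantized"))


def uuid_for_new_model() -> str:
    # Training / fine-tuning / merging from scratch: just generate a random UUIDv4.
    return str(uuid.uuid4())
```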
@mofosyne Hashing the source tensors could work without making the memory usage too high (because they are mmap-ed), and would also solve the other equivalence problems, since the semantics of the UUID would be about where the model came from, so `llama-quantize` can leave it unchanged.
The CPU overhead of hashing might make conversion slower, though, since it's all single-threaded, and the I/O operations are blocking (nothing else is done when reading and when writing).
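As a rough illustration of that trade-off (the `.npy` file here just stands in for whatever mmap-backed source format the converter reads): an mmap-ed array can be hashed chunk by chunk without pulling the whole tensor into RAM, but the digest loop itself is single-threaded CPU work that nothing else overlaps with.

```python
# Rough illustration only: hashing an mmap-ed array keeps memory usage low,
# but the digest update loop is single-threaded CPU work.
import hashlib

import numpy as np

arr = np.load("some_tensor.npy", mmap_mode="r")  # data stays on disk until touched
flat = arr.reshape(-1)                           # flat view, still mmap-backed
digest = hashlib.sha256()
chunk = 1 << 20                                  # hash ~1M elements at a time
for start in range(0, flat.size, chunk):
    digest.update(flat[start:start + chunk].tobytes())
print(digest.hexdigest())
```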
@compilade you mean like `generate_source_tensors_uuid()` in this? (I've set this to draft and added `generate_source_tensors_uuid()` just for illustrative purposes.)
For the 'source', I found I can't just straight up hash the pytorch tensors but had to convert them into a numpy format first. I've added a 64-bit type, to at least capture any larger pytorch values (unless it makes sense to stick to f32 or f16).
I've noticed that setting the output to f32 doesn't produce the same UUID, even when I set `data_torch.to(torch.float32).squeeze().numpy()` in `generate_source_tensors_uuid()`, so I'm unsure what's going on here.
Still inclined to call it quits and close this PR unless there is actually a working solution I can think of. It still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it. (E.g. maybe add a `--publish` flag for any training program, which would then generate a random UUID for it?)
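For reference, this is roughly the shape of the helper being discussed; a hedged reconstruction for illustration, not the actual code from the draft:

```python
# Hedged reconstruction of what generate_source_tensors_uuid() might look like;
# not the actual code from the draft PR.
import hashlib
import uuid

import torch


def generate_source_tensors_uuid(tensors: dict[str, torch.Tensor]) -> str:
    digest = hashlib.sha256()
    for name in sorted(tensors):  # fixed order so the hash is reproducible
        data = tensors[name]
        if data.dtype == torch.bfloat16:
            data = data.to(torch.float32)  # bf16 has no numpy dtype, so cast upward
        # .numpy() shares memory with the torch tensor, so no extra copy is kept
        digest.update(name.encode("utf-8"))
        digest.update(data.numpy().tobytes())
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest.hexdigest()))
```

Note that the upward cast changes the exact bits that get hashed, which is consistent with the point below about why an f32-converted output cannot reproduce the same UUID as the original f16/bf16 data.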
> you mean like `generate_source_tensors_uuid()` in this?
@mofosyne Yes, pretty much. This reads the whole source tensors twice (so it's slow), but I don't really see a way around that considering metadata is written before the tensor data.
> For the 'source', I found I can't just straight up hash the pytorch tensors but had to convert them into a numpy format first.
This is not a problem, because using `data_torch.numpy()` shares the same memory, even if the source is mmap-ed.
> I've noticed that setting the output to f32 doesn't produce the same UUID, even when I set `data_torch.to(torch.float32).squeeze().numpy()` in `generate_source_tensors_uuid()`, so I'm unsure what's going on here.
No need to change the type when hashing; doing so makes it impossible to directly use mmap-ed tensors. But since the tensor objects are not kept afterwards, either way could still work without using too much memory. (Also, `squeeze` doesn't affect the data, only the shape, so it's not necessary here.)
I also think ignoring tensors is not necessary for the purpose of hashing the source.
It's normal that it doesn't result in the same UUID when converting to f32 vs keeping the original type, because you're giving different bits to the hashing function.
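A tiny concrete illustration of the "different bits" point (the values are arbitrary): the same numbers stored as f16 vs f32 have different byte representations, so their digests, and hence any UUID derived from them, differ.

```python
# Same values, different dtypes => different bytes => different digests.
import hashlib

import numpy as np

x16 = np.array([1.0, 0.5, -2.0], dtype=np.float16)
x32 = x16.astype(np.float32)

print(x16.tobytes().hex())  # 2 bytes per element (f16 bit patterns)
print(x32.tobytes().hex())  # 4 bytes per element, entirely different bits
print(hashlib.sha256(x16.tobytes()).hexdigest())
print(hashlib.sha256(x32.tobytes()).hexdigest())
```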
> Still inclined to call it quits and close this PR unless there is actually a working solution I can think of.
While this would work, there's the overhead of reading all the tensors twice which is hard to avoid. Making conversion twice as slow on low-RAM systems isn't desirable. If we can think of a solution around that, this would be more useful.
> Still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it. (E.g. maybe add a `--publish` flag for any training program, which would then generate a random UUID for it?)
Ideally, yes, but in practice, this would be very hard to enforce, and I'm pretty sure the big model makers like Meta and Mistral will totally ignore this because they're likely using custom training frameworks. (And/or using special-purpose frameworks like states-spaces/mamba to train mamba-codestral-7B-v0.1, which (at least on release) doesn't even have the standard config.json)
Alternatively, it might be possible to get hashes of the model files directly from Git, but this would not give the same result for `pytorch_model.bin` vs `model.safetensors` of the same model, unlike hashing the actual tensors.
@compilade yeah, I just spotted what you meant. For now, regarding bf16 having no direct numpy mapping, I've just cast it upwards. It's compiling. At least this should work as long as the safetensors are only in float16, float32, float64, or bfloat16.
Also, this doesn't address the lazy tensor issue... but I'll be happy to try and apply any changes needed if you have suggestions.
Closing, as I cannot think of any good justification for this feature due to the potential issues with an autogenerated UUID. Best to make it optional.