convert: add tensor hash general.hash.sha256 to kv store
Since autogeneration of a UUID was a bit controversial, I adapted that logic into a straight-up sha256 tensor hash, stored in the KV store as `general.hash.sha256`.
While there are already other choices like xxhash and sha1, in this context I think we get better value from a well-known, strong cryptographic hash like sha256. I also think this would pave the way for self-signed GGUF files, so you can be sure a file came from a known entity.
I also thought about a per-tensor-layer hash, but I'm not sure how useful that would be yet, as per-layer hashing seems to be more of a developer debugging tool at this stage. So it's best to stick with hashing the tensor data as a whole.
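For context, this is roughly what the conversion side looks like; a minimal sketch assuming gguf-py's `GGUFWriter`, with the helper name and tensor iteration being illustrative rather than the actual convert_hf_to_gguf.py code:

```python
import hashlib
import numpy as np
import gguf

# Hypothetical sketch: one running sha256 over all tensor data, in write order,
# stored as a hex string under general.hash.sha256.
def add_tensor_hash(writer: gguf.GGUFWriter, tensors: list[tuple[str, np.ndarray]]) -> None:
    h = hashlib.sha256()
    for _name, data in tensors:  # iteration order must match the file's tensor order
        h.update(data.tobytes())
    writer.add_string("general.hash.sha256", h.hexdigest())
```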
For model repo maintainers like Hugging Face, this has immediate use: it lets them track models even when the KV metadata has been updated (e.g. fixing authorship metadata).
For anyone interested, you might want to add logic to llama-gguf-hash to self-check a GGUF's tensor data when this hash is present in the KV store. I opted against doing it here as I wasn't sure of the utility yet, and it would be more work than this current PR.
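If someone does pick that up, here is a rough sketch of such a self-check in Python using gguf-py's `GGUFReader` (the string-field extraction and whether this exactly reproduces llama-gguf-hash's hashing order are assumptions to verify):

```python
import hashlib
import sys
from gguf import GGUFReader

def self_check(path: str) -> bool:
    reader = GGUFReader(path)
    field = reader.fields.get("general.hash.sha256")
    if field is None:
        print("no general.hash.sha256 in the KV store")
        return False
    # String KV extraction, assuming the ReaderField parts/data layout
    stored = bytes(field.parts[field.data[0]]).decode("utf-8")
    h = hashlib.sha256()
    for tensor in reader.tensors:  # hash raw tensor data in file order
        h.update(tensor.data.tobytes())
    ok = h.hexdigest() == stored
    print("match" if ok else "mismatch")
    return ok

if __name__ == "__main__":
    self_check(sys.argv[1])
```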
Testing process I followed:
During conversion you now get this new printout:
```
INFO:hf-to-gguf:blk.7.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:tensor hash (sha256): 8b3e00226cc2a55398b1ffbda7af8464040f9cd7b22ccbef8ba60b227924a2b1
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
```
Checked with `gguf-dump --markdown` that the new entry shows up:
```
| 4 | STRING | 1 | general.architecture | `llama` |
| 5 | STRING | 1 | general.type | `model` |
| 6 | STRING | 1 | general.hash.sha256 | `8b3e00226cc2a55398b1ffbda7af84`...`0f9cd7b22ccbef8ba60b227924a2b1` |
| 7 | STRING | 1 | general.name | `TinyLLama` |
| 8 | STRING | 1 | general.author | `Maykeye` |
```
Checked that the sha256 is consistent with llama-gguf-hash:
```
llama-gguf-hash --all --no-layer TinyLLama-4.6M-v0.0-F16.gguf
xxh64   cbd383cfd4c897e6                                                  TinyLLama-4.6M-v0.0-F16.gguf
sha1    a9de42f2bbeee1eba49bc39b25cf69ff7a0937f6                          TinyLLama-4.6M-v0.0-F16.gguf
sha256  8b3e00226cc2a55398b1ffbda7af8464040f9cd7b22ccbef8ba60b227924a2b1  TinyLLama-4.6M-v0.0-F16.gguf
```
So at least the sha256 process appears consistent. The logic is similar to my autogenerated-UUID attempt, which was also consistent, so an error is less likely to creep in here.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [X] Low
- [ ] Medium
- [ ] High
@mofosyne
I agree with @Galunid regarding the overhead (both CPU-wise and memory-wise).
This also has the exact same problems as the UUID autogeneration: an F32 model quantized to Q8_0 (with llama-quantize) would not have the same hash as a model converted with --outtype q8_0, even though the tensor contents are actually equal (this was at least the case in #7234, and should still be true on master).
If you truly want this to work as an integrity check, then llama-quantize should update the hash; otherwise it would never match the weights of the files most people actually use.
But the way llama-quantize is structured, this is not easy to do, because it writes the header completely before beginning to quantize the tensors (I think?), so the resulting data is not known beforehand, unless it's all kept in memory.
Another thing: doesn't `hashlib` need to store all the tensors in memory to calculate the hash?
@Galunid No, hashlib by itself doesn't, because hashing functions usually work in blocks and so only the inner state of the hash needs to be kept in memory.
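To illustrate, a quick hashlib example showing that chunked updates give the same digest as hashing everything at once, with only the hasher's fixed-size state kept between updates:

```python
import hashlib

h = hashlib.sha256()
for chunk in (b"a" * 4096, b"b" * 4096, b"c" * 4096):
    h.update(chunk)  # each chunk can be freed after it has been fed in

assert h.hexdigest() == hashlib.sha256(b"a" * 4096 + b"b" * 4096 + b"c" * 4096).hexdigest()
```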
But in this case, reading the tensor data from a LazyNumpyTensor materializes it, and so yes, this would put all the tensors in memory, since they are only freed when writing them to a file, which is done after writing the metadata. (In GGUFWriter, the tensors are normally only materialized when writing them, since (usually) nothing reads their data before that)
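To make the materialization point concrete, here is a toy stand-in (not the real LazyNumpyTensor, just an illustration of the memory behavior under that assumption):

```python
import hashlib
import numpy as np

class ToyLazyTensor:
    """Stand-in for a lazy tensor: data is created on first access and then kept alive."""
    def __init__(self, shape):
        self._shape = shape
        self._data = None

    @property
    def data(self) -> np.ndarray:
        if self._data is None:
            self._data = np.zeros(self._shape, dtype=np.float32)  # materialization
        return self._data

tensors = [ToyLazyTensor((1024, 1024)) for _ in range(8)]

# Hashing before the write phase reads every tensor's data, so all of them
# become (and stay) resident in memory at once -- the regression described above.
h = hashlib.sha256()
for t in tensors:
    h.update(t.data.tobytes())
```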
An eventual solution would be to put metadata at the end of GGUF model files, which would also help with editing metadata without rewriting all of the data (good for tokenizer fixes too). But this requires deeper format changes, although it might be possible to fit this backward-compatibly into the existing GGUF v3. (if you have ideas for this, feel free to explore them)
But as this PR is now, I think it has the following problems:
- `convert_hf_to_gguf.py --outtype q8_0` and `llama_quantize model-F32.gguf model-Q8_0.gguf q8_0` would not result in exactly the same files
- This would cause a big memory regression for lazy conversion by making it the same as `--no-lazy`
@Galunid the sha256 sum by itself will not consume memory, as @compilade said; it keeps a running hash as bytes come into it.
However, I see your point about the impact on lazy loading, and I don't see any way around it, so I'm inclined to close this PR. Maybe, if they really need it, maintainers could just run llama-gguf-hash when loading models into their database.
@compilade regarding the idea of extending the end of the GGUF file: GG is heavily against it unless it is truly unavoidable, as he would prefer ensuring backwards compatibility via the KV store. That's not to say it won't happen in the future, but if we do it then we'd better have a good reason... or spin off a new file format standard not encumbered by the past (if so, I'd suggest using CBOR over inventing our own structured format for metadata... and of course sticking the metadata at the end like you suggest).
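For what it's worth, here is a tiny sketch of what CBOR-encoded metadata could look like, using the third-party cbor2 package (purely illustrative; not a proposal for an actual trailer layout):

```python
import cbor2  # third-party: pip install cbor2

# Encode a KV-style metadata dict as a compact binary blob...
metadata = {
    "general.architecture": "llama",
    "general.name": "TinyLLama",
    "general.hash.sha256": "8b3e00226cc2a55398b1ffbda7af8464040f9cd7b22ccbef8ba60b227924a2b1",
}
blob = cbor2.dumps(metadata)

# ...which a trailer-style format could append after the tensor data
# (along with its offset/length) and read back without touching the tensors.
assert cbor2.loads(blob) == metadata
```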
> `convert_hf_to_gguf.py --outtype q8_0` and `llama_quantize model-F32.gguf model-Q8_0.gguf q8_0` would not result in exactly the same files
That's a bit strange. Does llama-gguf-hash also show a difference? Is this a 'safetensors to GGUF Q8' vs 'GGUF F32 to GGUF Q8' conversion? If so, maybe a difference is valid, since behavior might differ slightly when converting to Q8 from two different float formats?
@mofosyne
> That's a bit strange. Does llama-gguf-hash also show a difference?
No, the difference is only in the metadata, because of the hash introduced in this PR: it depends on the converted output and is not updated by llama-quantize, so it doesn't reflect the tensor contents in that case.
Another solution (instead of updating the hash in llama-quantize) would be to hash the source tensors when converting, but this would not be usable as an integrity check; it would only mark provenance.
> Is this a 'safetensors to GGUF Q8' vs 'GGUF F32 to GGUF Q8' conversion? If so, maybe a difference is valid, since behavior might differ slightly when converting to Q8 from two different float formats?
There is no difference in behavior. See #7234. Internally, Q8_0 conversion is always done from F32.