
SHA256 checksums correctness

anzz1 opened this issue on Mar 21 '23

Not all of these checksums seem to be correct. Are they calculated with the "v2" new model format after the tokenizer change?

PR: https://github.com/ggerganov/llama.cpp/pull/252
Issue: https://github.com/ggerganov/llama.cpp/issues/324

For example, "models/alpaca-7B/ggml-model-q4_0.bin"

v1: 1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13
v2: 8d5562ec1d8a7cfdcf8985a9ddf353339d942c7cf52855a92c9ff59f03b541bc

The SHA256SUMS file has the old v1 hash. Maybe using a naming scheme like "ggml2-model-q4_0.bin" would be good to differentiate between the versions and avoid confusion.

Originally posted by @anzz1 in https://github.com/ggerganov/llama.cpp/issues/338#issuecomment-1478695874

anzz1 commented on Mar 21 '23
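
As an aside, a minimal Python sketch of how a hash like the ones above can be recomputed locally; the path and expected digest are the ones quoted in this comment, and chunked reading keeps memory use flat for multi-GB model files:

```python
# Sketch: recompute the SHA256 of a converted model file so it can be
# compared against the v1/v2 hashes quoted above.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so multi-GB model files don't fill RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected_v2 = "8d5562ec1d8a7cfdcf8985a9ddf353339d942c7cf52855a92c9ff59f03b541bc"
actual = sha256_of("models/alpaca-7B/ggml-model-q4_0.bin")
print("OK" if actual == expected_v2 else "MISMATCH: " + actual)
```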

I'm still in the process of finding/converting the 7B and 13B alpaca models to ggml2.

I'll then recompute all the hashes with the latest build, and also provide a file with the magic numbers and versions for each.

gjmulder commented on Mar 22 '23

The new ggml file format has the version number 1, so calling it ggml2 or "v2" is going to cause confusion. The new file format switched the file magic from "ggml" to "ggmf"; maybe we should lean into that.

Green-Sky commented on Mar 22 '23
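
For reference, a minimal sketch of telling the two on-disk formats apart by peeking at the header instead of hashing. The magic values are assumed from llama.cpp as of this thread: 0x67676d6c ("ggml", the old unversioned format) and 0x67676d66 ("ggmf", followed by a uint32 version, currently 1):

```python
# Sketch: peek at the magic (and, for "ggmf", the version) at the start of
# a model file. Magic values are assumed from llama.cpp as of this thread.
import struct
import sys

def identify(path):
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
        if magic == 0x67676D6C:   # "ggml": old format, no version field
            return "ggml (pre-tokenizer-change, unversioned)"
        if magic == 0x67676D66:   # "ggmf": new format, version follows
            (version,) = struct.unpack("<I", f.read(4))
            return f"ggmf version {version}"
        return f"unknown magic 0x{magic:08x}"

if __name__ == "__main__":
    print(identify(sys.argv[1]))
```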

Some checksums (q4_0 and gptq-4b quantizations, new tokenizer format)

ggml-q4-checksums.zip

Edit: added more checksums.

anzz1 commented on Mar 23 '23

Some checksums (q4_0 quantization, new tokenizer format)

ggml-q4_0-checksums.zip

I'd trust your checksums for the alpaca models over mine.

$ cat SHA256SUMS.gary
alpaca-13B-ggml/ggml-model-q4_0.bin: FAILED
alpaca-13B-ggml/params.json: FAILED open or read
alpaca-13B-ggml/tokenizer.model: FAILED open or read
alpaca-30B-ggml/ggml-model-q4_0.bin: OK
alpaca-30B-ggml/params.json: OK
alpaca-30B-ggml/tokenizer.model: FAILED open or read
alpaca-7B-ggml/ggml-model-q4_0.bin: FAILED
alpaca-7B-ggml/params.json: FAILED open or read
alpaca-7B-ggml/tokenizer.model: FAILED open or read
llama-13B-ggml/ggml-model-q4_0.bin: OK
llama-13B-ggml/ggml-model-q4_0.bin.1: OK
llama-13B-ggml/params.json: OK
llama-13B-ggml/tokenizer.model: FAILED open or read
llama-30B-ggml/ggml-model-q4_0.bin: OK
llama-30B-ggml/ggml-model-q4_0.bin.1: OK
llama-30B-ggml/ggml-model-q4_0.bin.2: OK
llama-30B-ggml/ggml-model-q4_0.bin.3: OK
llama-30B-ggml/params.json: OK
llama-30B-ggml/tokenizer.model: FAILED open or read
llama-65B-ggml/ggml-model-q4_0.bin: OK
llama-65B-ggml/ggml-model-q4_0.bin.1: OK
llama-65B-ggml/ggml-model-q4_0.bin.2: OK
llama-65B-ggml/ggml-model-q4_0.bin.3: OK
llama-65B-ggml/ggml-model-q4_0.bin.4: OK
llama-65B-ggml/ggml-model-q4_0.bin.5: OK
llama-65B-ggml/ggml-model-q4_0.bin.6: OK
llama-65B-ggml/ggml-model-q4_0.bin.7: OK
llama-65B-ggml/params.json: OK
llama-65B-ggml/tokenizer.model: FAILED open or read
llama-7B-ggml/ggml-model-q4_0.bin: OK
llama-7B-ggml/params.json: OK
llama-7B-ggml/tokenizer.model: FAILED open or read

gjmulder commented on Mar 23 '23

The problem with the alpaca models is that there are a lot of different ones, by different people.

Green-Sky commented on Mar 23 '23

> The problem with the alpaca models is that there are a lot of different ones, by different people.

Yes. However, we're supporting them, so we need to decide what we can support.

gjmulder commented on Mar 23 '23

Upvote for @anzz1's new naming convention for the various model subdirs.

gjmulder commented on Mar 23 '23

@anzz1 Why is the tokenizer.model duplicated everywhere? AFAIK there is only one.

Green-Sky commented on Mar 23 '23

@Green-Sky Yeah, there is only one; I might be thinking ahead too much. :smile:

Also added some more checksums for gptq-4b models above: https://github.com/ggerganov/llama.cpp/issues/374#issuecomment-1480719278

anzz1 commented on Mar 23 '23

IMHO we should move the alpaca checksums to a discussion, with a thread for each individual model, with source, credits, and converted checksums. I don't think we can tame the diverse :llama: herd otherwise.

Green-Sky commented on Mar 23 '23

How about an individual SHA256SUMS.model_type file per model type?

That way we have some granularity and it is self-documenting for new users who don't know a llama from an alpaca.

gjmulder commented on Mar 23 '23
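
Either layout verifies the same way. A minimal sketch that checks "digest  path" entries from a SHA256SUMS-style file, optionally restricted to one model subdirectory (file and directory names here are illustrative), printing output in the OK/FAILED style of the listing above:

```python
# Sketch: verify "digest  path" lines from a SHA256SUMS-style file. Whether
# the entries live in one big file or in per-model SHA256SUMS.<model_type>
# files, the check is identical. Names below are illustrative.
import hashlib

def verify(sums_path, prefix=""):
    with open(sums_path) as f:
        for line in f:
            digest, _, path = line.strip().partition("  ")
            if not path.startswith(prefix):
                continue  # restrict to one model subdir, if requested
            try:
                h = hashlib.sha256()
                with open(path, "rb") as m:
                    for chunk in iter(lambda: m.read(1 << 20), b""):
                        h.update(chunk)
                status = "OK" if h.hexdigest() == digest else "FAILED"
            except OSError:
                status = "FAILED open or read"
            print(f"{path}: {status}")

verify("SHA256SUMS", prefix="alpaca-7B-ggml/")
```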

Yes, it might be good to differentiate them, as some have short fur and some long, and some are friendlier than others. But llamas will always be the llamas, and alpacas will be many; llamas are stable, but alpacas are wild cards. I don't see much value in documenting a million different alpaca variations. There should be a standard set to test against, but otherwise there's no point in trying to document every grain of sand on the beach.

One "standard" sum per model type seems to make the most sense. I can't see why they would need to be their own files, though; I'm not a big fan of littering a repo with dozens of files when the same thing can be achieved with dozens of lines in a single file.

I agree this should be moved to Discussions, as it will be an ongoing thing.

anzz1 commented on Mar 23 '23