Gary Mulder

Results: 154 comments by Gary Mulder

122GB. What would be interesting is to benchmark quality versus memory size, i.e. does, say, an fp16 13B model generate better output than an int4 60GB model? @apollotsantos are you...
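As a rough back-of-the-envelope comparison behind that question (weight storage only, ignoring quantization scales, the KV cache and activations, and treating the 7B/13B/30B/65B directory names as approximate parameter counts), the memory arithmetic looks like this:

```
#include <cstdio>

int main() {
    // Approximate parameter counts, using the 7B/13B/30B/65B directory names.
    const double params_b[] = {7, 13, 30, 65};
    const double bits[]     = {16, 8, 4};      // fp16, int8, int4
    const char*  names[]    = {"fp16", "int8", "int4"};

    for (double p : params_b) {
        std::printf("%3.0fB params:", p);
        for (int i = 0; i < 3; ++i) {
            // bytes = params * bits / 8, then convert to GiB
            double gib = p * 1e9 * bits[i] / 8.0 / (1024.0 * 1024.0 * 1024.0);
            std::printf("  %s %6.1f GiB", names[i], gib);
        }
        std::printf("\n");
    }
}
```

By this estimate an fp16 13B model (~24 GiB of weights) is smaller than a 4-bit 65B model (~30 GiB), which is what makes the quality-per-byte question interesting.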

This issue is perhaps misnamed now, as 8-bit will likely improve _quality_ over 4-bit but not _performance_. In summary:

- Inference performance: 4-bit > 8-bit > fp16 (as the code...
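To show where the quality difference comes from, here is a simplified sketch of symmetric 4-bit block quantization, in the spirit of ggml's 4-bit formats but not the exact on-disk layout; the weights are made up:

```
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // One made-up block of weights; ggml groups weights into fixed-size blocks.
    std::vector<float> w = {0.12f, -0.80f, 0.33f, 0.05f, -0.41f, 0.76f, -0.02f, 0.59f};

    // One shared scale per block, chosen so the largest magnitude maps to +/-7.
    float amax = 0.0f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    const float scale = amax / 7.0f;

    double err = 0.0;
    for (float x : w) {
        const int   q  = (int)std::lround(x / scale);   // signed 4-bit code
        const float xq = q * scale;                      // dequantized value
        err += (x - xq) * (x - xq);
        std::printf("%+.3f -> q=%+2d -> %+.3f\n", x, q, xq);
    }
    std::printf("RMS quantization error: %.4f\n", std::sqrt(err / w.size()));
}
```

An 8-bit code has 16x more quantization levels per block than a 4-bit code, so it loses less information, while still moving half the bytes per weight of fp16.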

> So far LLAMA version is quite bad at code generation , otherwise quite good . You might want to read the original paper [LLaMA: Open and Efficient Foundation Language...

That isn't surprising, as each thread may be getting its own [random seed](https://en.wikipedia.org/wiki/Random_seed). Changing the number of threads would then change the random seed initialisation, thus generating different output.
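A minimal sketch of the effect being described, assuming each thread seeds its own RNG from a per-thread offset; this is an illustration, not llama.cpp's actual sampling code, and the thread partitioning is simulated sequentially:

```
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Simulate n_threads workers, each seeding its own RNG and filling its slice.
std::vector<int> run(int n_tokens, int n_threads, unsigned base_seed) {
    std::vector<int> out(n_tokens);
    const int per_thread = (n_tokens + n_threads - 1) / n_threads;
    for (int t = 0; t < n_threads; ++t) {
        std::mt19937 rng(base_seed + t);                 // per-thread seed
        std::uniform_int_distribution<int> dist(0, 99);
        const int end = std::min(n_tokens, (t + 1) * per_thread);
        for (int i = t * per_thread; i < end; ++i)
            out[i] = dist(rng);
    }
    return out;
}

int main() {
    // The same token position lands in a different slice (and RNG stream)
    // depending on the thread count, so the values change.
    for (int n_threads : {1, 2, 4}) {
        std::printf("n_threads = %d:", n_threads);
        for (int v : run(8, n_threads, 42)) std::printf(" %2d", v);
        std::printf("\n");
    }
}
```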

fp16 and 4-bit quantized working for me for 30B and 65B models. I haven't run the smaller models:

```
$ uname -a
Linux asushimu 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20...
```

I just pulled the latest code and will regression check the output with all 4-bit models:

```
$ ls -s ./*/ggml* | sort -k 2,2
15886376 ./30B/ggml-model-f16.bin
15886368 ./30B/ggml-model-f16.bin.1
15886392...
```

Note that as per @ggerganov's correction to my observation in issue #95, the number of threads and other subtleties such as different floating point implementations may prevent us from reproducing...
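One of those subtleties is that floating point addition is not associative, so a reduction split into a different number of partial sums (as a different thread count would do) can give slightly different results. A self-contained illustration, not taken from the llama.cpp code:

```
#include <cstdio>
#include <vector>

// Sum `v` as `n_chunks` interleaved partial sums, then combine --
// mimicking how a threaded reduction changes the order of additions.
float chunked_sum(const std::vector<float>& v, int n_chunks) {
    std::vector<float> partial(n_chunks, 0.0f);
    for (std::size_t i = 0; i < v.size(); ++i)
        partial[i % n_chunks] += v[i];
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> v;
    for (int i = 0; i < 100000; ++i)
        v.push_back(1.0f + 1e-4f * ((i % 7) - 3));   // values near 1 with small jitter

    // Different chunk counts reorder the additions and can change the last digits.
    for (int n : {1, 2, 4, 8})
        std::printf("%d chunk(s): %.9g\n", n, chunked_sum(v, n));
}
```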

The conversion and quantization should be deterministic, so if the bin files don't match, the pth files won't match either:

```
$ md5sum */*pth
0804c42ca65584f50234a86d71e6916a  13B/consolidated.00.pth
016017be6040da87604f77703b92f2bc  13B/consolidated.01.pth
f856e9d99c30855d6ead4d00cc3a5573  30B/consolidated.00.pth
d9dbfbea61309dc1e087f5081e98331a...
```

0.3 to 0.5 looks to be better, especially for the smaller models. The "10 simple steps" looks to be a useful prompt to test each model's ability to count...
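For context, those values presumably refer to the sampling temperature (`--temp`). A minimal sketch of the standard softmax-with-temperature formulation (made-up logits, not llama.cpp's exact sampling code) shows why lower values concentrate probability on the most likely tokens:

```
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax with temperature: logits are divided by T before normalising.
// T < 1 sharpens the distribution, T > 1 flattens it.
std::vector<double> softmax_with_temp(std::vector<double> logits, double temp) {
    const double max_l = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (double& l : logits) {
        l = std::exp((l - max_l) / temp);   // subtract max for numerical stability
        sum += l;
    }
    for (double& l : logits) l /= sum;
    return logits;
}

int main() {
    const std::vector<double> logits = {2.0, 1.0, 0.5, -1.0};   // made-up logits
    for (double temp : {0.3, 0.5, 1.0}) {
        std::printf("temp = %.1f:", temp);
        for (double p : softmax_with_temp(logits, temp)) std::printf(" %.3f", p);
        std::printf("\n");
    }
}
```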

I also explored `--top_k` but suspect it is currently broken. See issue #56.
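For reference, `--top_k` is meant to restrict sampling to the k most likely tokens. A minimal sketch of that filtering step, as a generic illustration independent of whatever is broken in issue #56:

```
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Keep only the k highest logits, softmax over the survivors, sample one token id.
int sample_top_k(const std::vector<float>& logits, int k, std::mt19937& rng) {
    std::vector<std::pair<float, int>> cand;
    for (int i = 0; i < (int)logits.size(); ++i) cand.push_back({logits[i], i});

    const int keep = std::min(k, (int)cand.size());
    std::partial_sort(cand.begin(), cand.begin() + keep, cand.end(),
                      [](const auto& a, const auto& b) { return a.first > b.first; });
    cand.resize(keep);

    std::vector<double> probs;
    double sum = 0.0;
    for (const auto& c : cand) { probs.push_back(std::exp(c.first)); sum += probs.back(); }
    for (double& p : probs) p /= sum;

    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return cand[dist(rng)].second;
}

int main() {
    std::mt19937 rng(42);
    const std::vector<float> logits = {0.1f, 2.5f, 1.7f, -0.3f, 0.9f};   // made-up logits
    for (int i = 0; i < 5; ++i)
        std::printf("sampled token id: %d\n", sample_top_k(logits, 2, rng));
}
```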