Qingyou Meng comments

Results: 47 comments by Qingyou Meng

fix typo in comment

> rebase error

Would you please fix the "Succesfully" typo in the last line of quantize.py as well?

sha256 check sums to verify original and converted model data

Good job, thanks a lot :) I personally recommend renaming `SHA256SUMS` to `models.sha256`, for the following reasons:

* Prefix with `models`, so it's clear that this file belongs to models....

sha256 check sums to verify original and converted model data

I can confirm that the `ggml-model-q4*` files in 7B and 13B do not match the checksums in SHA256SUMS. Local checksums:

- b85058443e89dabdf674d5018d979f0d682977f8413f05b5fd235d36d7a8ff82 models/7B/ggml-model-q4_0.bin
- d3ab1548a2d19d989c1e7ea1130ab8c6300c75941a434a2d333ef564f989131b models/13B/ggml-model-q4_0.bin
- 38b705ce6c5baba4bb6056f11189a4ad21b70591258422450e30495b6ccd8521 models/13B/ggml-model-q4_0.bin.1

Other files in 7B and 13B are...
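For anyone who wants to repeat this kind of comparison, here is a minimal sketch in Python. It assumes the sums file uses the usual `sha256sum` layout (`<hex digest>  <path>` per line) and that paths are relative to a root directory. The helper names `sha256_of`, `parse_sums`, and `find_mismatches` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse sha256sum-style lines ("<hex>  <path>") into {path: hex}.

    Assumption: blank lines and lines starting with "#" are ignored.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading "*" on the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def find_mismatches(sums_file, root="."):
    """Return listed paths whose local digest differs from the expected one.

    Assumption of this sketch: files that do not exist locally are skipped
    rather than reported.
    """
    expected = parse_sums(Path(sums_file).read_text())
    mismatches = []
    for name, digest in expected.items():
        path = Path(root) / name
        if path.is_file() and sha256_of(path) != digest:
            mismatches.append(name)
    return mismatches
```

Calling `find_mismatches("SHA256SUMS")` from the repository root would list the files that differ. Reading in chunks keeps memory use flat for multi-gigabyte model files.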

[Proof of concept] threading: preemptive, local/global

> Few observations, without looking into details:
>
> * The time/token on M1 is the same, but the CPU usage is also the same as busy loop (i.e. CPU...

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

The CMake build does not work; perhaps mulmat-tune.[c,h] should be moved to the root dir.

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

> > CMakeFiles does not work, perhaps should move mulmat-tune.[c,h] to root dir.
>
> I think so. It seems to be another part of ggml, so I would...

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

> I was thinking recently that better threading would be nice to have.
>
> Anyways, I didn't yet look at the PR in detail but I can already give...

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

I'll try to fix the CMake build. I'm not familiar with it, so I will use the configuration of ggml-opencl as a reference.

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

> Is it optional? Because ggml-opencl is optional.

As far as I know, `ggml-opencl` is controlled by a compile flag named `LLAMA_OPENCL`, while `mulmat-tune` doesn't have any compile flags at present....

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

@SlyEcho I just tried 3B, and it's amazingly faster than 7B! Thanks! BTW, the mulmat-tune tool supports 3B now. I also added an environment variable named `LLAMA_MULMAT_TUNE_DATA_DIR` for ease of switching between...