llama.cpp
Parallel Quantize.sh, add &
@prusnak
./quantize "$i" "${i/f16/q4_0}" 2 &
The fix needs to be more elaborate: if you pass --remove-f16, then the rm command is called before ./quantize has finished.
Can you come up with a solution that does not have this issue?
This should work:
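A minimal sketch of one such approach, grouping each ./quantize call with its rm in a backgrounded subshell so that the f16 file is removed only after its own conversion has finished (the model path glob and the flag parsing here are assumptions for illustration, not the actual quantize.sh contents):

#!/usr/bin/env bash
# Sketch only: the glob and flag parsing are placeholders.
remove_f16=0
[ "$2" = "--remove-f16" ] && remove_f16=1

for i in models/"$1"/ggml-model-f16.bin*; do
  (
    # rm runs only after this particular ./quantize has finished
    ./quantize "$i" "${i/f16/q4_0}" 2
    if [ "$remove_f16" = "1" ]; then
      rm "$i"
    fi
  ) &
done
wait  # do not exit until every background quantize job is done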
Yes, this works. But now I realised this completely defeats the purpose of the remove flag: the flag is there to save disk space after each conversion has finished, so it only makes sense when processing the files one after another.
@ggerganov Do you think it makes sense to run the script in parallel by default and switch to serial processing when --remove-f16 is provided, or do we want to have a separate, orthogonal flag for parallel/serial processing?
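As a rough illustration of the first option (parallel by default, falling back to serial whenever --remove-f16 is given so each f16 file can be deleted as soon as its own conversion finishes; the glob and the remove_f16 variable are placeholders, not the actual script):

for i in models/"$1"/ggml-model-f16.bin*; do
  if [ "$remove_f16" = "1" ]; then
    # serial: delete the f16 file right after its conversion is done
    ./quantize "$i" "${i/f16/q4_0}" 2
    rm "$i"
  else
    # parallel: background the job, nothing to delete afterwards
    ./quantize "$i" "${i/f16/q4_0}" 2 &
  fi
done
wait  # only relevant for the parallel case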
ah I see what you mean, swapping disk resources
I think it is better to multi-thread the quantize.cpp program.
Each tensor is divided into n parts, and each of the n threads quantizes the corresponding part.
This way, even when quantizing the 7B model which has only 1 part, we will utilize all available CPU resources and still gain performance.
If you agree, either reformulate this issue and add the "good first issue" tag, or create a new one and close this one.
I think it is better to multi-thread the quantize.cpp program.
I agree. This makes sense especially for this reason:
This way, even when quantizing the 7B model which has only 1 part, we will utilize all available CPU resources
If you agree, ...
ACK
FWIW, I really respect your shell skills @tljstewart 👍
Done another way (rewritten in Python) in https://github.com/ggerganov/llama.cpp/pull/222