llama : tool for evaluating quantization results per layer
Following up on #2421, I think we should implement a better way to observe at which point of the inference the results start to deviate significantly between the original (unquantized) and quantized models.
So I'm thinking of adding a simple tool that takes as input 2 exported ggml graphs of the same model - one from the unquantized model and one from the quantized one. The tool evaluates both graphs on the CPU using ggml and prints detailed statistics of the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print things like min, max, avg, var, etc.
I'm hoping that with such a tool we'll be able to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that this can eventually become an automated way of deciding which tensors require more bits than others.
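Roughly, the per-node comparison could look something like this - just a sketch, where `print_node_stats` and the exact set of statistics are placeholders, and both results are assumed to already be F32 buffers in host memory:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>

// hypothetical helper: compare the F32 result of one named graph node between
// the unquantized (a) and quantized (b) evaluations; n is the element count
static void print_node_stats(const char * name, const float * a, const float * b, size_t n) {
    if (n == 0) {
        return;
    }

    double min_a =  INFINITY, max_a = -INFINITY;
    double sum_a = 0.0, sum_a2 = 0.0;
    double sum_d2 = 0.0, max_d = 0.0;

    for (size_t i = 0; i < n; ++i) {
        min_a  = std::min(min_a, (double) a[i]);
        max_a  = std::max(max_a, (double) a[i]);
        sum_a  += a[i];
        sum_a2 += (double) a[i] * a[i];

        const double d = (double) a[i] - b[i];
        sum_d2 += d * d;
        max_d   = std::max(max_d, std::fabs(d));
    }

    const double avg  = sum_a / n;
    const double var  = sum_a2 / n - avg * avg;
    const double rmse = std::sqrt(sum_d2 / n);

    printf("%-40s min=%10.4f max=%10.4f avg=%10.4f var=%10.4f rmse=%10.5f maxdiff=%10.5f\n",
           name, min_a, max_a, avg, var, rmse, max_d);
}
```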
cc @slaren. I know you had similar ideas - we can discuss here how to add such support.
Currently I think the ggml graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing llama.cpp implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all intermediate results are available in memory after the computation.
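With the allocator disabled, collecting the intermediate results could look roughly like the sketch below. This assumes the imported graph has already been evaluated with ggml on the CPU, and it accesses the `ggml_cgraph` fields directly as in the current ggml version; the follow-up step of passing each result to the comparison routine is left as a placeholder:

```cpp
#include "ggml.h"
#include <cstddef>

// sketch: after evaluating an imported graph on the CPU, walk its nodes and
// collect every named F32 result for comparison against the other graph
static void collect_named_results(const struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        const struct ggml_tensor * node = gf->nodes[i];
        const char * name = ggml_get_name(node);

        if (name[0] == '\0' || node->type != GGML_TYPE_F32) {
            continue; // only named F32 results are compared
        }

        const float * data = (const float *) node->data;
        const size_t  n    = (size_t) ggml_nelements(node);

        // ... look up the node with the same name in the other graph and
        //     pass (name, data, n) to the per-node comparison sketched above
        (void) data; (void) n;
    }
}
```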
> I'm hoping that with such a tool we'll be able to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that this can eventually become an automated way of deciding which tensors require more bits than others.
I experimented with cutting outliers in the weight distribution to get a better weight spread in the quantization. It seemed to work well to measure the standard deviation of the tensor weights (in each normalization block) and clip all weights that fell outside of about 4 to 5 SDs. I only tested this approach with the same cutoff on all tensors, but I guess the best number of SDs to use as the cutoff will depend on the tensor.
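Roughly what I mean, as a sketch - the 4.5 SD cutoff and the block size of 32 are only example values, and the function name is made up:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// sketch: clamp outliers in each quantization block to +/- n_sd standard
// deviations before quantizing; QK is the block size (e.g. 32 for Q4_0)
static void clip_outliers(float * w, size_t n, size_t QK = 32, float n_sd = 4.5f) {
    for (size_t b = 0; b < n; b += QK) {
        const size_t m = std::min(QK, n - b);

        // mean and standard deviation of the block
        double sum = 0.0, sum2 = 0.0;
        for (size_t i = 0; i < m; ++i) {
            sum  += w[b + i];
            sum2 += (double) w[b + i] * w[b + i];
        }
        const double mean = sum / m;
        const double sd   = std::sqrt(std::max(0.0, sum2 / m - mean * mean));

        // clamp everything outside the cutoff
        const float lo = (float) (mean - n_sd * sd);
        const float hi = (float) (mean + n_sd * sd);
        for (size_t i = 0; i < m; ++i) {
            w[b + i] = std::min(hi, std::max(lo, w[b + i]));
        }
    }
}
```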
> The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all intermediate results are available in memory after the computation.
In-place operations would also need to be disabled, otherwise they will overwrite the result of the previous operation. I am not sure it is worth keeping the in-place operations at all: they create other issues, and if memory usage is a concern, ggml-alloc will already make operations in-place automatically when possible.
@klosax Sure, we can discuss strategies after we have the tool ready.
@slaren Yes, let's avoid using those for now, and at some point we can also remove them from the ggml API.
I'm currently gathering data on quantization quality metrics as indicated in https://github.com/ggerganov/llama.cpp/pull/2657. I will write up a proper report in the next few days, but one of my preliminary findings is that quantization mostly adds noise to the logits, which then manifests as higher perplexity due to asymmetry in the calculation. So I think a good method of investigating per-tensor sensitivity to quantization would be to take an unquantized model, add noise to a tensor, and then look at how this changes the perplexity/output logits.
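For example - only a sketch, and scaling the noise to a fraction of the tensor's RMS is just one possible choice - injecting Gaussian noise into a single F32 tensor before re-running the perplexity computation could look like this:

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// sketch: perturb one F32 tensor of an unquantized model with Gaussian noise
// whose standard deviation is a fraction `level` of the tensor's RMS, then
// measure how much the perplexity / output logits change
static void add_noise(float * w, size_t n, float level, unsigned seed = 1234) {
    if (n == 0) {
        return;
    }

    double sum2 = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum2 += (double) w[i] * w[i];
    }
    const double rms = std::sqrt(sum2 / n);

    std::mt19937 rng(seed);
    std::normal_distribution<float> dist(0.0f, (float) (level * rms));

    for (size_t i = 0; i < n; ++i) {
        w[i] += dist(rng);
    }
}
```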
Not stale, though low-prio
This issue was closed because it has been inactive for 14 days since being marked as stale.