Ma Xinyin

Results 58 comments of Ma Xinyin

Have you tried the copied version of `decapoda-research/llama-7b-hf`, e.g., https://huggingface.co/baffo32/decapoda-research-llama-7B-hf? We will try that kind of checkpoint in the next few days to see whether the results are reproducible with those available copies.

I have no idea about this😢. I can only guess at two possible reasons: (1) the EOS-token issue, or (2) the weights of the two checkpoints differ slightly.

Does that mean that you want to use those weights that have already been pruned?

You can get the positions of the pruned parameters by replacing [Line 150](https://github.com/horseee/LLM-Pruner/blob/1455fa9646bc8b87ccbc613cf1b97e5729e06152/hf_prune.py#L150) in hf_prune.py with:

```python
for group in pruner.step(interactive=True):
    print(group.details())
    group.prune()
```

And the location of the pruned weights...

Not exactly. There's just a minor detail that needs to be corrected. Take this example: the down_proj Linear layer has in_features=11008 and out_features=4096, which, in PyTorch, would create a...
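A minimal sketch of the detail above: PyTorch's `nn.Linear(in_features, out_features)` stores its weight with shape `(out_features, in_features)`, so pruning the MLP's intermediate channels removes *columns* of down_proj but *rows* of gate_proj/up_proj. The helper below is hypothetical, for illustration only, using the LLaMA-7B sizes from the example:

```python
def pruned_mlp_shapes(hidden=4096, intermediate=11008, ratio=0.5):
    """Weight shapes after pruning `ratio` of the intermediate channels.

    down_proj(in=11008, out=4096) has a (4096, 11008) weight, so channel
    pruning shrinks its second dimension; gate_proj/up_proj map
    hidden -> intermediate, so pruning shrinks their first dimension.
    """
    kept = intermediate - int(intermediate * ratio)
    return {
        "gate_proj": (kept, hidden),   # hidden -> intermediate (rows pruned)
        "up_proj":   (kept, hidden),   # hidden -> intermediate (rows pruned)
        "down_proj": (hidden, kept),   # intermediate -> hidden (columns pruned)
    }

print(pruned_mlp_shapes())
# → {'gate_proj': (5504, 4096), 'up_proj': (5504, 4096), 'down_proj': (4096, 5504)}
```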

> I guess I only need to care about things after "=>", right?

That's correct. The left side of `=>` serves as the trigger, and the pruning process only affects...

Hi. I'm not sure how you set the 50% sparsity. If you set the pruning ratio to 50% in the command, there are several factors that would cause the parameters...
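One such factor can be sketched with back-of-the-envelope arithmetic: if channel pruning is applied only to a subset of layers while the embedding/lm_head and the excluded layers keep all their weights, a 50% pruning ratio removes far less than 50% of the parameters. The layer count, the "26 pruned layers", and the LLaMA-7B dimensions below are illustrative assumptions, not LLM-Pruner's exact configuration:

```python
def remaining_param_fraction(ratio=0.5, num_pruned_layers=26,
                             hidden=4096, inter=11008,
                             layers=32, vocab=32000):
    """Rough fraction of parameters left after channel pruning.

    Ignores norms and biases; assumed numbers follow LLaMA-7B.
    """
    attn = 4 * hidden * hidden       # q/k/v/o projections
    mlp = 3 * hidden * inter         # gate/up/down projections
    per_layer = attn + mlp
    embed = 2 * vocab * hidden       # embed_tokens + lm_head, never pruned
    total = layers * per_layer + embed
    removed = num_pruned_layers * per_layer * ratio
    return (total - removed) / total

print(f"{remaining_param_fraction():.2%}")  # roughly 61% of parameters remain
```

So under these assumptions, a 50% channel ratio removes only about 39% of the total parameters.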

> I assume the correct way to do it would go something like:
> 0. (optional) Increase size and topic breadth of LLM-Pruner Corpus
> 1. LLM-Pruner
> 2. LoRA/QLoRa...

> @horseee Hi, may I ask why you don't compare the results between pure quantization and pure pruning in the paper?

Hi. Quantization is orthogonal to pruning and hence...

Hi. We conducted a quick experiment and here is the inference performance:

| Pruning Ratio | #Param | Memory | Latency | Speedup | BoolQ | PIQA | HellaSwag |...