
Results don't match

Open jeohalves opened this issue 1 year ago • 4 comments

Greetings,

I've done some experiments with PrunerZero and Wanda and saw that there are some results that don't match with the paper. Please find below the obtained results:

| Method | BoolQ | RTE | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Mean |
|---|---|---|---|---|---|---|---|---|
| Wanda | 76.61% | 53.07% | 52.70% | 67.96% | 72.31% | 38.91% | 30.80% | 56.05% |
| Wanda (norm) | 76.61% | 53.07% | 70.93% | 67.96% | 69.19% | 43.00% | 43.20% | 60.57% |
| PrunerZero | 70.37% | 53.07% | 51.15% | 66.22% | 71.72% | 36.69% | 28.00% | 53.89% |
| PrunerZero (norm) | 70.37% | 53.07% | 68.90% | 66.22% | 67.89% | 39.08% | 40.80% | 58.05% |

And the results from the paper:

[Screenshot of the corresponding results table from the paper, with some cells highlighted in red, purple, and yellow]

For tasks that report a normalized accuracy (red and purple in the screenshot), I've put the normalized results on a separate line. For tasks that don't report one (yellow in the screenshot), I simply repeated the plain accuracy.
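For concreteness, here is a minimal sketch of how the two Wanda means above are computed. The per-task scores are the Wanda values from my table; the dict layout only mimics (and is not literally) lm-evaluation-harness output:

```python
# Wanda per-task scores (as fractions) from the table above; acc_norm is
# present only for the tasks that report a normalized accuracy.
results = {
    "boolq":         {"acc": 0.7661},
    "rte":           {"acc": 0.5307},
    "hellaswag":     {"acc": 0.5270, "acc_norm": 0.7093},
    "winogrande":    {"acc": 0.6796},
    "arc_easy":      {"acc": 0.7231, "acc_norm": 0.6919},
    "arc_challenge": {"acc": 0.3891, "acc_norm": 0.4300},
    "openbookqa":    {"acc": 0.3080, "acc_norm": 0.4320},
}

def mean_acc(results, use_norm):
    """Mean over tasks, optionally preferring acc_norm where it exists."""
    vals = [r["acc_norm"] if use_norm and "acc_norm" in r else r["acc"]
            for r in results.values()]
    return 100 * sum(vals) / len(vals)

print(f"plain acc mean:           {mean_acc(results, use_norm=False):.2f}%")  # 56.05%
print(f"acc_norm where available: {mean_acc(results, use_norm=True):.2f}%")   # 60.57%
```

These reproduce the 56.05% and 60.57% means in the Wanda rows above.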

Perhaps there was a mistake and the normalized accuracy was reported only for PrunerZero. Could you please check?

Best regards!

jeohalves avatar Jun 25 '24 09:06 jeohalves

Is there any update regarding this issue?

jeohalves avatar Aug 12 '24 19:08 jeohalves

Hi, sorry for the late reply. We report the higher of the normalized and non-normalized results.

Basically, your results are the same as ours. Due to differences in CUDA version, GPU, and other hardware, some deviation is expected, which seems acceptable.

Here is my recipe:

  • CUDA 12.0
  • Python 3.9
  • A6000
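The selection rule stated above (take the higher of the normalized and non-normalized score per task) can be sketched as follows. The per-task values are the PrunerZero rows from the table earlier in this issue; the dict layout only imitates lm-evaluation-harness output:

```python
# PrunerZero per-task scores (as fractions) from the issue's table.
prunerzero = {
    "boolq":         {"acc": 0.7037},
    "rte":           {"acc": 0.5307},
    "hellaswag":     {"acc": 0.5115, "acc_norm": 0.6890},
    "winogrande":    {"acc": 0.6622},
    "arc_easy":      {"acc": 0.7172, "acc_norm": 0.6789},
    "arc_challenge": {"acc": 0.3669, "acc_norm": 0.3908},
    "openbookqa":    {"acc": 0.2800, "acc_norm": 0.4080},
}

def best_score(task_result):
    """Per-task score under the rule above: the higher of acc and acc_norm."""
    return max(task_result.get("acc", 0.0), task_result.get("acc_norm", 0.0))

mean = 100 * sum(best_score(r) for r in prunerzero.values()) / len(prunerzero)
print(f"PrunerZero mean (higher-of rule): {mean:.2f}%")  # 58.59%
```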

Best regards

pprp avatar Aug 19 '24 02:08 pprp

I'm sorry, but this is clearly wrong. If you take the higher value for Pruner-Zero, you should apply the same rule to the other methods (like Wanda) as well. As we can see, Wanda reaches a mean of 60.57% when the normalized accuracy is used. Other works didn't use the normalized accuracy either. PrunerZero should at least beat plain magnitude pruning, but it's worse than SparseGPT and Wanda.

jeohalves avatar Aug 20 '24 11:08 jeohalves

Thank you for pointing that out. I will recheck it in the coming days. Perhaps using downstream-task performance as the fitness metric would be a better approach.

pprp avatar Aug 20 '24 14:08 pprp

Hi,

Is there any update on this reproducibility issue? I see that it has been closed.

Thanks!

dalistarh avatar Nov 27 '24 07:11 dalistarh

Dear Authors,

Hope this message finds you well. I have tried your code to prune Llama-2-7B with Wanda and Pruner-Zero. The evaluation was done with the lm-evaluation-harness package you provided, and I obtain the same results as reported in this issue. Do you have any updates?

Thanks a lot.

Tangshengku avatar Nov 28 '24 08:11 Tangshengku