
Results don't match

Open jeohalves opened this issue 1 year ago • 4 comments

Greetings,

I've done some experiments with PrunerZero and Wanda and saw that there are some results that don't match with the paper. Please find below the obtained results:

| Method | BoolQ | RTE | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Mean |
|---|---|---|---|---|---|---|---|---|
| Wanda | 76.61% | 53.07% | 52.70% | 67.96% | 72.31% | 38.91% | 30.80% | 56.05% |
| Wanda (norm) | 76.61% | 53.07% | 70.93% | 67.96% | 69.19% | 43.00% | 43.20% | 60.57% |
| PrunerZero | 70.37% | 53.07% | 51.15% | 66.22% | 71.72% | 36.69% | 28.00% | 53.89% |
| PrunerZero (norm) | 70.37% | 53.07% | 68.90% | 66.22% | 67.89% | 39.08% | 40.80% | 58.05% |

And the results from the paper:

[Screenshot of the corresponding results table from the paper, with some cells highlighted in red, purple, and yellow]

For tasks that report a normalized accuracy (red and purple in the screenshot), I've put the normalized results on a separate line. For tasks that don't report one (yellow in the screenshot), I simply repeated the plain accuracy.
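For concreteness, here is a minimal sketch of how the two Wanda means above are computed. The per-task scores are the Wanda values from my table; the dict layout only mimics (and is not literally) lm-evaluation-harness output:

```python
# Wanda per-task scores (as fractions) from the table above; acc_norm is
# present only for the tasks that report a normalized accuracy.
results = {
    "boolq":         {"acc": 0.7661},
    "rte":           {"acc": 0.5307},
    "hellaswag":     {"acc": 0.5270, "acc_norm": 0.7093},
    "winogrande":    {"acc": 0.6796},
    "arc_easy":      {"acc": 0.7231, "acc_norm": 0.6919},
    "arc_challenge": {"acc": 0.3891, "acc_norm": 0.4300},
    "openbookqa":    {"acc": 0.3080, "acc_norm": 0.4320},
}

def mean_acc(results, use_norm):
    """Mean over tasks, optionally preferring acc_norm where it exists."""
    vals = [r["acc_norm"] if use_norm and "acc_norm" in r else r["acc"]
            for r in results.values()]
    return 100 * sum(vals) / len(vals)

print(f"plain acc mean:           {mean_acc(results, use_norm=False):.2f}%")  # 56.05%
print(f"acc_norm where available: {mean_acc(results, use_norm=True):.2f}%")   # 60.57%
```

These reproduce the 56.05% and 60.57% means in the Wanda rows above.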

Perhaps there was a mistake and the normalized accuracy was reported only for PrunerZero. Could you please check?

Best regards!

jeohalves avatar Jun 25 '24 09:06 jeohalves

Is there any update regarding this issue?

jeohalves avatar Aug 12 '24 19:08 jeohalves

Hi, sorry for the late reply. We report the higher of the normalized and non-normalized results.

Basically, your results are the same as ours. Due to differences in CUDA version, GPU, and other hardware, some deviation is expected, which seems acceptable.

Here is my recipe:

  • CUDA 12.0
  • Python 3.9
  • A6000
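The selection rule stated above (take the higher of the normalized and non-normalized score per task) can be sketched as follows. The per-task values are the PrunerZero rows from the table earlier in this issue; the dict layout only imitates lm-evaluation-harness output:

```python
# PrunerZero per-task scores (as fractions) from the issue's table.
prunerzero = {
    "boolq":         {"acc": 0.7037},
    "rte":           {"acc": 0.5307},
    "hellaswag":     {"acc": 0.5115, "acc_norm": 0.6890},
    "winogrande":    {"acc": 0.6622},
    "arc_easy":      {"acc": 0.7172, "acc_norm": 0.6789},
    "arc_challenge": {"acc": 0.3669, "acc_norm": 0.3908},
    "openbookqa":    {"acc": 0.2800, "acc_norm": 0.4080},
}

def best_score(task_result):
    """Per-task score under the rule above: the higher of acc and acc_norm."""
    return max(task_result.get("acc", 0.0), task_result.get("acc_norm", 0.0))

mean = 100 * sum(best_score(r) for r in prunerzero.values()) / len(prunerzero)
print(f"PrunerZero mean (higher-of rule): {mean:.2f}%")  # 58.59%
```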

Best regards

pprp avatar Aug 19 '24 02:08 pprp

I'm sorry, but this is clearly wrong. If you take the higher value for Pruner-Zero, you should apply the same rule to the other methods (like Wanda) as well. As we can see, Wanda reaches a mean of 60.57% when the normalized accuracy is used. Other works didn't use the normalized accuracy either. PrunerZero should at least beat plain magnitude pruning, but it's worse than SparseGPT and Wanda.

jeohalves avatar Aug 20 '24 11:08 jeohalves

Thank you for pointing that out. I will recheck it in the coming days. Perhaps using downstream-task performance as the fitness metric would be a better approach.

pprp avatar Aug 20 '24 14:08 pprp

Hi,

Is there any update on this reproducibility issue? I see that it has been closed.

Thanks!

dalistarh avatar Nov 27 '24 07:11 dalistarh

Dear Authors,

Hope this message finds you well. I have tried your code to prune Llama-2-7B with Wanda and Pruner-Zero. The evaluation was done with the lm-evaluation-harness package you provided, and I obtain the same results as reported in this issue. Do you have any updates?

Thanks a lot.

Tangshengku avatar Nov 28 '24 08:11 Tangshengku