
Problem in reproducing the multi-distillation approach

Open · kongds opened this issue 3 years ago · 12 comments

Hello, thank you for providing the code.

I can get the correct W1A1 results with bash scripts/run_glue.sh MNLI (around 77 accuracy on MNLI).

But when I try to reproduce W1A1 with the multi-distillation approach (W32A32 -> W1A2 -> W1A1), I cannot reproduce the paper's W1A2 results by simply changing abits=1 to abits=2 in scripts/run_glue.sh (the W1A2 result I get is 80.96/81.36).
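For reference, this is a sketch of the staged flow I tried. The teacher/checkpoint wiring between stages is my assumption; in the released script I only changed the abits/wbits variables:

```bash
# Sketch of the three-stage flow (W32A32 -> W1A2 -> W1A1).
# How each stage picks up the previous checkpoint as its teacher is assumed;
# I only edited the abits/wbits variables inside scripts/run_glue.sh.

# Stage 1: fine-tune the full-precision teacher (abits=32, wbits=32).
bash scripts/run_glue.sh MNLI

# Stage 2: distill a W1A2 student (abits=2, wbits=1, teacher = stage-1 output, assumed).
bash scripts/run_glue.sh MNLI

# Stage 3: distill the W1A1 student (abits=1, wbits=1, teacher = stage-2 output, assumed).
bash scripts/run_glue.sh MNLI
```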

Can you share the detailed settings of the multi-distillation approach?

kongds · Oct 26 '22

Hello, I've met the same problem, but I could not get the right results for W1A1 (around 52 accuracy on RTE), and when I tried to train W1A2 the result was even worse (50%). May I ask if you tried to reproduce RTE?

TTTTTTris · Nov 30 '22

I didn't run RTE, but I have tried STS-B. My W1A1 result is around 67.0, compared to 71.1 in the paper.

kongds · Nov 30 '22

The results of STS-B are 67.7 (W1A1 w/o multi-distill), 73.5 (W1A2), and 58.0 (W1A1 with multi-distill), still lower than the paper. I didn't use data parallel.

TTTTTTris · Dec 01 '22

It seems that we cannot reproduce the STS-B result. The STS-B settings are here: https://github.com/facebookresearch/bit/blob/071a9749e024e8e151c55adbeb6ef3aaf5b8a283/utils_glue.py#L689 According to the paper, the authors used grid search to obtain the STS-B result: [screenshot from the paper]
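For what it's worth, a minimal sketch of such a grid search; the value ranges are placeholders, and LR/SEED being readable from the environment is my assumption, not something the released script documents:

```bash
# Hypothetical grid over learning rate and seed for STS-B.
# The ranges below are placeholders, not the paper's actual grid.
for lr in 1e-4 2e-4 5e-4; do
  for seed in 1 2 3; do
    # Assumes run_glue.sh reads LR/SEED from the environment;
    # if not, patch the corresponding variables inside the script instead.
    LR=$lr SEED=$seed bash scripts/run_glue.sh STS-B
  done
done
```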

kongds · Dec 01 '22

Hello, I've met the same problem: I also could not get the right result for W1A1 STS-B (around 68, compared to 71 reported in the paper). May I ask whether you have figured out the reason? @kongds

NicoNico6 · Mar 29 '23

Hi, I still can't get the correct result for W1A1 STS-B and don't know why.

kongds · Mar 30 '23

It is also difficult for me. I have tried most of the W1A2 experiments (all with a clear accuracy gap), and I want to cite and compare BiT in my paper, but the gap really confuses me.

NicoNico6 · Apr 02 '23

I cannot get the accuracy reported in the paper on most W1A2 or W1A4 tasks; the gap is about 10 points.

TTTTTTris · Apr 02 '23

> I cannot get the accuracy reported in the paper on most W1A2 or W1A4 tasks; the gap is about 10 points.

Maybe the released version is not the optimal version.

NicoNico6 · Apr 02 '23

I can reproduce the 1-1-1 BERT on all datasets without multi-distillation, but for 1-1-4 and 1-1-2 BERT my results are way off. Is anyone (@kongds @NicoNico6 @TTTTTTris @likethesky @Celebio) getting the same thing?

Phuoc-Hoan-Le · Apr 12 '23

> I can reproduce the 1-1-1 BERT on all datasets without multi-distillation, but for 1-1-4 and 1-1-2 BERT my results are way off. Is anyone (@kongds @NicoNico6 @TTTTTTris @likethesky @Celebio) getting the same thing?

Hi, I also found this problem.

Besides, I tried to evaluate the released pre-trained models, but I cannot get the accuracy reported in the README table. For example, with data augmentation, the reported accuracy of the released pretrained models is RTE: 69.7, MRPC: 88, STS-B: 84.2.

However, when I ran the evaluation myself on the released checkpoints, the corresponding results were RTE: 66 vs. 69.7, MRPC: 85.5 vs. 88, STS-B: 82.3 vs. 84.2.

Did you find the same issue?
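For context, this is roughly how I ran the evaluation; the checkpoint path is hypothetical, and the eval-only mode is something I patched in myself, not a documented switch:

```bash
# Evaluate a released checkpoint without retraining.
MODEL_DIR=checkpoints/BiT_RTE_aug   # hypothetical path to the released model
# I disabled the training loop and pointed the script's model directory
# at $MODEL_DIR by hand; the released script has no documented eval-only flag.
bash scripts/run_glue.sh RTE
```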

NicoNico6 · Apr 15 '23

> Besides, I tried to evaluate the released pre-trained models, but I cannot get the accuracy reported in the README table. [...] Did you find the same issue?

Have you tried doing a grid search over the hyperparameters to see if that works?

Phuoc-Hoan-Le · Apr 18 '23