[P1] Difficulty reproducing MRPC results
Hi, I am running the original task_steer.py script with the arguments below, but I am struggling to reproduce the reported results on the GLUE MRPC dataset. Are these the correct arguments to use?
```python
epochs: int = 40
lr: float = 0.0003
position: str = "f3"
rank: int = 1
dropout: float = 0.05
weight_decay: float = 0.0
warmup_ratio: float = 0.0
reft_intervention: str = "ConditionedSourceLowRankRotatedSpaceIntervention"
layers: str = "all"
batch_size: int = 32
eval_batch_size: int = 32
accumulation_steps: int = 1
max_grad_norm: float = 1.0
logging_steps: int = 20
max_length: int = 256
seeds: list[int] = [42, 43, 44, 45, 46]
task: str = "glue"
train_dataset: str = "mrpc"
model_name: str = "FacebookAI/roberta-base"
test_split: str = "test"
metric: str = "accuracy"
```
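For concreteness, a sweep over those seeds could look like the sketch below; the `--flag` names are inferred from the field names above, and whether task_steer.py takes one `--seed` per run or a `--seeds` list is an assumption on my part:

```python
# Hypothetical launcher sketch: one task_steer.py run per seed, with the
# settings above passed as CLI flags. Flag names are guessed from the field
# names and may not match the script's real argument parser.
import subprocess

ARGS = {
    "epochs": 40, "lr": 0.0003, "position": "f3", "rank": 1,
    "dropout": 0.05, "weight_decay": 0.0, "warmup_ratio": 0.0,
    "reft_intervention": "ConditionedSourceLowRankRotatedSpaceIntervention",
    "layers": "all", "batch_size": 32, "eval_batch_size": 32,
    "accumulation_steps": 1, "max_grad_norm": 1.0, "logging_steps": 20,
    "max_length": 256, "task": "glue", "train_dataset": "mrpc",
    "model_name": "FacebookAI/roberta-base", "test_split": "test",
    "metric": "accuracy",
}

for seed in [42, 43, 44, 45, 46]:
    cmd = ["python", "task_steer.py", "--seed", str(seed)]
    for name, value in ARGS.items():
        cmd += [f"--{name}", str(value)]
    subprocess.run(cmd, check=True)
```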
Hi Nitay,
Thanks for raising the question. Did you take a look at this thread? As I mentioned before, GLUE (especially smaller tasks like MRPC) is a little unstable. Can you also post your average accuracy on MRPC? Thanks.
I applied the fixes I mentioned here: #177
These are my library versions:
```
torch           : 2.6.0+cu124
torchvision     : 0.21.0+cu124
pyvene          : 0.1.8
transformers    : 4.52.4
protobuf        : 3.20.3
matplotlib      : 3.10.7
ipywidgets      : 8.1.5
plotnine        : 0.14.5
huggingface_hub : 0.33.1
numpy           : 1.26.4
accelerate      : 1.8.1
sentencepiece   : 0.2.0
evaluate        : 0.4.6
datasets        : 3.6.0
wandb           : 0.20.1
scikit_learn    : 1.2.2
jupyter         : ❌ Not installed (PackageNotFoundError)
fsspec          : 2025.3.0
ydata_profiling : 4.16.1
seaborn         : 0.12.2
```
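In case anyone wants to reproduce this listing, a minimal version dump can be done with `importlib.metadata` (package names below are assumed to be the PyPI distribution names):

```python
# Minimal environment dump: prints the installed version of each package,
# or a note if it is missing.
from importlib.metadata import version, PackageNotFoundError

PACKAGES = [
    "torch", "torchvision", "pyvene", "transformers", "protobuf",
    "matplotlib", "ipywidgets", "plotnine", "huggingface_hub", "numpy",
    "accelerate", "sentencepiece", "evaluate", "datasets", "wandb",
    "scikit-learn", "jupyter", "fsspec", "ydata-profiling", "seaborn",
]

for name in PACKAGES:
    try:
        print(f"{name:16s}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name:16s}: not installed (PackageNotFoundError)")
```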
After running on seeds 42, 43, and 44, I got:

```
{'test_accuracy': 0.4831981460023175, 'test_f1': 0.35174418604651164, 'test_combined_score': 0.4174711660244146}
{'test_accuracy': 0.7219003476245655, 'test_f1': 0.8070739549839228, 'test_combined_score': 0.7644871513042442}
{'test_accuracy': 0.4264194669756663, 'test_f1': 0.19249592169657423, 'test_combined_score': 0.30945769433612025}
```
I tried running train.py in examples with the same arguments and got more stable scores:

```
{'test_accuracy': 0.8644264194669756, 'test_f1': 0.8997429305912596, 'test_combined_score': 0.8820846750291176}
{'test_accuracy': 0.8586326767091541, 'test_f1': 0.8993399339933994, 'test_combined_score': 0.8789863053512768}
{'test_accuracy': 0.8679026651216686, 'test_f1': 0.9020618556701031, 'test_combined_score': 0.8849822603958859}
{'test_accuracy': 0.828505214368482, 'test_f1': 0.8735042735042735, 'test_combined_score': 0.8510047439363777}
```
Although this is still below the ~89 accuracy reported in the paper.
@NitayGitHub Interesting. It is hard for us to dig into the details at this point (though I would still encourage you to look for any changes in our codebase that could cause this discrepancy; I will also try to do this). That said, based on Table 14 in our appendix, MRPC has a large variance as reported: 89.2 (2.62). I would suggest reporting the average along with the standard deviation over your runs, and checking whether the two means are within two standard deviations of each other.
I also want to point out that GLUE tasks are pretty small, especially MRPC.
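As a concrete sketch of that comparison, using the train.py accuracies posted above and the 89.2 (2.62) figure from Table 14 (NumPy only; adapt as needed):

```python
# Compare observed MRPC accuracy (mean +/- std over seeds) against the
# paper's reported 89.2 (2.62) from Table 14.
import numpy as np

# Accuracies (converted to percentages) from the four train.py runs above.
observed = np.array([0.8644264194669756, 0.8586326767091541,
                     0.8679026651216686, 0.828505214368482]) * 100

paper_mean, paper_std = 89.2, 2.62

mean, std = observed.mean(), observed.std(ddof=1)
print(f"observed: {mean:.2f} ({std:.2f}) over {len(observed)} seeds")
print(f"paper   : {paper_mean:.2f} ({paper_std:.2f})")

# Rough "within 2 standard deviations" check on the gap between the means.
gap = abs(mean - paper_mean)
print(f"gap = {gap:.2f} points vs. 2 * paper std = {2 * paper_std:.2f}")
print("within 2 sd" if gap <= 2 * paper_std else "more than 2 sd away")
```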