
Error when trying to train on msmarco-passage.

Open Zaker237 opened this issue 2 years ago • 1 comments

Hi, I'm currently trying to train a retriever using tevatron.driver.train, but I'm getting an error and I don't know how to solve it. Here is the traceback of the error:

Traceback (most recent call last):
  File "/home/mboutchouang/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/mboutchouang/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/mboutchouang/tevatron/tevatron/src/tevatron/driver/train.py", line 104, in <module>
    main()
  File "/dstore/home/mboutchouang/tevatron/tevatron/src/tevatron/driver/train.py", line 85, in main
    trainer = trainer_cls(
  File "/dstore/home/mboutchouang/tevatron/tevatron/src/tevatron/trainer.py", line 105, in __init__
    scaler=self.scaler
AttributeError: 'GCTrainer' object has no attribute 'scaler'

Does anyone have an idea?

Zaker237 avatar Mar 28 '22 13:03 Zaker237

Hi @Zaker237, this seems to be a bug when training with GCTrainer in fp32. A quick fix on your end is to add the --fp16 option on the command line. I'll create a PR to fix it for fp32.

MXueguang avatar Mar 28 '22 16:03 MXueguang
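
For anyone hitting the same AttributeError, here is a minimal sketch of the failure mode and one defensive pattern. It is not the actual tevatron fix. `BaseTrainer` and `SafeGCTrainer` are hypothetical stand-ins, and the assumption is that the parent trainer only sets `self.scaler` when fp16 is enabled, which is what the traceback suggests.

```python
class BaseTrainer:
    """Hypothetical stand-in for a parent trainer that only
    creates a gradient scaler when fp16 is enabled."""

    def __init__(self, fp16: bool = False):
        self.fp16 = fp16
        if fp16:
            # Placeholder object; a real trainer would create an AMP scaler here.
            self.scaler = object()


class SafeGCTrainer(BaseTrainer):
    """Hypothetical subclass that tolerates a missing scaler."""

    def __init__(self, fp16: bool = False):
        super().__init__(fp16=fp16)
        # Reading self.scaler directly raises AttributeError in fp32.
        # getattr with a default falls back to None instead.
        self.gc_scaler = getattr(self, "scaler", None)


fp32_trainer = SafeGCTrainer(fp16=False)
fp16_trainer = SafeGCTrainer(fp16=True)
print(fp32_trainer.gc_scaler is None)
print(fp16_trainer.gc_scaler is not None)
```

With this pattern the fp32 path gets `None` rather than crashing, so the code that consumes the scaler has to handle the `None` case as well.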

Hi,

It seems that the latest version of HF's trainer only creates a scaler when sharded_ddp is enabled: https://github.com/huggingface/transformers/blob/fa6107c97edf7cf725305a34735a57875b67d85e/src/transformers/trainer.py#L637

Does this affect the tevatron code? Thanks.

kwang2049 avatar Sep 06 '23 22:09 kwang2049

I think HF's trainer now uses accelerate's scaler: https://github.com/huggingface/transformers/issues/25021#issuecomment-1647349987

kwang2049 avatar Sep 11 '23 15:09 kwang2049