
QLoRA fine-tuning of Mixtral 8x7B fails with a DDP RuntimeError

Open dachengai opened this issue 1 year ago • 5 comments

Traceback (most recent call last):
  File "./train_qlora.py", line 235, in <module>
    main()
  File "./train_qlora.py", line 224, in main
    train_result = trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "trainer.py", line 73, in compute_loss
    loss = super().compute_loss(model, inputs, return_outputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1513, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1401, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

dachengai avatar Jan 05 '24 13:01 dachengai
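For context (an interpretation of the error, not a statement from the Firefly maintainers): Mixtral is a mixture-of-experts model whose router sends each token to only a subset of experts, so in a given step some expert parameters may receive no gradient at all. That is exactly the "parameters that were not used in producing loss" condition DDP complains about. A minimal single-process sketch of the underlying situation, with hypothetical module names:

```python
import torch
import torch.nn as nn

class TwoExperts(nn.Module):
    """Toy stand-in for an MoE layer where the router skips an expert."""
    def __init__(self):
        super().__init__()
        self.expert_a = nn.Linear(4, 4)
        self.expert_b = nn.Linear(4, 4)

    def forward(self, x):
        # expert_b is never invoked, so its parameters get no gradient.
        return self.expert_a(x)

m = TwoExperts()
loss = m(torch.randn(2, 4)).sum()
loss.backward()
print(m.expert_a.weight.grad is not None)  # True: used in producing loss
print(m.expert_b.weight.grad is None)      # True: an "unused parameter"
```

When such a module is wrapped in DistributedDataParallel with the default find_unused_parameters=False, DDP waits for a gradient for expert_b that never arrives, and the next iteration raises the "Expected to have finished reduction" error above.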

Hello, may I ask if you have solved this problem?

yz-Python avatar Jan 09 '24 08:01 yz-Python

+1, also having this problem

chuxing avatar Feb 23 '24 03:02 chuxing

+1, also having this problem

deeplearningSprint avatar Mar 21 '24 08:03 deeplearningSprint

In train.py, set training_args.ddp_find_unused_parameters = True and it works.

deeplearningSprint avatar Mar 21 '24 09:03 deeplearningSprint
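For reference, the suggested flag maps directly onto torch's DDP wrapper: Hugging Face's Trainer forwards TrainingArguments.ddp_find_unused_parameters to DistributedDataParallel(find_unused_parameters=...). A minimal raw-PyTorch sketch of what that setting does, using a single-process "gloo" group and a placeholder model so it runs without a launcher (the model and port are illustrative, not Firefly's):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so this sketch runs without torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 4)  # placeholder for the QLoRA-wrapped model
# find_unused_parameters=True makes DDP traverse the autograd graph each
# step and mark parameters that received no gradient as ready, instead of
# raising the "Expected to have finished reduction" error.
ddp_model = DDP(model, find_unused_parameters=True)

ddp_model(torch.randn(2, 4)).sum().backward()
dist.destroy_process_group()
```

The trade-off is extra per-step overhead for the graph traversal, which is why DDP's default is False.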

In train.py, set training_args.ddp_find_unused_parameters = True and it works.

After setting training_args.ddp_find_unused_parameters = True, I run into a different error: RuntimeError: Expected to mark a variable ready only once. This happens on both a single GPU and multiple GPUs.

It's said that setting ddp_find_unused_parameters=False fixes that, which is the opposite of the suggestion above. It seems like a bug. Can anyone solve it?

popoala avatar Apr 09 '24 12:04 popoala