Clara Pohland
btw, I also tested it on multiple GPUs now. It is running and both GPUs show usage, but the distribution does not seem to be effective, as there is no...
@kashif @lewtun: is there anything more I should validate, or can you approve this for now? I am looking further into the multi-GPU case atm, but would prefer to...
@kashif for me https://github.com/huggingface/trl/pull/1476 has a few issues (see comments there). Apart from this:
- removing the interleaving of datasets works for me and is a useful change
- using...
@kashif for the loss, `accelerate` handles that for us, yes. But this is just for the rewards/logp that are stored as metrics to be logged. In my understanding, that's why...
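For context, a minimal standalone sketch (not the trainer's actual code) of how per-device reward/logp tensors could be gathered with `accelerate` before logging, so the logged mean reflects every GPU rather than only the main process; `rewards_chosen` is a hypothetical stand-in for the metrics mentioned above:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# hypothetical per-device metric tensor, e.g. the chosen rewards of this rank's batch
rewards_chosen = torch.randn(4, device=accelerator.device)

# gather the values from all processes so the logged mean covers every GPU,
# not only the main process
all_rewards = accelerator.gather_for_metrics(rewards_chosen)

if accelerator.is_main_process:
    print("rewards/chosen:", all_rewards.mean().item())
```

Run with something like `accelerate launch --num_processes 2 script.py` to see the gathered values on the main process.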
@kashif, I applied the suggested changes from the code style pipeline. Should be green again now
> @johncordeiro could you try the version here and see if you're still experiencing hanging? https://github.com/kawine/trl if so, more context would be helpful
>
> @claralp thank you for all...
@johncordeiro do you have `prediction_loss_only` enabled in your evaluation step? The logits were not propagated, so this part was not working in the previous version. Another reason could be an...
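As a quick check (hedged, since the exact config in use isn't shown here), `prediction_loss_only` is a standard `TrainingArguments` flag, so something like the following keeps the logits available during evaluation:

```python
from transformers import TrainingArguments

# `output_dir` is a placeholder; the relevant part is keeping
# prediction_loss_only=False so evaluation returns logits, not only the loss
args = TrainingArguments(
    output_dir="kto-output",
    prediction_loss_only=False,
)
```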
@kawine it uses `.tolist()` for all metrics, which moves them to CPU. The `log_metrics` of the HF Trainer is not used by the KTO Trainer; it uses its own subclassed `log` function.
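To illustrate that pattern (a simplified sketch, not the trainer's exact implementation): tensor metrics are detached to CPU via `.tolist()` when stored, and a custom `log` averages them instead of going through `Trainer.log_metrics`:

```python
import torch

class StoredMetricsLogger:
    """Simplified sketch of the store-then-log pattern described above."""

    def __init__(self):
        self._stored_metrics = {}

    def store_metrics(self, metrics):
        for key, value in metrics.items():
            # .tolist() moves the tensor values to CPU and detaches them from the graph
            values = value.detach().cpu().flatten().tolist() if torch.is_tensor(value) else [value]
            self._stored_metrics.setdefault(key, []).extend(values)

    def log(self):
        # average everything accumulated since the last log call
        logs = {key: sum(vals) / len(vals) for key, vals in self._stored_metrics.items()}
        self._stored_metrics.clear()
        return logs
```

The sketch only mirrors the subclassed `log` mentioned above; the real trainer plugs this into its logging callback chain.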
> @claralp if i add the line `metrics[f"device"] = torch.Tensor([float(str(self.args.device)[-1])]).cpu()` to `get_batch_loss_metrics`, i can see in wandb that the value is always 0 (i.e., the main process), suggesting that only...
> still getting the same thing with `metrics[f"device"] = torch.Tensor([float(str(self.accelerator.process_index)[-1])]).cpu()` and the latest version of accelerate (0.28.0)

@kawine, I just tested this with `per_device_batch_size=2` on 2 GPUs. When printing the...
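For reference, a minimal standalone reproduction of that debug metric (script name and launch command are just examples), useful to check whether both ranks actually report:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# same idea as the quoted debug line: record which process produced the metrics
metrics = {"device": torch.tensor([float(accelerator.process_index)])}

# print on every rank; with two GPUs both rank 0 and rank 1 should show up,
# e.g. run via: accelerate launch --num_processes 2 debug_device_metric.py
print(f"rank {accelerator.process_index}: device metric = {metrics['device'].tolist()}")
```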