Taiwei Shi
Agree with @srhthu. I think left padding makes more sense, but [train.py](https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py) uses right padding instead. I suspect the code they used to train Alpaca is simply incorrect...
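To make the padding-side point concrete, here is a minimal sketch of why left padding matters for batched generation with a decoder-only model (the token ids and pad id below are made up for illustration):

```python
PAD = 0  # hypothetical pad token id

def pad_batch(seqs, side="left"):
    """Pad a batch of token-id lists to equal length on the given side."""
    max_len = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        padding = [PAD] * (max_len - len(s))
        out.append(padding + s if side == "left" else s + padding)
    return out

batch = [[5, 6, 7], [8, 9]]
left = pad_batch(batch, side="left")    # [[5, 6, 7], [0, 8, 9]]
right = pad_batch(batch, side="right")  # [[5, 6, 7], [8, 9, 0]]

# With left padding, the last position of every row is a real token,
# so generation can append the next token directly after it.
assert all(row[-1] != PAD for row in left)
# With right padding, the shorter row ends in PAD, so naively sampling
# the "next" token would continue from a pad position.
assert right[1][-1] == PAD
```

(For training with teacher forcing, right padding can be fine as long as pad positions are masked out of the loss; the padding side mainly bites at generation time.)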
The code to plot the pie chart is [here](https://github.com/yizhongw/self-instruct)
When I chat with Phi-3-Small, the model often fails to predict the stop token. Perhaps the chat template for Phi-3-Small is wrong? A similar issue can be found here: #4712
Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide rewards, and has been shown to be more accurate. https://arxiv.org/abs/2410.12832
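One common way a generative reward model turns next-token prediction into a scalar reward (a sketch of the general technique, not necessarily the exact recipe in the paper above): prompt the model with a verification question after its CoT, then read the relative probability of a "Yes" vs. "No" verdict token. The token names and logits here are hypothetical:

```python
import math

def softmax(logits):
    """Softmax over a {token: logit} map, stabilized by max-subtraction."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def genrm_reward(next_token_logits, yes_token="Yes", no_token="No"):
    """Reward = P(yes) / (P(yes) + P(no)) at the verdict position.
    `next_token_logits` is a hypothetical {token: logit} map produced by the
    LM after it has generated its chain of thought."""
    probs = softmax(next_token_logits)
    p_yes = probs.get(yes_token, 0.0)
    p_no = probs.get(no_token, 0.0)
    return p_yes / (p_yes + p_no)

# Hypothetical logits at the verdict position: the model leans "Yes".
reward = genrm_reward({"Yes": 2.0, "No": 0.5, "Maybe": -1.0})
assert 0.5 < reward < 1.0
```

The normalization over just the yes/no pair keeps the reward in [0, 1] even when probability mass leaks to other tokens.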
> > Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832
>
> Are there any...
Disabling torch.compile is useful here: torch.compile can also hang PPO training when use_remove_padding is enabled. #387
After some debugging, I found that enabling use_remove_padding for the critic does not hang the training; enabling it for the actor does. It hangs at [this line](https://github.com/volcengine/verl/blob/99fb2dde7715da1b37f6137e95daee6890dd7866/verl/workers/actor/dp_actor.py#L103).
After even more debugging, I found that if we change [```self.compute_entropy_from_logits = torch.compile(verl_F.entropy_from_logits, dynamic=True)```](https://github.com/volcengine/verl/blob/99fb2dde7715da1b37f6137e95daee6890dd7866/verl/workers/actor/dp_actor.py#L56) to ```self.compute_entropy_from_logits = verl_F.entropy_from_logits```, the program runs with no issues. I also tried setting ```dynamic=False```...
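For reference, the quantity being compiled here is just the entropy of the next-token distribution, H = logsumexp(logits) - Σ p_i · logit_i with p = softmax(logits). A plain-Python sketch of that identity (not verl's actual tensor implementation):

```python
import math

def entropy_from_logits(logits):
    """Entropy of the categorical distribution defined by `logits`.
    Uses H = logsumexp(logits) - sum(p_i * logit_i), which follows from
    log p_i = logit_i - logsumexp(logits). Plain-Python sketch; verl's
    version does the same math on torch tensors (and is what gets
    wrapped in torch.compile)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    probs = [math.exp(x - lse) for x in logits]
    return lse - sum(p * x for p, x in zip(probs, logits))

# Uniform logits over k outcomes give entropy ln(k).
assert abs(entropy_from_logits([0.0, 0.0]) - math.log(2)) < 1e-9
```

Since the op is this simple, running it eagerly instead of compiled costs little, which is presumably why skipping torch.compile is an acceptable workaround.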
We can now disable torch.compile by setting a flag in the config file. #554
This issue might also be the cause of #3881