RL4LMs
Top-K and Top-p sampling
Hi, thanks for your great work!
I have a question about the sampling process. When both top-k and top-p are enabled (e.g., https://github.com/allenai/RL4LMs/blob/main/scripts/training/task_configs/common_gen/t5_nlpo.yml#L44-L51), isn't top-p just ignored, because only the K most likely next tokens are kept and the probability mass is redistributed among those K tokens? Please correct me if my understanding is wrong. Thank you!
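For reference, in standard combined sampling (e.g., Hugging Face's logits warpers) the two filters compose rather than one ignoring the other: top-k keeps the K highest-probability tokens, and top-p can then further restrict that set if their cumulative mass already exceeds p. A minimal NumPy sketch of that ordering (the function name is made up for illustration):

```python
import numpy as np

def filter_top_k_top_p(logits, top_k=0, top_p=1.0):
    """Apply top-k filtering first, then top-p on the survivors,
    mirroring the order used by common implementations."""
    logits = np.array(logits, dtype=float)
    if top_k > 0:
        kth = np.sort(logits)[-top_k]        # K-th largest logit
        logits[logits < kth] = -np.inf       # drop everything below it
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]     # descending by logit
        probs = np.exp(logits[order] - logits[order].max())
        probs /= probs.sum()
        cum = np.cumsum(probs)
        remove = cum > top_p
        remove[1:] = remove[:-1].copy()      # keep the token that crosses p
        remove[0] = False                    # always keep the top token
        logits[order[remove]] = -np.inf
    return logits
```

With a peaked distribution, top-p does prune tokens that survived top-k, so it is not a no-op in general — which is part of why the NLPO config enables both.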
This top-p mask is quite different from typical top-p sampling; it is particular to the NLPO algorithm. Before sampling, we generate a top-p mask from the mask policy (a copy of the policy from previous epochs). Depending on the generation kwargs, top-k is applied on top of this. For details, please refer to our paper.
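A rough sketch of the masking idea described above, under my own assumptions (function and variable names are hypothetical; see the NLPO paper and the RL4LMs source for the actual algorithm): the top-p set is computed from the *mask policy's* distribution, not the current policy's, and top-k is then applied to the masked logits.

```python
import numpy as np

def nlpo_masked_logits(policy_logits, mask_policy_logits, top_p=0.9, top_k=0):
    """Sketch: build a top-p mask from the mask policy (a frozen copy of
    the policy from earlier epochs), apply it to the current policy's
    logits, then optionally apply top-k on top."""
    order = np.argsort(mask_policy_logits)[::-1]        # descending
    probs = np.exp(mask_policy_logits[order] - mask_policy_logits[order].max())
    probs /= probs.sum()
    cum = np.cumsum(probs)
    remove = cum > top_p
    remove[1:] = remove[:-1].copy()                     # keep the crossing token
    remove[0] = False
    allowed = order[~remove]                            # token ids in the mask

    # Restrict the CURRENT policy to the mask policy's top-p set.
    masked = np.full(policy_logits.shape, -np.inf)
    masked[allowed] = policy_logits[allowed]

    # Optional top-k applied on top of the top-p mask.
    if top_k > 0:
        kth = np.sort(masked)[-top_k]
        masked[masked < kth] = -np.inf
    return masked
```

The key point is that top-p here defines an action-space restriction from a separate (older) policy, so it plays a different role than the usual nucleus-sampling warper, and top-k composes with it rather than superseding it.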