Grounding_LLMs_with_online_RL

Ask a question

Curious-L opened this issue 11 months ago • 3 comments

Excuse me, I've recently been studying your team's excellent work and I have a question. My understanding of the whole process is that it first goes through step (a), then step (b), which generates the prompt; the LLM in part (c) then outputs results (such as which action to take next) from that prompt. In the final step (d), the system calls the PPO algorithm separately for policy generation and compares it with the LLM's output. So I think what PPO is fine-tuning is actually the output of the LLM, but the description in the paper seems to indicate that the LLM itself is fine-tuned with PPO. This is where I'm unsure. Would you mind clarifying this for me? Thank you very much!

Curious-L · Mar 19 '24 07:03

Hey, the figure indeed summarizes the process from a high-level perspective, but the details in the paper describe what actually happens: we do fine-tune the whole LLM. To be precise, given the description returned by the environment and the goal, we construct a prompt (this is hardcoded). We then give this prompt to the LLM and compute the log-probability of each possible action following it. This is the policy (i.e., the LLM is the policy), and we sample actions according to these log-probabilities. After collecting N steps, we compute the PPO loss and fine-tune the whole LLM with it. Let me know if anything is still unclear. I can also point you to pieces of code that may help you understand what is really happening.
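
To make that concrete, here is a minimal sketch of the idea, not the repository's actual code: the model name, prompt template, action list, advantage value, and PPO hyperparameters below are all placeholders.

```python
# Illustrative sketch (NOT the repository's code): using a causal LLM as a
# policy over a fixed action set, then applying a clipped PPO update.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def action_log_probs(prompt: str, actions: list[str]) -> torch.Tensor:
    """Return log p(action tokens | prompt) for each candidate action."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    scores = []
    for action in actions:
        action_ids = tokenizer(" " + action, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, action_ids], dim=1)
        logits = model(input_ids).logits
        # Logits at position t predict token t+1, hence the shift below.
        log_probs = F.log_softmax(logits[0, :-1], dim=-1)
        targets = input_ids[0, 1:]
        start = prompt_ids.shape[1] - 1  # first position predicting an action token
        token_lps = log_probs[start:].gather(1, targets[start:].unsqueeze(1))
        scores.append(token_lps.sum())  # sum log-probs over the action's tokens
    return torch.stack(scores)

# The policy is a categorical distribution obtained by normalizing the
# per-action log-probabilities over the (finite) action set.
prompt = ("Goal: go to the red door.\n"
          "Observation: you see a red door ahead.\n"
          "Action:")
actions = ["go forward", "turn left", "turn right"]
logps = action_log_probs(prompt, actions)
dist = torch.distributions.Categorical(logits=logps)
action_idx = dist.sample()

# After collecting N transitions, a standard clipped PPO surrogate updates
# the whole model. Placeholder tensors stand in for real rollout data.
old_logp = dist.log_prob(action_idx).detach()  # behavior-policy log-prob
new_logp = dist.log_prob(action_idx)           # current-policy log-prob
advantage = torch.tensor(1.0)                  # placeholder advantage estimate
eps = 0.2
ratio = torch.exp(new_logp - old_logp)
policy_loss = -torch.min(ratio * advantage,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
policy_loss.backward()  # gradients flow into every LLM parameter
```

The key point the sketch illustrates: normalizing over the finite action set turns the LLM's token-level log-probabilities into a proper policy, and the PPO loss back-propagates through the entire model, which is why the whole LLM (not just its outputs) gets fine-tuned.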

ClementRomac · Mar 19 '24 08:03

Thanks! Could you kindly tell me about the computational resources required for fine-tuning, including the dataset size, the number of tokens, and the time needed to complete an experiment (over several iterations)? I would also like to know the minimum resources needed to reproduce the experiments at a small scale. Thank you very much!

Curious-L · Mar 19 '24 08:03

Hi,

Details concerning computational resources can be found at the end of Appendix E of our paper: https://arxiv.org/abs/2302.02662.

We did not report the number of tokens, and there is no dataset when using GLAM (i.e., online RL): the data is collected on the fly by interacting with the environment. Hope this helps.

ClementRomac · Mar 29 '24 08:03