sandeep

Results 2 issues of sandeep

Closes #504. This PR adds Direct Preference Optimization (DPO), as introduced in https://arxiv.org/abs/2305.18290. The loss calculation and concatenated forward pass implementations are adapted from the original TRL library.
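For readers unfamiliar with the loss this PR refers to, here is a minimal sketch of the per-example objective from the linked paper, written in plain Python. It is not the code from this PR or from TRL: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions for illustration, and each input is assumed to be the summed log-probability of one completion under the policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (policy_margin - ref_margin))).

    Hypothetical helper for illustration; inputs are scalar sequence
    log-probabilities, and beta is an assumed default.
    """
    # How much more the policy favors the chosen completion over the rejected one.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same margin under the frozen reference model.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# The loss shrinks as the policy prefers the chosen completion more than the reference does.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

When the policy and reference margins are equal, the loss is log 2; a batched implementation would apply the same formula to tensors and average the result.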

**Describe the bug** When using gpt4o as the LLM and scraping a webpage to return a list of links, the returned paths are sometimes: - relative paths (OR) -...
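As an illustration of the relative-path case only, the sketch below shows one way a caller could normalize the returned links against the page URL using the standard library. It is not the project's fix: `absolutize_links` is a hypothetical helper, the example URLs are made up, and dropping non-HTTP(S) links is an assumption about what the caller wants.

```python
from urllib.parse import urljoin, urlparse


def absolutize_links(base_url, links):
    """Resolve each link against base_url and keep only http(s) results.

    Hypothetical post-processing helper; not part of any scraping library.
    """
    resolved = []
    for link in links:
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        absolute = urljoin(base_url, link)
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


links = absolutize_links(
    "https://example.com/docs/",
    ["page.html", "/about", "https://other.org/x", "mailto:someone@example.com"],
)
```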