New code utils
@vwxyzjn do we still need this PR? Not sure if it's additional stuff on top of the code tool or not.
Ok, now that we merged the PR which added async-by-default and configurable verifiers, we're (finally) able to cleanly merge in the code verifier. There's code in setup_ray_node.sh which spins up a load-balanced code-execution server locally on the training machine and then writes the endpoint to an env variable, which the training script reads. This gives us super reliable code execution during training that won't falter regardless of how many training jobs we're running in parallel.
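For anyone skimming, here's a minimal sketch of the "read the endpoint from an env var and hit the local server" flow. The variable name `CODE_EXECUTION_ENDPOINT`, the `/execute` route, and the response shape are all assumptions for illustration, not the actual contract in setup_ray_node.sh:

```python
# Hypothetical sketch of how the training script could consume the endpoint
# written out by setup_ray_node.sh. Env var name, route, and payload/response
# fields are placeholders, not the real interface.
import os
import aiohttp

CODE_ENDPOINT = os.environ["CODE_EXECUTION_ENDPOINT"]  # assumed to be set by setup_ray_node.sh

async def execute_code(snippet: str, tests: list[str], timeout: float = 10.0) -> bool:
    """Send a generated code snippet plus its tests to the local load-balanced server."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{CODE_ENDPOINT}/execute",
            json={"code": snippet, "tests": tests},
            timeout=aiohttp.ClientTimeout(total=timeout),
        ) as resp:
            result = await resp.json()
            # Assumed response field; the real server may report per-test results instead.
            return bool(result.get("passed", False))
```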
Side note: there was support for some weirdness with passing a list of datasets for a single training instance. I did a quick scan of our datasets and didn't see any place where that was the case, so I gutted it to simplify the code. Lemme know if that messes stuff up. It's in apply_verifiable_rewards in model_utils.py.
^^^ EDIT: I was being dumb, I didn't realize it was for applying multiple verifiers per sample. I restored that. Just needed a rework in how I was storing the code data, NBD. Rough sketch of the restored path below.
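This is not the actual signature of apply_verifiable_rewards in model_utils.py, just an illustration of the "one sample, list of verifiers, combine the rewards" shape that the restored code supports. The combination rule here is an assumption:

```python
# Hypothetical sketch of the multi-verifier path; names and the sum() combination
# rule are assumptions, not the real implementation in model_utils.py.
from typing import Callable

Verifier = Callable[[str, dict], float]  # (model_output, ground_truth_info) -> reward

def apply_verifiers(output: str, info: dict, verifiers: list[Verifier]) -> float:
    """Run every verifier attached to a sample and combine their rewards."""
    rewards = [verify(output, info) for verify in verifiers]
    # Could also be max() or an all-must-pass check depending on the task.
    return sum(rewards)
```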