Can we use unsloth to train Reward Models?
More of a question than a bug - will you be working on some examples of using unsloth for training Reward Models (https://huggingface.co/docs/trl/main/en/reward_trainer) as well?
@armsp LoRA and QLoRA for reward models, PPO, DPO, etc. are all supported - i.e. anything TRL does, we can do :) But it just needs to be LoRA / QLoRA
@danielhanchen that's amazing... I was just wondering if there are some docs/examples?
@armsp Sadly I don't - I have a DPO example, but for the rest you'll have to read the TRL docs
If I figure it out myself, maybe I will post it here... meanwhile feel free to close this issue :)
UPDATE:
I got it to work...and there is nothing to it...it just works!!
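For anyone who lands here later, this is roughly the setup I mean - treat it as a sketch, not a verified recipe: the model name, dataset, and hyperparameters are placeholders I picked for illustration, and the exact RewardTrainer / RewardConfig arguments can differ between TRL versions, so check the TRL docs for your version.

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import RewardConfig, RewardTrainer

max_seq_length = 2048

# Load the base model and tokenizer through unsloth (4-bit, QLoRA-style).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b",   # placeholder base model
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)

# Attach LoRA adapters.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Any preference dataset with "chosen" / "rejected" text columns (placeholder name).
dataset = load_dataset("Anthropic/hh-rlhf", split = "train")

# RewardTrainer expects tokenized chosen / rejected pairs.
def tokenize_pairs(examples):
    chosen = tokenizer(examples["chosen"], truncation = True, max_length = max_seq_length)
    rejected = tokenizer(examples["rejected"], truncation = True, max_length = max_seq_length)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pairs, batched = True)

trainer = RewardTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = RewardConfig(
        output_dir = "reward_model",
        per_device_train_batch_size = 2,
        num_train_epochs = 1,
        max_length = max_seq_length,
    ),
)
trainer.train()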
Fantastic!
@danielhanchen I have observed some quirky behavior though - for example, for the reward model we only need the following target modules:

target_modules = ["q_proj", "v_proj"]

but when I remove the other modules, there is an assertion error.
Also, when we initialize the tokenizer, how do we pass arguments for padding and truncation?
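To make the padding/truncation question concrete, here is a hypothetical example of what I mean (placeholder model name) - with a plain HF tokenizer I would pass these per call, and I'm not sure whether that is also the intended way with the tokenizer unsloth returns:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b",   # placeholder
    max_seq_length = 2048,
)

# Passing padding / truncation at call time, the usual Hugging Face way.
batch = tokenizer(
    ["some chosen response", "some rejected response"],
    padding = "max_length",
    truncation = True,
    max_length = 2048,
    return_tensors = "pt",
)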
I think I spoke too soon... it completes the training loop, but when the trainer goes into the evaluation loop it errors if the default parameters have been changed - for example num_labels=1 (it is 2 by default). This leads me to believe that the parameters are somehow not being propagated to the code that lies below unsloth's abstraction.
For example, because of that, the error that comes up is:
File "my_venv/lib64/python3.10/site-packages/trl/trainer/utils.py", line 552, in compute_accuracy
accuracy = np.array(predictions == labels, dtype=float).mean().item()
ValueError: operands could not be broadcast together with shapes (9024002,2) (9024002,)
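For what it's worth, the shape mismatch itself is easy to reproduce in isolation (toy sizes instead of 9024002) - the predictions have two columns, as you would expect from the default num_labels=2, while the labels are one-dimensional:

import numpy as np

predictions = np.zeros((10, 2))   # two logits per example, as with the default num_labels=2
labels = np.zeros(10)             # one label per example

# Same comparison as trl's compute_accuracy; on recent numpy versions this raises the
# ValueError above because (10, 2) and (10,) cannot be broadcast together.
accuracy = np.array(predictions == labels, dtype=float).mean().item()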
@armsp Oh no :( I'll check again and get back to you - sorry on the issue!
That's great if this works out of the box. I'd be keen to try ORPO with it.
Extreme apologies - I've been extremely busy on my end, so apologies again that I didn't have time to look at this :(