Edward Beeching
Thanks @tcapelle. ZeRO-3 shards the optimizer state, gradients, and model weights across GPUs, so you should have more memory available. However, if you are tuning a 7B model you may...
This is so the config is compatible with a larger model, e.g. llama-2-70b. I think that for a 7B model no sharding will take place.
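For reference, here is a minimal sketch of what a ZeRO-3 config looks like when passed directly to the HF `Trainer`, assuming `deepspeed` is installed; the values are illustrative and not the handbook's recipe.

```python
from transformers import TrainingArguments

# Illustrative ZeRO-3 DeepSpeed config: stage 3 shards optimizer state,
# gradients and parameters across GPUs.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="outputs",      # placeholder output path
    bf16=True,
    deepspeed=ds_config,       # the Trainer initializes DeepSpeed from this dict
)
```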
Hi, is the repo you are referring to this one or another one? It wasn't clear from your question which one you meant.
This is probably related to Flash Attention being disabled and the large prompt limit of 4096. Are you using DeepSpeed? Do the Yi models not support flash-attn?
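As a quick check, a minimal sketch of loading a model with FlashAttention-2 enabled (recent `transformers` assumed; the Yi checkpoint name is a placeholder for whichever model you are using):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-6B"  # placeholder checkpoint for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
)
```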
Thanks for all your questions and detailed analysis; there are a number of different things to address here.

### LoRA training
The official `zephyr-7b-beta` model used full training. We provide...
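For anyone who does want to fine-tune with adapters instead of full training, a minimal sketch of a PEFT LoRA config is below; the target modules and hyperparameters are illustrative, not the recipe used for `zephyr-7b-beta`.

```python
from peft import LoraConfig

# Illustrative LoRA adapter config for a causal LM; adjust rank and
# target modules for your model and memory budget.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

A config like this can be passed to TRL's `SFTTrainer` via its `peft_config` argument so only the adapter weights are trained.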
We identified the regression using MT-Bench scores as well. We are rerunning the evals and other experiments internally to try and get to the root cause.
Hi @wlhgtc. In fact, `zephyr-7b-beta` was trained using an internal codebase. `zephyr-7b-dpo-full` was trained using code from this repo, with the same parameters as the internal codebase. This repo contains...
Hi @liutianlin0121 , sorry for the lack of updates. I have been cautiously working through PRs on our internal codebase to identify the root cause. I can confirm that I...
Hi @patchie , thanks for your question. Unfortunately, this sort of question is beyond the scope of the alignment handbook. Perhaps you / your company would be interested in talking...
Hi @daehuikim. We did not consider this use case. Are you unable to push the dataset to the Hub, even as a private dataset? Otherwise you would need to use...
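A minimal sketch of pushing a local dataset to the Hub as a private repo, assuming the data is in local JSONL files (file names and the repo id are placeholders):

```python
from datasets import load_dataset

# Load local JSONL files into a DatasetDict.
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})

# Push to the Hub as a private dataset (requires being logged in via `huggingface-cli login`).
dataset.push_to_hub("your-username/your-dataset", private=True)
```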