
DeepSpeed Configuration JSON

Open ehartford opened this issue 1 year ago • 6 comments

What GPUs did you use to train?

Can you please share your DeepSpeed config JSON?

I can't get finetuning to work with the command from your README.md and the DeepSpeed config JSON in llamax.

I tried 8x 4090 and 4x A100; neither worked.

I will need to use the exact hardware, hyperparameters, and DeepSpeed config file you used.

ehartford avatar Apr 30 '23 07:04 ehartford

Fixed the problem.

Thanks to Rohan, Caseus, and TheBloke for helping with troubleshooting.

Had to remove the following lines from the DeepSpeed config:

    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    },
    

Now, training with my dataset on 8x A6000s will take 12 hours.
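For reference, a minimal sketch of what a ZeRO config looks like once the CPU-offload blocks are removed; the keys and "auto" values below are illustrative of the HuggingFace Trainer integration, not necessarily the exact config used here:

    {
        "bf16": {
            "enabled": "auto"
        },
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": true,
            "contiguous_gradients": true,
            "stage3_gather_16bit_weights_on_model_save": true
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto"
    }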

ehartford avatar Apr 30 '23 11:04 ehartford

Addressed in PR https://github.com/nlpxucan/WizardLM/pull/17

I was successful with 8x A6000s.

I also needed the environment variable NCCL_P2P_DISABLE=1.
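For example, by prefixing the launch command: `NCCL_P2P_DISABLE=1 deepspeed train.py ...` (the script name and remaining arguments here are placeholders, not the exact command used).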

ehartford avatar Apr 30 '23 11:04 ehartford

What is the size of the model you are using with this training code, @ehartford?

Ping-C avatar Sep 17 '23 18:09 Ping-C

This was 5 months ago - that's like 3 years in AI time

ehartford avatar Sep 17 '23 19:09 ehartford

Haha, that is true. Just curious: what size of model could you fit onto 8x A6000s? Was that a 70B model or a 30B one? Just wanted to get a sense of what is feasible with this repo.

Ping-C avatar Sep 17 '23 20:09 Ping-C

I do see some DeepSpeed configuration files in the repository now, such as: https://github.com/search?q=repo%3Anlpxucan%2FWizardLM%20deepspeed&type=code. I also saw you had a PR, but it was closed. Would you say this issue can be closed?

haiatn avatar Sep 24 '23 19:09 haiatn