Wing Lian
@mariokostelac can you try this branch pls? https://github.com/OpenAccess-AI-Collective/axolotl/compare/deepspeed-low-cpu-mem?expand=1
One thing to keep in mind: it's harder to compare that implementation with the HF trainer implementation, because this example is a more raw, lower-level implementation: https://github.com/huggingface/accelerate/blob/31fd2b1ad6b9c1cd1480568399a311b3caaf62dc/examples/by_feature/deepspeed_with_config_support.py#L18
I think you're misunderstanding how DeepSpeed ZeRO-3 works. It doesn't simply decrease the VRAM requirements per GPU when you add more GPUs; it shards the parameters, gradients, and optimizer states across the GPUs. Did you try loading a larger...
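For reference, a minimal ZeRO-3 config sketch looks something like the following (values are illustrative; the `auto` entries defer to the HF trainer's settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```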
You may need to set the following in the deepspeed json as well (under `zero_optimization`):

```json
"offload_optimizer": {
  "device": "cpu",
  "pin_memory": true
},
"offload_param": {
  "device": "cpu",
  "pin_memory": true
},
```
Here's a comparison of a full finetune of Mixtral 8x7B on 8x A6000s with the first 24 layers frozen. The first screenshot is without the offload optimizer and param...
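If you want to reproduce the frozen-layer setup, one way (a sketch, not necessarily the exact config used here) is axolotl's `unfrozen_parameters` option, which freezes everything except parameters whose names match the listed regexes. The patterns below assume a 32-layer model and are illustrative:

```yaml
# illustrative sketch: freeze layers 0-23, train layers 24-31 and the lm_head
unfrozen_parameters:
  - model.layers.2[4-9]\.
  - model.layers.3[01]\.
  - lm_head
```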
@mariokostelac There is an upstream fix in transformers that addresses this for DeepSpeed ZeRO-3 now (part of the qlora+FSDP fixes that went out a couple of weeks ago)
You need to set one of these to `true`:

```yaml
load_in_8bit: false
load_in_4bit: false
```

and then set `adapter:` to lora or qlora depending on the one you set...
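For example, a QLoRA setup would look like this (a minimal sketch; merge into your existing config):

```yaml
# 4-bit quantization pairs with the qlora adapter;
# use load_in_8bit: true with adapter: lora for the 8-bit variant
load_in_8bit: false
load_in_4bit: true
adapter: qlora
```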
RuntimeError: PytorchStreamReader failed reading file data/0: invalid header or archive is corrupted
Is the issue that the merge doesn't work? Or that specifying `save_safetensors` produces a `pytorch_model.bin`? Or both?
Resuming from a "peft checkpoint" is not the same as resuming from a regular checkpoint. You'll want to set `lora_model_dir` to point to the checkpoint directory iirc. @NanoCode012 does that...
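Something like this in your config (the checkpoint path here is hypothetical):

```yaml
# hypothetical path: point at the saved adapter/checkpoint directory
lora_model_dir: ./outputs/run-1/checkpoint-1000
```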
Could you provide some details about what this collator does differently?