Wing Lian
@mariokostelac can you try this branch pls? https://github.com/OpenAccess-AI-Collective/axolotl/compare/deepspeed-low-cpu-mem?expand=1
One thing to keep in mind: it's harder to compare that implementation with the HF trainer implementation, because this example is a more raw, lower-level implementation: https://github.com/huggingface/accelerate/blob/31fd2b1ad6b9c1cd1480568399a311b3caaf62dc/examples/by_feature/deepspeed_with_config_support.py#L18
I think you're misunderstanding how DeepSpeed ZeRO-3 works. It doesn't simply decrease the VRAM requirements per GPU when you add more GPUs; it shards the parameters, gradients, and optimizer states across the GPUs. Did you try loading a larger...
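For reference, a minimal ZeRO-3 config sketch looks something like the following (values are illustrative; the `auto` entries defer to the HF trainer's settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```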
You may need to set the following in the deepspeed json as well (under `zero_optimization`):

```json
"offload_optimizer": {
  "device": "cpu",
  "pin_memory": true
},
"offload_param": {
  "device": "cpu",
  "pin_memory": true
},
```
Here's a comparison of a full finetune of Mixtral 8x7B on 8x A6000s with the first 24 layers frozen. The first screenshot is without the offload optimizer and param...
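If you want to reproduce the frozen-layer setup, one way (a sketch, not necessarily the exact config used here) is axolotl's `unfrozen_parameters` option, which freezes everything except parameters whose names match the listed regexes. The patterns below assume a 32-layer model and are illustrative:

```yaml
# illustrative sketch: freeze layers 0-23, train layers 24-31 and the lm_head
unfrozen_parameters:
  - model.layers.2[4-9]\.
  - model.layers.3[01]\.
  - lm_head
```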
@mariokostelac There is an upstream fix in transformers that addresses this for DeepSpeed ZeRO-3 now (part of the qlora+FSDP fixes that went out a couple of weeks ago)
You need to set one of these to `true`:

```yaml
load_in_8bit: false
load_in_4bit: false
```

and then set `adapter:` to lora or qlora depending on the one you set...
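For example, a QLoRA setup would look like this (a minimal sketch; merge into your existing config):

```yaml
# 4-bit quantization pairs with the qlora adapter;
# use load_in_8bit: true with adapter: lora for the 8-bit variant
load_in_8bit: false
load_in_4bit: true
adapter: qlora
```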
RuntimeError: PytorchStreamReader failed reading file data/0: invalid header or archive is corrupted
Is the issue that the merge doesn't work? Or that specifying `save_safetensors` produces a `pytorch_model.bin`? Or both?
Resuming from a "peft checkpoint" is not the same as resuming from a regular checkpoint. You'll want to set `lora_model_dir` to point to the checkpoint directory iirc. @NanoCode012 does that...
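Something like this in your config (the checkpoint path here is hypothetical):

```yaml
# hypothetical path: point at the saved adapter/checkpoint directory
lora_model_dir: ./outputs/run-1/checkpoint-1000
```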
Could you provide some details about what this collator does differently?