Dtype & HF Push Changes
This PR:
- Removes the cast to float32 in the case of LoRA + float16, so the base weights stay in float16 (a sketch of the resulting behavior follows this list)
- Adds a new mode for pushing to HF that first loads the model on CPU and then shards it across GPUs before pushing (see the second sketch below). This is probably redundant, as it should behave the same as the pure CPU path, but it does not hurt to add. It may also be that the original device is stored as a flag in the weights, which can cause downstream issues; this mode would resolve that. Ideally we would load the weights directly sharded across GPUs so they never touch CPU, but I did not manage to get that working with LoRA merging etc.
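
For the dtype change, here is a minimal sketch of what the behavior looks like after this PR, assuming a standard transformers + PEFT setup; the model id, LoRA hyperparameters, and target modules are illustrative placeholders, not the exact code in LLM Studio:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the backbone directly in float16; no blanket cast of the
# base weights to float32 anymore.
backbone = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-4096-llama2-7b",  # placeholder model id
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(backbone, lora_config)
# Base weights remain float16; only the small set of trainable LoRA
# parameters may be kept in higher precision by the training loop.
```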
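
And a minimal sketch of the new push mode, assuming a transformers + accelerate setup; the checkpoint path, memory budget, and repo id are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map

# Step 1: materialize the full (merged) model on CPU.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-checkpoint",  # placeholder path to merged weights
    torch_dtype=torch.float16,
)

# Step 2: shard it across the available GPUs before pushing.
device_map = infer_auto_device_map(
    model, max_memory={0: "20GiB", 1: "20GiB"}  # illustrative GPU budget
)
model = dispatch_model(model, device_map=device_map)

# Step 3: push the sharded model to the Hugging Face Hub.
model.push_to_hub("my-org/my-model")  # placeholder repo id
```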