Why is `offload_folder` needed for inference of large models?
It does not make sense to me. Why does it load the model first and then offload some of its weights into files in a different place? Why not just load the weights as needed from the original files, or at least offer an option to do so?
That's because PyTorch does not let you load an individual weight from a checkpoint: the whole state dict is pickled as a single object, so it has to be deserialized in full before any one tensor can be accessed.
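A minimal stdlib sketch of the limitation described above, using a plain Python dict as a stand-in for a state dict (real PyTorch checkpoints pickle tensors the same way): `pickle` deserializes the entire object graph in one pass, so every weight is materialized in memory even if you only need one.

```python
import io
import pickle

# Stand-in for a model state dict: parameter name -> weight data.
state_dict = {f"layer{i}.weight": [0.0] * 1000 for i in range(10)}

# Serialize it the way torch.save conceptually does (one pickled object).
buf = io.BytesIO()
pickle.dump(state_dict, buf)
buf.seek(0)

# pickle.load returns the whole dict in one shot; there is no API to
# deserialize just "layer3.weight" without materializing the rest.
loaded = pickle.load(buf)
print(len(loaded))  # all 10 entries are now in memory
```

This is why accelerate cannot stream individual weights out of a `.bin` checkpoint and instead re-serializes them into `offload_folder` in a per-tensor layout it can read lazily.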
Sounds like it would make sense to distribute larger models in a raw binary format instead of PyTorch's pickle format. Right now, offload scenarios double the storage requirements because of that issue.
Alternatively, could you make it possible to download the model in PyTorch format and then convert it once to binary, so that accelerate can read directly from `offload_folder`?
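The conversion proposed above can be sketched with the stdlib alone. This is a hypothetical illustration, not accelerate's actual on-disk format: each weight is written to its own raw binary file, so a single tensor can later be read (or memory-mapped) on demand without unpickling the whole checkpoint.

```python
import array
import os
import tempfile

# Hypothetical offload layout: one raw little-endian float64 file per weight.
offload_folder = tempfile.mkdtemp()
weights = {"layer0.weight": [1.0, 2.0], "layer1.weight": [3.0, 4.0, 5.0]}

# One-time conversion: dump each weight to its own file.
for name, values in weights.items():
    with open(os.path.join(offload_folder, name + ".bin"), "wb") as f:
        array.array("d", values).tofile(f)

# Later, load exactly one weight on demand, touching nothing else on disk.
def load_weight(name, count):
    with open(os.path.join(offload_folder, name + ".bin"), "rb") as f:
        a = array.array("d")
        a.fromfile(f, count)
        return list(a)

print(load_weight("layer1.weight", 3))  # [3.0, 4.0, 5.0]
```

Per-tensor files make random access trivial at the cost of many small files; a single indexed container file (header mapping names to byte offsets) achieves the same lazy loading with less filesystem overhead.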
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.