max
max copied to clipboard
[Feature Request] Add persistent cache to `session.load`
What is your request?
I am interested in using Max to speed up boot times in platforms like Relicate and Fly.io, currently session.load seems to compile the model on every call, from the docs:
The first time you load a model, it might take a few minutes to compile it, but this up-front cost will pay dividends in latency savings provided by our next-generation graph compiler.
It would be awesome if you can make this compilation output serializable so you can save it to disk and cache it. For example with `session.load(model, cache='./compiled_model')
This would mean you can create a Dockerfile with the model already compiled so it can boot very fast.
Even better if you make load able to use Direct Memory Access for weights that live on the GPU.
What is your motivation for this change?
Making Max models boot fast would make Max the best way to deploy models in serverless GPU platforms like Replicate and Runpod. I am sure Max could skyrocket its adoption if boot times were good.
Current boot times on these platforms can take minutes.
Any other details?
No response