RAM cache implementation - part II
This PR improves the robustness of the RAM cache implementation. It makes the RAM cache much friendlier to use and avoids users needing to size the cache specifically for their workflow. It also avoids OOMs in more cases, especially flows with multiple large models. There are three key changes:
1. Loosening the executor's pre-emptive cache pin on models, so that cached models late in a workflow can be freed to make space for earlier ones.
2. Pre-emptively freeing space for large models on load.
3. Freeing space on demand during the GPU -> RAM weight offload process.
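To make changes 2 and 3 concrete, here is a minimal, hypothetical sketch of the free-space-on-demand idea. This is not the actual ComfyUI code; `RamModelCache`, `free_bytes_fn`, and the eviction policy are illustrative assumptions only: cached models are dropped oldest-first until the incoming model (plus some headroom) fits in RAM.

```python
# Hypothetical sketch of on-demand RAM cache eviction.
# Names (RamModelCache, free_bytes_fn, size_bytes) are illustrative,
# not the real ComfyUI API.
from collections import OrderedDict


class RamModelCache:
    def __init__(self, free_bytes_fn, headroom_bytes=2 * 1024**3):
        self._entries = OrderedDict()          # key -> (model, size_bytes), oldest first
        self._free_bytes_fn = free_bytes_fn    # reports currently available system RAM
        self._headroom_bytes = headroom_bytes  # safety margin kept free after loading

    def _evict_until(self, needed_bytes):
        # Drop least-recently-used cached models until the requested space fits.
        while self._entries and self._free_bytes_fn() < needed_bytes:
            self._entries.popitem(last=False)

    def put(self, key, model, size_bytes):
        # Pre-emptively make room for a large model instead of OOM-ing.
        self._evict_until(size_bytes + self._headroom_bytes)
        self._entries[key] = (model, size_bytes)

    def get(self, key):
        # Mark as recently used so later workflow steps don't evict it first.
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key][0]
        return None


# Example wiring (psutil is only used here to report available system RAM):
# import psutil
# cache = RamModelCache(free_bytes_fn=lambda: psutil.virtual_memory().available)
```

The same eviction path could be called from the GPU -> RAM offload step (change 3), so an offload that would otherwise exceed available RAM first frees cached models instead of failing.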
Example test conditions:
Linux, RTX 5090, swapoff, 96GB RAM
Workflow: Flux FP16 -> qwen FP16 -> wan 2.2 FP16 (giant-flow.json)
In the screenshot it's executing wan. The RAM trace shows usage dropping from 95% to make space for wan after qwen.
On rerun it still has all the text encodings cached for re-use.
Hey, is this PR in a state where it can be taken off draft + reviewed, or is it still in the oven?
Hey, this is stuck on draft because it conflicts with async offloading, and I will need to do a small rebase and retest. Feel free to review though.
There's a bug when I enable the RAM cache on a simulated 50GB RAM + 24GB VRAM setup.
I run this workflow twice in a row:
It unloads the high noise model on the first workflow run, which is good, but the second time it gets stuck on the first sampler node.
It should be able to unload the low noise model properly on the second run now.
