Re-use intermediate buffers
Is your feature request related to a problem? Please describe.
Currently, WONNX allocates a buffer for each operator output. This output buffer is then read by at least one subsequent operator. After the output has been read by all operators that use it as input, it is no longer needed, but it is not deallocated until the 'Session' is dropped (so it can be re-used in future inferences). These buffers take up GPU memory, and because GPUs do not swap memory as far as I know, they limit the maximum size of model we can run.
(Note, I am on a MacBook M1 Max with 64 GB of memory shared between CPU and GPU, so I have not run into this issue myself yet.)
Describe the solution you'd like
Pre-allocating buffers is desirable to ensure inference is fast. This means we should not deallocate buffers once we're done with them (if we did, we would have to allocate them again at inference time).
As many models are very 'deep', it is very much possible to pre-allocate a smaller number of buffers and re-use these. A simple example graph:
Input -> A -> B -> C -> Output
In the above, we currently allocate buffers for Input and for the outputs of A, B and C. If C's output fits in A's output buffer, we could simply reuse A's output buffer for C's output: once B is done reading A's output, it will never be read again (B itself must still use its own output buffer, because it is reading from A's output buffer while it writes).
A more complicated example:
Input -> A
A -> B -> C
A -> D -> E
C + E -> Output
In this case, the output of 'A' is used by both B and D, and can only be re-used after both B and D have executed.
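As a rough sketch of the lifetime analysis this implies (hypothetical types, not WONNX's internal graph representation): one pass over the topologically sorted operators is enough to find, for every tensor, the last step that reads it, which is the point after which its buffer could be handed back.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified operator description (not WONNX's actual IR).
struct Op {
    name: &'static str,
    inputs: Vec<&'static str>,
    outputs: Vec<&'static str>,
}

/// For each tensor, compute (step at which it is produced, step at which it is last read).
/// Its buffer can be returned to a pool right after the 'last read' step has been sequenced.
fn live_ranges(ops: &[Op]) -> HashMap<&'static str, (usize, usize)> {
    let mut ranges: HashMap<&'static str, (usize, usize)> = HashMap::new();
    for (i, op) in ops.iter().enumerate() {
        for output in &op.outputs {
            ranges.insert(*output, (i, i));
        }
        for input in &op.inputs {
            // Graph inputs are not produced by any operator and are skipped here.
            if let Some(range) = ranges.get_mut(*input) {
                range.1 = i; // extend the live range to the latest reader
            }
        }
    }
    ranges
}

fn main() {
    // The branching example from above, in topological order: A's output is read by both B and D.
    let ops = vec![
        Op { name: "A", inputs: vec!["Input"], outputs: vec!["a"] },
        Op { name: "B", inputs: vec!["a"], outputs: vec!["b"] },
        Op { name: "C", inputs: vec!["b"], outputs: vec!["c"] },
        Op { name: "D", inputs: vec!["a"], outputs: vec!["d"] },
        Op { name: "E", inputs: vec!["d"], outputs: vec!["e"] },
    ];
    for (tensor, (produced, last_read)) in live_ranges(&ops) {
        // Note: model outputs ('c' and 'e' here) must of course never be reclaimed.
        println!(
            "{tensor}: produced at step {produced}, reusable after '{}' (step {last_read})",
            ops[last_read].name
        );
    }
}
```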
This should be fairly easy to implement by maintaining some sort of 'buffer pool' while sequencing the DAG into GPU operations, and calculating the minimum number and sizes of the buffers that need to be allocated. This should include some sort of look-ahead that allocates a bigger buffer if an operator further along the graph needs it (so it can be shared with an 'earlier' operator that requires a smaller buffer).
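A minimal sketch of such a pool, assuming a first-fit strategy and planning abstract buffer indices while the DAG is sequenced (all types and names here are made up for illustration):

```rust
/// A planned (not yet allocated) GPU buffer, identified by its index in the pool.
#[derive(Debug)]
struct PlannedBuffer {
    size: u64, // size in bytes; with look-ahead this would be rounded up to future needs
}

#[derive(Default)]
struct BufferPool {
    buffers: Vec<PlannedBuffer>, // every buffer that will eventually be allocated
    free: Vec<usize>,            // indices of buffers not currently holding a live tensor
}

impl BufferPool {
    /// Reuse a free buffer that is large enough, or plan a new one (first fit).
    fn acquire(&mut self, size: u64) -> usize {
        if let Some(pos) = self.free.iter().position(|&i| self.buffers[i].size >= size) {
            return self.free.swap_remove(pos);
        }
        self.buffers.push(PlannedBuffer { size });
        self.buffers.len() - 1
    }

    /// Called once the last reader of a tensor has been sequenced.
    fn release(&mut self, index: usize) {
        self.free.push(index);
    }
}

fn main() {
    // Sequencing the linear chain Input -> A -> B -> C from the first example.
    let mut pool = BufferPool::default();
    let a = pool.acquire(1024); // output of A
    let b = pool.acquire(2048); // output of B; A's output is still live while B reads it
    pool.release(a);            // after B has been sequenced, A's output is dead
    let c = pool.acquire(512);  // output of C fits in A's old buffer, so it is reused
    pool.release(b);
    assert_eq!(a, c);
    println!("planned {} buffers: {:?}", pool.buffers.len(), pool.buffers);
}
```

The look-ahead mentioned above would live in `acquire`: rather than planning a buffer of exactly `size` bytes, it could plan the largest size that any later tensor assigned to the same buffer will request.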
Describe alternatives you've considered
That would be one of (1) buying a larger GPU, (2) using smaller models only, or (3) implementing some sort of swapping...
(I might be able to implement this later on)
So, to avoid conflicts as you mentioned, I think we just need to implement a graph coloring algorithm, with each color representing a buffer.
I am not a CS grad (you might have noticed ;-)), but this seems like the right thing to do. After coloring, we then allocate each color's buffer with max(...buffer sizes requested for this color...) bytes.
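Purely as an illustration of that idea (hypothetical types, not WONNX code): build an interference graph in which two tensors conflict when their live ranges overlap, greedily give each tensor the lowest color not used by a conflicting tensor, and size each color's buffer as the maximum size requested for that color.

```rust
use std::collections::HashSet;

/// A tensor that needs a buffer: its live range (first..=last step) and its size in bytes.
struct Tensor {
    live: (usize, usize),
    size: u64,
}

fn overlaps(a: (usize, usize), b: (usize, usize)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Returns (color assigned to each tensor, buffer size per color).
fn color_buffers(tensors: &[Tensor]) -> (Vec<usize>, Vec<u64>) {
    let mut colors = vec![usize::MAX; tensors.len()]; // placeholder until assigned
    let mut sizes: Vec<u64> = Vec::new();
    for i in 0..tensors.len() {
        // Colors already taken by earlier tensors whose live range overlaps with tensor i.
        let taken: HashSet<usize> = (0..i)
            .filter(|&j| overlaps(tensors[i].live, tensors[j].live))
            .map(|j| colors[j])
            .collect();
        // Greedily pick the lowest color that does not conflict.
        let color = (0usize..).find(|c| !taken.contains(c)).unwrap();
        colors[i] = color;
        if color == sizes.len() {
            sizes.push(0);
        }
        sizes[color] = sizes[color].max(tensors[i].size); // max(...sizes for this color...)
    }
    (colors, sizes)
}

fn main() {
    // The branching example: a lives over steps 0..=3, b 1..=2, d 3..=4;
    // c and e are model outputs and stay alive until the last step.
    let tensors = vec![
        Tensor { live: (0, 3), size: 1024 }, // a
        Tensor { live: (1, 2), size: 2048 }, // b
        Tensor { live: (2, 4), size: 512 },  // c
        Tensor { live: (3, 4), size: 2048 }, // d
        Tensor { live: (4, 4), size: 256 },  // e
    ];
    let (colors, sizes) = color_buffers(&tensors);
    println!("colors: {colors:?}");
    println!("one buffer per color, sized {sizes:?} bytes");
}
```

On this example, the greedy pass ends up with three buffers (colors) for five intermediate tensors.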
No worries. Actually, coloring does not work straight out of the box. But yeah, something along those lines.
Closed by #143