onnxruntime_backend
Share weights when running multiple instances
This also applies to CPU.
any progress?
+1 for this. Did some benchmarking on this today.
This is with 1 instance each of 3 ONNX models:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 74% 64C P2 235W / 350W | 4416MiB / 24576MiB | 40% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2054 G /usr/lib/xorg/Xorg 8MiB |
| 0 N/A N/A 2270 G /usr/bin/gnome-shell 6MiB |
| 0 N/A N/A 1498503 C tritonserver 4397MiB |
+-----------------------------------------------------------------------------+
This is with 2 instances each of 3 ONNX models:
Tue Jul 12 15:19:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 73% 65C P2 236W / 350W | 7238MiB / 24576MiB | 50% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2054 G /usr/lib/xorg/Xorg 8MiB |
| 0 N/A N/A 2270 G /usr/bin/gnome-shell 6MiB |
| 0 N/A N/A 1653466 C tritonserver 7219MiB |
+-----------------------------------------------------------------------------+
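The two snapshots above make the replication cost concrete; a quick back-of-the-envelope calculation from the reported `tritonserver` figures:

```python
# Memory figures taken from the two nvidia-smi snapshots above.
one_instance_mib = 4397   # tritonserver, 1 instance each of 3 ONNX models
two_instances_mib = 7219  # tritonserver, 2 instances each of 3 ONNX models

extra = two_instances_mib - one_instance_mib
print(extra)  # MiB added by the second set of instances
# Most of this increase is a near-duplicate copy of the model weights;
# with shared weights, only activations and per-session scratch buffers
# would need to grow.
```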
CC @pranavsharma: does ORT provide an API for doing so? Or can an ORT session be run for different inferences in parallel?
Not fully following. What API are you looking for? I believe Triton already creates a separate session for each instance, and these instances (sessions) can be used to run inferences in parallel. The drawback is that each session has its own copy of the weights, thereby increasing (replicating) the memory consumption. Someone has submitted code changes to share a session between different instances. We're reviewing the changes. This should fix the memory consumption problem.
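The distinction being drawn is between each instance owning its own session (weights duplicated) and all instances borrowing one session (weights stored once, with concurrent `Run` calls on it). A minimal Python sketch of the idea, with stand-in classes rather than real Triton/ORT code:

```python
import threading

class SharedSession:
    """Stand-in for an ORT inference session: owns the (large) weights once."""
    def __init__(self, weights):
        self.weights = weights

    def run(self, x):
        # ORT's Run() is safe to call concurrently on a single session,
        # which is what makes this sharing scheme viable.
        return x * self.weights

class ModelInstance:
    """Stand-in for a Triton model instance that borrows the session
    instead of constructing its own copy."""
    def __init__(self, session):
        self.session = session

session = SharedSession(weights=3)
instances = [ModelInstance(session) for _ in range(2)]

# Both instances reference the same session object; nothing is replicated.
assert instances[0].session is instances[1].session

results = []
threads = [
    threading.Thread(target=lambda inst=inst: results.append(inst.session.run(7)))
    for inst in instances
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [21, 21]
```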
Yes, this is what I was looking for. Sorry for not being clear in my previous question; I was just musing about the different ways I've seen frameworks share one copy of the weights across multiple instances. For example, TRT stores the weights in an "engine" and can create multiple "contexts" that map to the same "engine".
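The TRT pattern described above separates immutable weights (engine) from per-execution mutable state (context). A toy sketch of that split, with hypothetical class names rather than the real TensorRT API:

```python
class Engine:
    """TRT-style engine: immutable weights, stored exactly once."""
    def __init__(self, weights):
        self.weights = tuple(weights)

class Context:
    """TRT-style execution context: shares the engine's weights but keeps
    its own private scratch/activation buffers."""
    def __init__(self, engine):
        self.engine = engine
        self.scratch = [0.0] * len(engine.weights)  # private per context

    def execute(self, inputs):
        for i, (w, x) in enumerate(zip(self.engine.weights, inputs)):
            self.scratch[i] = w * x
        return sum(self.scratch)

engine = Engine([1.0, 2.0, 3.0])
ctx_a, ctx_b = Context(engine), Context(engine)

assert ctx_a.engine is ctx_b.engine        # weights shared
assert ctx_a.scratch is not ctx_b.scratch  # activations private
print(ctx_a.execute([1, 1, 1]))  # 6.0
```

Two contexts can therefore run concurrently without either duplicating the weights or clobbering each other's intermediate buffers.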
@pranavsharma any progress about "Sharing a session between different instances of ONNXRuntime" ?
I should be able to get to it this week.
GOOD! I look forward to hearing from you soon.
@pranavsharma any progress?
Is there any news about sharing GPU memory? Is the PR you mentioned #141, @pranavsharma?
We have to switch models regularly, and sharing memory would be very beneficial.