modelmesh-serving
Documentation about GPU memory
Thank you very much for the incredible project!
First of all, it would be very helpful if you added documentation on how to manage GPU memory when using Triton.
I ran several tests, but I couldn't understand how the following env parameters work: CONTAINER_MEM_REQ_BYTES and MODELSIZE_MULTIPLIER. I read the following explanation: https://github.com/kserve/modelmesh/issues/82#issuecomment-1582028690
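My rough understanding of the math from that comment is something like the sketch below (all numbers are made up for illustration, model_disk_size is just a hypothetical model, and I haven't verified this against the code):

```python
# Rough sketch of my understanding of the sizing math from the linked comment.
# All values are illustrative; model_disk_size is a hypothetical model.

GiB = 1024 ** 3
MiB = 1024 ** 2

container_mem_req_bytes = 12 * GiB  # CONTAINER_MEM_REQ_BYTES: memory reported to model-mesh
mem_buffer_bytes = 128 * MiB        # builtInAdapter.memBufferBytes: reserved overhead
modelsize_multiplier = 2.0          # MODELSIZE_MULTIPLIER

# Capacity that model-mesh believes it can fill with models:
usable_capacity = container_mem_req_bytes - mem_buffer_bytes

# Before a model has actually been loaded, its in-memory size is (as I
# understand it) predicted as its size on disk times the multiplier:
model_disk_size = 500 * MiB
predicted_model_size = model_disk_size * modelsize_multiplier

print(f"usable capacity      : {usable_capacity / GiB:.2f} GiB")              # ~11.88 GiB
print(f"predicted model size : {predicted_model_size / MiB:.0f} MiB")         # 1000 MiB
print(f"models that fit      : {int(usable_capacity // predicted_model_size)}")  # 12
```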
I applied the following configuration for T4:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
          "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
          "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
          "--allow-sagemaker=false" '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver:21.06.1-py3
      livenessProbe:
        exec:
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "9"
            - http://localhost:8000/v2/health/live
        initialDelaySeconds: 5
        periodSeconds: 30
        timeoutSeconds: 10
      name: triton
      env:
        - name: CONTAINER_MEM_REQ_BYTES
          value: "12884901888"
        - name: MODELSIZE_MULTIPLIER
          value: "2"
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          cpu: 500m
          memory: 1Gi
          nvidia.com/gpu: 1
```
However, I am seeing models being unloaded and loaded while GPU memory usage is only 2522MiB / 15109MiB. I don't understand why I can't get higher GPU utilization.
I then saw that I was probably setting the configuration in the wrong place: https://github.com/kserve/modelmesh/issues/46#issuecomment-1192388786
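If that is the case, I assume the adapter never saw my CONTAINER_MEM_REQ_BYTES at all and instead worked with a much smaller value derived from the container's 1Gi memory request, which would roughly match the low usage I observed. A back-of-the-envelope check (again, this fallback is my assumption, not verified against the code):

```python
# Back-of-the-envelope check of what may have happened (my assumption only).
GiB = 1024 ** 3
MiB = 1024 ** 2

mem_buffer_bytes = 128 * MiB

# What I intended model-mesh to see (the 12 GiB I set for the T4):
intended_capacity = 12 * GiB - mem_buffer_bytes

# What it may actually have used if it fell back to the container's
# 1Gi memory request because the env var was set in the wrong place:
fallback_capacity = 1 * GiB - mem_buffer_bytes

print(f"intended capacity : {intended_capacity / MiB:.0f} MiB")  # 12160 MiB
print(f"fallback capacity : {fallback_capacity / MiB:.0f} MiB")  # 896 MiB
```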
I get much better GPU utilization using:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
    env:
      - name: CONTAINER_MEM_REQ_BYTES
        value: "12884901888" # Works for T4
      - name: MODELSIZE_MULTIPLIER
        value: "2"
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
          "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
          "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
          "--allow-sagemaker=false" '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver:21.06.1-py3
      livenessProbe:
        exec:
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "9"
            - http://localhost:8000/v2/health/live
        initialDelaySeconds: 5
        periodSeconds: 30
        timeoutSeconds: 10
      name: triton
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          cpu: 500m
          memory: 1Gi
          nvidia.com/gpu: 1
```
Hi, it has been almost a year since your question, but I only came across it today. First of all, thank you for asking it; it taught me how to do the model sizing. I hope you have solved your problem by now, but if not, then based on this issue you should place your env variables inside builtInAdapter, not inside the containers section. :))