clearml-serving
Could not download model in triton container
Hello!
I'm using the free tier of ClearML (the one without the configuration vault feature) together with the clearml-serving module.
When I spun up docker-compose and tried to pull a model from our S3, I got an error in the tritonserver container:
2024-03-13 11:26:56,913 - clearml.storage - WARNING - Failed getting object size: ClientError('An error occurred (403) when calling the HeadObject operation: Forbidden')
2024-03-13 11:26:57,042 - clearml.storage - ERROR - Could not download s3://<BUCKET>/<FOLDER>/<PROJECT>/<TASK_NAME>.75654091e56141199c9d9594305d6872/models/model_package.zip , err: An error occurred (403) when calling the HeadObject operation: Forbidden
But I had set the environment variables in example.env (including the AWS_* ones), and I could see them inside the tritonserver container via:
$ env | grep CLEARML
$ env | grep AWS
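(The same check can be run from the host with docker exec, using the container name from the compose file below:)
$ docker exec clearml-serving-triton env | grep -E 'CLEARML|AWS'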
FILES
docker-compose-triton-gpu.yaml
version: "3"
services:
  zookeeper:
    image: bitnami/zookeeper:3.7.0
    container_name: clearml-serving-zookeeper
    # ports:
    #   - "2181:2181"
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    networks:
      - clearml-serving-backend

  kafka:
    image: bitnami/kafka:3.1.1
    container_name: clearml-serving-kafka
    # ports:
    #   - "9092:9092"
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_CFG_LISTENERS=PLAINTEXT://clearml-serving-kafka:9092
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://clearml-serving-kafka:9092
      - KAFKA_CFG_ZOOKEEPER_CONNECT=clearml-serving-zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
      - KAFKA_CREATE_TOPICS="topic_test:1:1"
    depends_on:
      - zookeeper
    networks:
      - clearml-serving-backend

  prometheus:
    image: prom/prometheus:v2.34.0
    container_name: clearml-serving-prometheus
    volumes:
      - ./prometheus.yml:/prometheus.yml
    command:
      - '--config.file=/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    # ports:
    #   - "9090:9090"
    depends_on:
      - clearml-serving-statistics
    networks:
      - clearml-serving-backend

  alertmanager:
    image: prom/alertmanager:v0.23.0
    container_name: clearml-serving-alertmanager
    restart: unless-stopped
    # ports:
    #   - "9093:9093"
    depends_on:
      - prometheus
      - grafana
    networks:
      - clearml-serving-backend

  grafana:
    image: grafana/grafana:8.4.4-ubuntu
    container_name: clearml-serving-grafana
    volumes:
      - './datasource.yml:/etc/grafana/provisioning/datasources/datasource.yaml'
    restart: unless-stopped
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    networks:
      - clearml-serving-backend

  clearml-serving-inference:
    image: allegroai/clearml-serving-inference:1.3.1-vllm
    build:
      context: ../
      dockerfile: clearml_serving/serving/Dockerfile
    container_name: clearml-serving-inference
    restart: unless-stopped
    # optimize performance
    security_opt:
      - seccomp:unconfined
    ports:
      - "8080:8080"
    environment:
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-https://app.clear.ml}
      CLEARML_API_HOST: ${CLEARML_API_HOST:-https://api.clear.ml}
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-https://files.clear.ml}
      CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY}
      CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY}
      CLEARML_SERVING_TASK_ID: ${CLEARML_SERVING_TASK_ID:-}
      CLEARML_SERVING_PORT: ${CLEARML_SERVING_PORT:-8080}
      CLEARML_SERVING_POLL_FREQ: ${CLEARML_SERVING_POLL_FREQ:-1.0}
      CLEARML_DEFAULT_BASE_SERVE_URL: ${CLEARML_DEFAULT_BASE_SERVE_URL:-http://127.0.0.1:8080/serve}
      CLEARML_DEFAULT_KAFKA_SERVE_URL: ${CLEARML_DEFAULT_KAFKA_SERVE_URL:-clearml-serving-kafka:9092}
      CLEARML_DEFAULT_TRITON_GRPC_ADDR: ${CLEARML_DEFAULT_TRITON_GRPC_ADDR:-clearml-serving-triton:8001}
      CLEARML_USE_GUNICORN: ${CLEARML_USE_GUNICORN:-}
      CLEARML_SERVING_NUM_PROCESS: ${CLEARML_SERVING_NUM_PROCESS:-}
      CLEARML_EXTRA_PYTHON_PACKAGES: ${CLEARML_EXTRA_PYTHON_PACKAGES:-}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
    depends_on:
      - kafka
      - clearml-serving-triton
    networks:
      - clearml-serving-backend

  clearml-serving-triton:
    image: allegroai/clearml-serving-triton:1.3.1-vllm
    build:
      context: ../
      dockerfile: clearml_serving/engines/triton/Dockerfile.vllm
    container_name: clearml-serving-triton
    restart: unless-stopped
    # optimize performance
    security_opt:
      - seccomp:unconfined
    # ports:
    #   - "8001:8001"
    environment:
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-https://app.clear.ml}
      CLEARML_API_HOST: ${CLEARML_API_HOST:-https://api.clear.ml}
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-https://files.clear.ml}
      CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY}
      CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY}
      CLEARML_SERVING_TASK_ID: ${CLEARML_SERVING_TASK_ID:-}
      CLEARML_TRITON_POLL_FREQ: ${CLEARML_TRITON_POLL_FREQ:-1.0}
      CLEARML_TRITON_METRIC_FREQ: ${CLEARML_TRITON_METRIC_FREQ:-1.0}
      CLEARML_EXTRA_PYTHON_PACKAGES: ${CLEARML_EXTRA_PYTHON_PACKAGES:-}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
    depends_on:
      - kafka
    networks:
      - clearml-serving-backend
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]

  clearml-serving-statistics:
    image: allegroai/clearml-serving-statistics:latest
    container_name: clearml-serving-statistics
    restart: unless-stopped
    # optimize performance
    security_opt:
      - seccomp:unconfined
    # ports:
    #   - "9999:9999"
    environment:
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-https://app.clear.ml}
      CLEARML_API_HOST: ${CLEARML_API_HOST:-https://api.clear.ml}
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-https://files.clear.ml}
      CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY}
      CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY}
      CLEARML_SERVING_TASK_ID: ${CLEARML_SERVING_TASK_ID:-}
      CLEARML_DEFAULT_KAFKA_SERVE_URL: ${CLEARML_DEFAULT_KAFKA_SERVE_URL:-clearml-serving-kafka:9092}
      CLEARML_SERVING_POLL_FREQ: ${CLEARML_SERVING_POLL_FREQ:-1.0}
    depends_on:
      - kafka
    networks:
      - clearml-serving-backend

networks:
  clearml-serving-backend:
    driver: bridge
example.env
CLEARML_WEB_HOST="[REDACTED]"
CLEARML_API_HOST="[REDACTED]"
CLEARML_FILES_HOST="s3://[REDACTED]"
CLEARML_API_ACCESS_KEY="<access_key_here>"
CLEARML_API_SECRET_KEY="<secret_key_here>"
CLEARML_SERVING_TASK_ID="<serving_service_id_here>"
CLEARML_EXTRA_PYTHON_PACKAGES="boto3"
AWS_ACCESS_KEY_ID="[REDACTED]"
AWS_SECRET_ACCESS_KEY="[REDACTED]"
AWS_DEFAULT_REGION="[REDACTED]"
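(For context, I bring the stack up with these two files in the usual way, something along the lines of:)
$ docker-compose --env-file example.env -f docker-compose-triton-gpu.yaml up -d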
Dockerfile.vllm:
FROM nvcr.io/nvidia/tritonserver:24.02-vllm-python-py3
ENV LC_ALL=C.UTF-8
COPY clearml_serving /root/clearml/clearml_serving
COPY requirements.txt /root/clearml/requirements.txt
COPY README.md /root/clearml/README.md
COPY setup.py /root/clearml/setup.py
RUN python3 -m pip install --no-cache-dir -r /root/clearml/clearml_serving/engines/triton/requirements.txt
RUN python3 -m pip install --no-cache-dir -U pip -e /root/clearml/
# default serving port
EXPOSE 8001
# environment variables to load the Task from: CLEARML_SERVING_TASK_ID, CLEARML_SERVING_PORT
WORKDIR /root/clearml/
ENTRYPOINT ["clearml_serving/engines/triton/entrypoint.sh"]
I think this is caused by https://github.com/allegroai/clearml-serving/blob/main/clearml_serving/engines/triton/triton_helper.py#L140: it can't download the model from S3 because clearml.storage.helper.StorageHelper can't configure _Boto3Driver from environment variables alone.
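A quick way to confirm this is to run the same download directly inside the triton container (a sketch; StorageManager is the public ClearML API on top of StorageHelper, and the URL is the placeholder artifact path from the error above):

# sketch: reproduce the model download that triton_helper attempts
from clearml.storage import StorageManager

# same artifact URL as in the error log (placeholders kept)
local_path = StorageManager.get_local_copy(
    remote_url="s3://<BUCKET>/<FOLDER>/<PROJECT>/<TASK_NAME>.75654091e56141199c9d9594305d6872/models/model_package.zip"
)
print(local_path)  # None (or an exception) when the 403 is hit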
WORKAROUND
I added a clearml.conf file with the aws.s3 credentials to the root of the git repository and fixed my Dockerfile.vllm:
FROM nvcr.io/nvidia/tritonserver:24.02-vllm-python-py3
ENV LC_ALL=C.UTF-8
COPY clearml_serving /root/clearml/clearml_serving
COPY requirements.txt /root/clearml/requirements.txt
COPY clearml.conf /root/clearml.conf
COPY README.md /root/clearml/README.md
COPY setup.py /root/clearml/setup.py
RUN python3 -m pip install --no-cache-dir -r /root/clearml/clearml_serving/engines/triton/requirements.txt
RUN python3 -m pip install --no-cache-dir -U pip -e /root/clearml/
# default serving port
EXPOSE 8001
# environment variables to load the Task from: CLEARML_SERVING_TASK_ID, CLEARML_SERVING_PORT
WORKDIR /root/clearml/
ENTRYPOINT ["clearml_serving/engines/triton/entrypoint.sh"]
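The clearml.conf that gets copied in looks roughly like this (a sketch with placeholder values; the aws.s3 keys follow the standard ClearML SDK configuration layout, and only the S3 part is shown — the api credentials still come from the environment variables above):

sdk {
    aws {
        s3 {
            key: "<AWS_ACCESS_KEY_ID>"
            secret: "<AWS_SECRET_ACCESS_KEY>"
            region: "<AWS_DEFAULT_REGION>"
        }
    }
}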
Then I fixed entrypoint.sh:
#!/bin/bash
# print configuration
echo CLEARML_SERVING_TASK_ID="$CLEARML_SERVING_TASK_ID"
echo CLEARML_TRITON_POLL_FREQ="$CLEARML_TRITON_POLL_FREQ"
echo CLEARML_TRITON_METRIC_FREQ="$CLEARML_TRITON_METRIC_FREQ"
echo CLEARML_TRITON_HELPER_ARGS="$CLEARML_TRITON_HELPER_ARGS"
echo CLEARML_EXTRA_PYTHON_PACKAGES="$CLEARML_EXTRA_PYTHON_PACKAGES"
# we should also have clearml-server configurations
if [ ! -z "$CLEARML_EXTRA_PYTHON_PACKAGES" ]
then
python3 -m pip install $CLEARML_EXTRA_PYTHON_PACKAGES
fi
# start service
clearml-init --file /root/clearml.conf && PYTHONPATH=$(pwd) python3 clearml_serving/engines/triton/triton_helper.py $CLEARML_TRITON_HELPER_ARGS $@
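As a possibly cleaner alternative that I have not fully verified, the SDK can be pointed at the config file via the CLEARML_CONFIG_FILE environment variable, which would avoid patching entrypoint.sh at all — e.g. one extra line in Dockerfile.vllm:

ENV CLEARML_CONFIG_FILE=/root/clearml.conf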
To be honest, I don't know why I ran into this issue; I may have done something wrong. In the enterprise version we never hit it, thanks to the configuration vault.