Changes to support TorchServe on CPU & GPU
*What is the PR about?*
This PR integrates TorchServe with this solution.
- Supports CPU & GPU
- Tested with `./test.sh run bmk`
From a UX point of view, the user only needs to set `model_server=torchserve` in `config.properties`; the rest of the flow is the same.
Currently, this is supported for CPU only
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
**CPU Logs**

```
kubectl logs bert-base-workshop-0-7kmgk -n mpi
/app/tests /home/model-server
Configuring number of model servers from config.properties ...
Number of model servers (1) configured from environment ...
Namespace(url='http://bert-base-multilingual-cased-cpu-[INSTANCE_IDX].mpi.svc.cluster.local:8080/predictions/model[MODEL_IDX]', num_thread=2, latency_window_size=1000, throughput_time=180, throughput_interval=10, is_multi_instance=True, n_instance=1, is_multi_model_per_instance=True, n_model_per_instance=1, post=True, verbose=True, cache_dns=True, model_server='torchserve')
caching dns
http://bert-base-multilingual-cased-cpu-0.mpi.svc.cluster.local:8080/predictions/model0
http://10.100.115.203:8080/predictions/model0
<Response [200]>
{'pid': 6, 'throughput': 0.0, 'p50': '0.000', 'p90': '0.000', 'p95': '0.000', 'errors': '0'}
{}
{}
{'pid': 6, 'throughput': 5.7, 'p50': '0.358', 'p90': '0.402', 'p95': '0.414', 'errors': '0'}
{'p90_0_0': '0.402'}
{'num_0_0': 57}
{'pid': 6, 'throughput': 5.9, 'p50': '0.318', 'p90': '0.392', 'p95': '0.412', 'errors': '0'}
{'p90_0_0': '0.392'}
{'num_0_0': 116}
{'pid': 6, 'throughput': 6.4, 'p50': '0.311', 'p90': '0.387', 'p95': '0.409', 'errors': '0'}
{'p90_0_0': '0.387'}
{'num_0_0': 180}
```
**GPU Logs**

```
kubectl logs bert-base-workshop-0-7l2rg -n mpi
/app/tests /home/model-server
Configuring number of model servers from config.properties ...
Number of model servers (1) configured from environment ...
Namespace(url='http://bert-base-multilingual-cased-gpu-[INSTANCE_IDX].mpi.svc.cluster.local:8080/predictions/model[MODEL_IDX]', num_thread=2, latency_window_size=1000, throughput_time=180, throughput_interval=10, is_multi_instance=True, n_instance=1, is_multi_model_per_instance=True, n_model_per_instance=1, post=True, verbose=True, cache_dns=True, model_server='torchserve')
caching dns
http://bert-base-multilingual-cased-gpu-0.mpi.svc.cluster.local:8080/predictions/model0
http://10.100.120.85:8080/predictions/model0
<Response [200]>
{'pid': 6, 'throughput': 0.0, 'p50': '0.000', 'p90': '0.000', 'p95': '0.000', 'errors': '0'}
{}
{}
{'pid': 6, 'throughput': 92.5, 'p50': '0.021', 'p90': '0.025', 'p95': '0.027', 'errors': '0'}
{'p90_0_0': '0.025'}
{'num_0_0': 925}
{'pid': 6, 'throughput': 98.3, 'p50': '0.020', 'p90': '0.022', 'p95': '0.025', 'errors': '0'}
{'p90_0_0': '0.024'}
{'num_0_0': 1908}
{'pid': 6, 'throughput': 100.1, 'p50': '0.020', 'p90': '0.021', 'p95': '0.021', 'errors': '0'}
{'p90_0_0': '0.023'}
{'num_0_0': 2909}
```
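As a sanity check independent of the benchmark harness, the endpoint exercised in these logs can also be hit manually. The sketch below is illustrative only: `sample_input.txt` is a placeholder payload, the hostname is taken from the GPU logs above, and the commands assume they are run from a pod inside the cluster.

```sh
# Resolve the per-instance service hostname that the benchmark derives from the
# URL template (mirrors the "caching dns" step shown in the logs above).
HOST=bert-base-multilingual-cased-gpu-0.mpi.svc.cluster.local
getent hosts "$HOST"

# TorchServe serves inference on port 8080 at /predictions/<model_name>;
# the benchmark POSTs to model0 on each instance.
curl -s -X POST "http://$HOST:8080/predictions/model0" -T sample_input.txt
```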
So far we have successfully verified this PR on CPU only, with both `fastapi` and `torchserve` settings for the new parameter in `config.properties`:

```
# model_server = fastapi|torchserve
model_server=torchserve
```
We are planning to release an update for the related AWS guidance shortly, containing other important changes but not yet including this PR; after that we will focus on merging this PR once additional testing on other architectures (AWS Graviton, GPU, etc.) is done. Thanks
Update 4/17/24: tested this PR using images built for the "torchserve" API server on AWS Graviton and Inferentia 2 based nodes. In both cases there were run-time container errors like the following:
```
containers:
  main:
    Container ID:  containerd://42bfe08ada826553ffe57ba56dd93627d71ad75cbc0ee3c19d6e0ad6b953cbc7
    Image:         public.ecr.aws/a2u7h5w3/bert-base-workshop:v11-torchserve-inf2
    Image ID:      public.ecr.aws/a2u7h5w3/bert-base-workshop@sha256:5841e70fa95efe1d62f38a51854c187b4af751c59b18fda59d59a2df8a2103e3
    Port:          8080/TCP
    Host Port:     0/TCP
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      StartError
      Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/local/bin/dockerd-entrypoint.sh": stat /usr/local/bin/dockerd-entrypoint.sh: no such file or directory: unknown
      Exit Code:   128
```
------
If the PR merge criteria are to have the torchserve-based API server work on x86_64 (CPU), Graviton, and Inf2 architectures, then the above issue must be resolved.
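One way to narrow this down is to inspect the image directly; the commands below are an illustrative diagnostic only (the image tag is taken from the error above, and they assume Docker access to the public ECR image):

```sh
# Check that the entrypoint script referenced by the error actually exists
# inside the image.
docker run --rm --entrypoint /bin/sh \
  public.ecr.aws/a2u7h5w3/bert-base-workshop:v11-torchserve-inf2 \
  -c 'ls -l /usr/local/bin/dockerd-entrypoint.sh'

# Also check that the image architecture matches the Graviton/Inf2 node;
# an architecture mismatch and a missing file can both surface as StartError.
docker inspect --format '{{.Architecture}}' \
  public.ecr.aws/a2u7h5w3/bert-base-workshop:v11-torchserve-inf2
```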
Also, the following commands need to be added to `3-pack/Dockerfile.torchserve`:

```dockerfile
...
LABEL description="Model $MODEL_NAME packed in a TorchServe container to run on $PROCESSOR"
# add these commands below
RUN mkdir -p /home/model-server/model-store
RUN wget https://torchserve.pytorch.org/mar_files/bert_seqc_without_torchscript.mar -O /home/model-server/model-store/BERTSC.mar
```

in order for the `./pack.sh` command to work.
@dzilbermanvmw Thanks for checking. I haven't tested them on both Inf2 and Graviton. Will look into these next week.
> @dzilbermanvmw Thanks for checking. I haven't tested them on both Inf2 and Graviton. Will look into these next week.

No problem @agunapal - that's what we're here for. FYI, on CPU instances the `pack.sh` command also works fine w/o modifying the `3-pack/Dockerfile.torchserve` and generates a deployable image. So CPU and GPU based instances are OK so far; Inf2 and Graviton are not yet.
Adding further details: for CPU, the build and deployment were successful. For GPU, the build was successful; however, for the deployment to succeed we had to comment out the line below in the limits section of the 4-deploy/app-bert-base-multilingual-cased-gpu-g4dn.xlarge/bert-base-multilingual-cased-gpu-0.yaml file.
```yaml
resources:
  limits:
    # nvidia.com/gpu: 1
```
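For context on the commented-out limit: if the GPU pod only schedules after removing `nvidia.com/gpu: 1`, a likely cause is that the node is not advertising the GPU resource (for example, the NVIDIA device plugin is not running). A quick, illustrative check:

```sh
# List nodes with their allocatable nvidia.com/gpu count; "<none>" or an empty
# value means the device plugin has not registered the GPU resource on that node,
# so any pod requesting the limit would stay Pending.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```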