Changes to support TorchServe on CPU & GPU
*What is the PR about?*
This PR integrates TorchServe with this solution.
- Supports CPU & GPU
- Tested with `./test.sh run bmk`
From a UX point of view, the user only needs to set `model_server=torchserve` in `config.properties`; the rest of the flow is the same.
Currently, this is supported for CPU only
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
**CPU Logs**

```
kubectl logs bert-base-workshop-0-7kmgk -n mpi
/app/tests /home/model-server
Configuring number of model servers from config.properties ...
Number of model servers (1) configured from environment ...
Namespace(url='http://bert-base-multilingual-cased-cpu-[INSTANCE_IDX].mpi.svc.cluster.local:8080/predictions/model[MODEL_IDX]', num_thread=2, latency_window_size=1000, throughput_time=180, throughput_interval=10, is_multi_instance=True, n_instance=1, is_multi_model_per_instance=True, n_model_per_instance=1, post=True, verbose=True, cache_dns=True, model_server='torchserve')
caching dns
http://bert-base-multilingual-cased-cpu-0.mpi.svc.cluster.local:8080/predictions/model0
http://10.100.115.203:8080/predictions/model0
<Response [200]>
{'pid': 6, 'throughput': 0.0, 'p50': '0.000', 'p90': '0.000', 'p95': '0.000', 'errors': '0'}
{}
{}
{'pid': 6, 'throughput': 5.7, 'p50': '0.358', 'p90': '0.402', 'p95': '0.414', 'errors': '0'}
{'p90_0_0': '0.402'}
{'num_0_0': 57}
{'pid': 6, 'throughput': 5.9, 'p50': '0.318', 'p90': '0.392', 'p95': '0.412', 'errors': '0'}
{'p90_0_0': '0.392'}
{'num_0_0': 116}
{'pid': 6, 'throughput': 6.4, 'p50': '0.311', 'p90': '0.387', 'p95': '0.409', 'errors': '0'}
{'p90_0_0': '0.387'}
{'num_0_0': 180}
```
**GPU Logs**

```
kubectl logs bert-base-workshop-0-7l2rg -n mpi
/app/tests /home/model-server
Configuring number of model servers from config.properties ...
Number of model servers (1) configured from environment ...
Namespace(url='http://bert-base-multilingual-cased-gpu-[INSTANCE_IDX].mpi.svc.cluster.local:8080/predictions/model[MODEL_IDX]', num_thread=2, latency_window_size=1000, throughput_time=180, throughput_interval=10, is_multi_instance=True, n_instance=1, is_multi_model_per_instance=True, n_model_per_instance=1, post=True, verbose=True, cache_dns=True, model_server='torchserve')
caching dns
http://bert-base-multilingual-cased-gpu-0.mpi.svc.cluster.local:8080/predictions/model0
http://10.100.120.85:8080/predictions/model0
<Response [200]>
{'pid': 6, 'throughput': 0.0, 'p50': '0.000', 'p90': '0.000', 'p95': '0.000', 'errors': '0'}
{}
{}
{'pid': 6, 'throughput': 92.5, 'p50': '0.021', 'p90': '0.025', 'p95': '0.027', 'errors': '0'}
{'p90_0_0': '0.025'}
{'num_0_0': 925}
{'pid': 6, 'throughput': 98.3, 'p50': '0.020', 'p90': '0.022', 'p95': '0.025', 'errors': '0'}
{'p90_0_0': '0.024'}
{'num_0_0': 1908}
{'pid': 6, 'throughput': 100.1, 'p50': '0.020', 'p90': '0.021', 'p95': '0.021', 'errors': '0'}
{'p90_0_0': '0.023'}
{'num_0_0': 2909}
```
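As a sanity check independent of the benchmark harness, the endpoint exercised in these logs can also be hit manually. The sketch below is illustrative only: `sample_input.txt` is a placeholder payload, the hostname is taken from the GPU logs above, and the commands assume they are run from a pod inside the cluster.

```sh
# Resolve the per-instance service hostname that the benchmark derives from the
# URL template (mirrors the "caching dns" step shown in the logs above).
HOST=bert-base-multilingual-cased-gpu-0.mpi.svc.cluster.local
getent hosts "$HOST"

# TorchServe serves inference on port 8080 at /predictions/<model_name>;
# the benchmark POSTs to model0 on each instance.
curl -s -X POST "http://$HOST:8080/predictions/model0" -T sample_input.txt
```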
So far we have successfully verified this PR on CPU only, with both `fastapi` and `torchserve` settings for the new parameter in `config.properties`:

```
# model_server = fastapi|torchserve
model_server=torchserve
```
We are planning to release an update for the related AWS guidance shortly, containing other important changes but not yet including this PR; after that we will focus on merging this PR once additional testing on other architectures (AWS Graviton, GPU, etc.) is done. Thanks
Update 4/17/24: tested this PR using images built for the "torchserve" API server on AWS Graviton and Inferentia 2 based nodes. In both cases there were run-time container errors like the following:
```
containers:
  main:
    Container ID:  containerd://42bfe08ada826553ffe57ba56dd93627d71ad75cbc0ee3c19d6e0ad6b953cbc7
    Image:         public.ecr.aws/a2u7h5w3/bert-base-workshop:v11-torchserve-inf2
    Image ID:      public.ecr.aws/a2u7h5w3/bert-base-workshop@sha256:5841e70fa95efe1d62f38a51854c187b4af751c59b18fda59d59a2df8a2103e3
    Port:          8080/TCP
    Host Port:     0/TCP
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      StartError
      Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/local/bin/dockerd-entrypoint.sh": stat /usr/local/bin/dockerd-entrypoint.sh: no such file or directory: unknown
      Exit Code:   128
```
------
If the PR merge criteria are to have the torchserve-based API server work on x86_64 (CPU), Graviton, and Inf2 architectures, then the above issue must be resolved.
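One way to narrow this down is to inspect the image directly; the commands below are an illustrative diagnostic only (the image tag is taken from the error above, and they assume Docker access to the public ECR image):

```sh
# Check that the entrypoint script referenced by the error actually exists
# inside the image.
docker run --rm --entrypoint /bin/sh \
  public.ecr.aws/a2u7h5w3/bert-base-workshop:v11-torchserve-inf2 \
  -c 'ls -l /usr/local/bin/dockerd-entrypoint.sh'

# Also check that the image architecture matches the Graviton/Inf2 node;
# an architecture mismatch and a missing file can both surface as StartError.
docker inspect --format '{{.Architecture}}' \
  public.ecr.aws/a2u7h5w3/bert-base-workshop:v11-torchserve-inf2
```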
Also, the following commands need to be added to `3-pack/Dockerfile.torchserve`:

```dockerfile
...
LABEL description="Model $MODEL_NAME packed in a TorchServe container to run on $PROCESSOR"
# add these commands below
RUN mkdir -p /home/model-server/model-store
RUN wget https://torchserve.pytorch.org/mar_files/bert_seqc_without_torchscript.mar -O /home/model-server/model-store/BERTSC.mar
```

in order for the `./pack.sh` command to work.
@dzilbermanvmw Thanks for checking. I haven't tested them on both Inf2 and Graviton. Will look into these next week.
> @dzilbermanvmw Thanks for checking. I haven't tested them on both Inf2 and Graviton. Will look into these next week.

No problem @agunapal - that's what we're here for. FYI, on CPU instances the `pack.sh` command also works fine w/o modifying the `3-pack/Dockerfile.torchserve` and generates a deployable image. So CPU and GPU based instances are OK so far; Inf2 and Graviton are not yet.
Adding further details: for CPU, the build and deployment were successful. For GPU, the build was successful; however, for the deployment to succeed we had to comment out the line below in the limits section of the 4-deploy/app-bert-base-multilingual-cased-gpu-g4dn.xlarge/bert-base-multilingual-cased-gpu-0.yaml file.
```yaml
resources:
  limits:
    # nvidia.com/gpu: 1
```
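For context on the commented-out limit: if the GPU pod only schedules after removing `nvidia.com/gpu: 1`, a likely cause is that the node is not advertising the GPU resource (for example, the NVIDIA device plugin is not running). A quick, illustrative check:

```sh
# List nodes with their allocatable nvidia.com/gpu count; "<none>" or an empty
# value means the device plugin has not registered the GPU resource on that node,
# so any pod requesting the limit would stay Pending.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```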