Newer Base Image KServe Container fails with exec /usr/local/bin/dockerd-entrypoint.sh: exec format error
🐛 Describe the bug
The public TorchServe KFS image that was recently updated for 0.10.0 has ubuntu:20.04 as its base.
$ docker image inspect pytorch/torchserve-kfs:0.10.0 | grep "org.opencontainers.image.version"
"org.opencontainers.image.version": "20.04"
Intel is publishing an Intel-optimized version of both the torchserve and torchserve-kfs images, which includes Intel Extension for PyTorch. However, due to Intel's Security First policies, we use ubuntu:22.04 as our base image for both containers (soon to be ubuntu:24.04).
When we deploy with the latest 0.10.0 version of torchserve on kserve, the image immediately enters the CrashLoopBackOff state due to the following error: exec /usr/local/bin/dockerd-entrypoint.sh: exec format error.
We determined that changing the base back to ubuntu:20.04 resolves the issue; however, this means that anyone who intends to create a custom torchserve-kfs container won't be able to use the ubuntu:rolling base specified in https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L19.
This issue is not present in the previous version my team published, only with the latest kserve and torchserve versions, and I was only able to reproduce it in my cluster, not from the command line.
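For anyone triaging the same exec format error on a shell-script entrypoint: the kernel raises it when it can't interpret the file, e.g. a missing or mangled shebang, CRLF line endings, an empty file, or a binary built for a different architecture. A quick sanity check against whatever image you built (the image name here is a placeholder):
$ docker run --rm --entrypoint /bin/bash <your-kfs-image> -c \
    'head -1 /usr/local/bin/dockerd-entrypoint.sh | cat -A; wc -c /usr/local/bin/dockerd-entrypoint.sh'
A healthy script prints something like #!/bin/bash$ (cat -A marks the newline with $); a ^M before the $ means CRLF line endings, and a byte count of 0 means the file never made it into the image.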
Error logs
When using ubuntu:23.10 as the base, it fails at build time:
$ ./build-image.sh
...
#11 4.706 Downloading grpcio-tools-1.48.2.tar.gz (2.2 MB)
#11 4.827 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 18.7 MB/s eta 0:00:00
#11 5.054 Preparing metadata (setup.py): started
#11 5.230 Preparing metadata (setup.py): finished with status 'error'
#11 5.234 error: subprocess-exited-with-error
#11 5.234
#11 5.234 × python setup.py egg_info did not run successfully.
#11 5.234 │ exit code: 1
#11 5.234 ╰─> [16 lines of output]
#11 5.234 /home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py:30: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
#11 5.234 import pkg_resources
#11 5.234 Traceback (most recent call last):
#11 5.234 File "<string>", line 2, in <module>
#11 5.234 File "<pip-setuptools-caller>", line 34, in <module>
#11 5.234 File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 180, in <module>
#11 5.234 if check_linker_need_libatomic():
#11 5.234 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234 File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 91, in check_linker_need_libatomic
#11 5.234 cpp_test = subprocess.Popen([cxx, '-x', 'c++', '-std=c++14', '-'],
#11 5.234 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234 File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
#11 5.234 self._execute_child(args, executable, preexec_fn, close_fds,
#11 5.234 File "/usr/lib/python3.11/subprocess.py", line 1950, in _execute_child
#11 5.234 raise child_exception_type(errno_num, err_msg, err_filename)
#11 5.234 FileNotFoundError: [Errno 2] No such file or directory: 'c++'
#11 5.234 [end of output]
#11 5.234
#11 5.234 note: This error originates from a subprocess, and is likely not a problem with pip.
#11 5.236 error: metadata-generation-failed
#11 5.236
#11 5.236 × Encountered error while generating package metadata.
#11 5.236 ╰─> See above for output.
#11 5.236
#11 5.236 note: This is an issue with the package mentioned above, not pip.
#11 5.236 hint: See above for details.
------
executor failed running [/bin/bash -c pip install -r requirements.txt]: exit code: 1
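The traceback above is pip falling back to a source build of grpcio-tools (no prebuilt wheel for this Python version) and failing because the image has no C++ compiler. A plausible workaround, not verified against this Dockerfile, is to install a compiler toolchain in the image's package-install step before pip install -r requirements.txt runs:
$ apt-get update && apt-get install -y --no-install-recommends g++
Alternatively, pinning a grpcio-tools version that ships a prebuilt wheel for the image's Python version would avoid the source build entirely.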
But I am more interested in the output with ubuntu:22.04, which fails during deployment:
$ kubectl logs vqi-predictor-00001-deployment-8f6cd7bd7-9hl84
Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
exec /usr/local/bin/dockerd-entrypoint.sh: exec format error
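Since exec format error is also the classic symptom of an architecture mismatch, it's worth comparing the image platform against the cluster nodes (image name again a placeholder):
$ docker image inspect --format '{{.Os}}/{{.Architecture}}' <your-kfs-image>
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture'
If these two disagree, that alone explains the error.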
Installation instructions
Install TorchServe from source? No
Are you using Docker? Yes
Model Packaging
n/a
config.properties
n/a
Versions
With ubuntu:22.04 as base
$ python ts_scripts/print_env_info.py
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.10.0
torch-model-archiver==0.10.0
Python version: 3.10 (64-bit runtime)
Python executable: /home/venv/bin/python
Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.26.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu
Java Version:
OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A
Environment:
library_path (LD_/DYLD_):
With ubuntu:20.04 as base
$ python ts_scripts/print_env_info.py
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.10.0
torch-model-archiver==0.10.0
Python version: 3.8 (64-bit runtime)
Python executable: /home/venv/bin/python
Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.24.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu
Java Version:
OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A
Environment:
library_path (LD_/DYLD_):
Repro instructions
From https://github.com/intel/ai-containers:
- Clone the repository
- Install docker-compose (see the main README.md)
- Build the Intel TorchServe container:
    export REGISTRY=intel
    export REPO=aiops/mlops-ci
    cd pytorch
    docker compose up --build torchserve
- Set up the KServe build:
    - Comment out these lines: https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L4-L5
    - docker tag intel/aiops/mlops-ci:b-0-ubuntu-22.04-pip-py3.10-torchserve intel/torchserve:latest
- Build the KServe container:
    cd serving
    ./build-kfs.sh
- Push to an internal registry
- Modify the kserve-torchserve ClusterServingRuntime to use the new image (see the sketch after this list)
- Deploy any example endpoint
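For the ClusterServingRuntime step, a minimal sketch of swapping in the custom image with kubectl patch (the registry path, tag, and container index are assumptions; adjust to match your runtime spec):
$ kubectl patch clusterservingruntime kserve-torchserve --type=json \
    -p='[{"op": "replace", "path": "/spec/containers/0/image", "value": "<internal-registry>/torchserve-kfs:custom"}]'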
Possible Solution
No response
Before it gets asked: yes, I have tried to capture logs from within the deployed container; however, the container does not even start, so no other logs are recorded (other than the liveness probe and queue-proxy failures).
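For reference, the usual follow-up diagnostics (pod name taken from the logs above) only repeat the same exec format error and the probe failures just mentioned:
$ kubectl describe pod vqi-predictor-00001-deployment-8f6cd7bd7-9hl84
$ kubectl logs vqi-predictor-00001-deployment-8f6cd7bd7-9hl84 -c kserve-container --previous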
Thanks for reporting, looking into this. I was able to repro the error. Earlier we didn't move to 22.04 as the ubuntu 22.04 runners were flaky. I will try running CI on 22.04 to see if it's resolved now.
@tylertitsworth Please pull the submodules before you build the kfs image:
git submodule update --init --recursive
I am able to build it with 22.04 after doing this:
$ docker image inspect pytorch/torchserve-kfs:latest-cpu | grep "org.opencontainers.image.version"
"org.opencontainers.image.version": "22.04"
@agunapal In the build script I use to build this container, I already pull the submodules (https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L9).
I am able to build the container; my issue is when it is deployed to k8s.
@agunapal any update on this? Is there any misunderstanding I can help alleviate?
Hi @tylertitsworth I understand the problem. I will get back to you this week.
On ubuntu 22.04, I tried running the gRPC test cases; these worked:
test_gRPC_inference_api.py::test_inference_apis PASSED [ 21%]
test_gRPC_inference_api.py::test_inference_stream_apis 2024-04-06T18:20:11,945 [INFO ] W-9024-echo_stream_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9024-echo_stream_1.0-stderr
PASSED [ 21%]
test_gRPC_inference_api.py::test_inference_stream2_apis PASSED [ 22%]
test_gRPC_management_apis.py::test_management_apis PASSED
So it may be something specific to docker/kserve. I will try the steps you have mentioned.
This issue has been remediated in the latest version of torchserve.