ml-commons
[BUG] model deployment fails as pytorch files fail to download in restricted environments
What is the bug? PyTorch native libraries are downloaded as part of DJL initialization. In Kubernetes environments with restricted egress, this fails with the error below.
[2025-03-18T21:35:32,414][ERROR][o.o.m.e.a.DLModel ] [opensearch-cluster-nodes-0] Failed to deploy model 3WHcqpUBVdmFfogFGkie
ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[pytorch-engine-0.31.1.jar:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[pytorch-engine-0.31.1.jar:?]
at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[pytorch-engine-0.31.1.jar:?]
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[pytorch-engine-0.31.1.jar:?]
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[pytorch-engine-0.31.1.jar:?]
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[pytorch-engine-0.31.1.jar:?]
at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[api-0.31.1.jar:?]
at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:188) ~[opensearch-ml-algorithms-2.19.1.0.jar:?]
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) [opensearch-ml-algorithms-2.19.1.0.jar:?]
at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) [?:?]
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) [opensearch-ml-algorithms-2.19.1.0.jar:?]
at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:142) [opensearch-ml-algorithms-2.19.1.0.jar:?]
at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:144) [opensearch-ml-algorithms-2.19.1.0.jar:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$49(MLModelManager.java:1274) [opensearch-ml-2.19.1.0.jar:2.19.1.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.1.jar:2.19.1]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$77(MLModelManager.java:2150) [opensearch-ml-2.19.1.0.jar:2.19.1.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.1.jar:2.19.1]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.19.1.jar:2.19.1]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.1.jar:2.19.1]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.1.jar:2.19.1]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.net.ConnectException: Connection timed out
at java.base/sun.nio.ch.Net.connect0(Native Method) ~[?:?]
at java.base/sun.nio.ch.Net.connect(Net.java:589) ~[?:?]
at java.base/sun.nio.ch.Net.connect(Net.java:578) ~[?:?]
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:583) ~[?:?]
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?]
at java.base/java.net.Socket.connect(Socket.java:751) ~[?:?]
at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304) ~[?:?]
at java.base/sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:181) ~[?:?]
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:183) ~[?:?]
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:531) ~[?:?]
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:636) ~[?:?]
at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264) ~[?:?]
at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:377) ~[?:?]
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:193) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1252) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1138) ~[?:?]
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:179) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1690) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1614) ~[?:?]
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:223) ~[?:?]
at ai.djl.util.Utils.openUrl(Utils.java:519) ~[api-0.31.1.jar:?]
at ai.djl.util.Utils.openUrl(Utils.java:498) ~[api-0.31.1.jar:?]
at ai.djl.util.Utils.openUrl(Utils.java:487) ~[api-0.31.1.jar:?]
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:424) ~[pytorch-engine-0.31.1.jar:?]
... 22 more
[2025-03-18T21:35:32,451][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model 3WHcqpUBVdmFfogFGkie
org.opensearch.ml.common.exception.MLException: Failed to deploy model 3WHcqpUBVdmFfogFGkie
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:300) ~[?:?]
at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:142) ~[?:?]
at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:144) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$49(MLModelManager.java:1274) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.1.jar:2.19.1]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$77(MLModelManager.java:2150) [opensearch-ml-2.19.1.0.jar:2.19.1.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.1.jar:2.19.1]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.19.1.jar:2.19.1]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.1.jar:2.19.1]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.1.jar:2.19.1]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[?:?]
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[?:?]
at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:188) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
... 14 more
Caused by: java.net.ConnectException: Connection timed out
at java.base/sun.nio.ch.Net.connect0(Native Method) ~[?:?]
at java.base/sun.nio.ch.Net.connect(Net.java:589) ~[?:?]
at java.base/sun.nio.ch.Net.connect(Net.java:578) ~[?:?]
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:583) ~[?:?]
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?]
at java.base/java.net.Socket.connect(Socket.java:751) ~[?:?]
at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304) ~[?:?]
at java.base/sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:181) ~[?:?]
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:183) ~[?:?]
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:531) ~[?:?]
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:636) ~[?:?]
at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264) ~[?:?]
at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:377) ~[?:?]
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:193) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1252) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1138) ~[?:?]
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:179) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1690) ~[?:?]
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1614) ~[?:?]
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:223) ~[?:?]
at ai.djl.util.Utils.openUrl(Utils.java:519) ~[?:?]
at ai.djl.util.Utils.openUrl(Utils.java:498) ~[?:?]
at ai.djl.util.Utils.openUrl(Utils.java:487) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:424) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[?:?]
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[?:?]
at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:188) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
... 14 more
[2025-03-18T21:35:32,459][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-cluster-nodes-0] deploy model task
How can one reproduce the bug? Probably the simplest would be:
- Clear the OpenSearch data/ml_cache directory.
- Disconnect from the internet.
- Start OpenSearch and attempt to deploy a custom model from a local directory.
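The steps above can be sketched as a shell snippet. The install path is an assumption (a standard tarball layout), and the commented curl uses the standard ml-commons model deploy endpoint:

```shell
# Repro sketch. OPENSEARCH_HOME is an assumption (tarball default shown).
OPENSEARCH_HOME="${OPENSEARCH_HOME:-/usr/share/opensearch}"

# 1. Clear the cached DJL/PyTorch files.
rm -rf "$OPENSEARCH_HOME/data/ml_cache"

# 2. Disconnect from the internet (or block egress), then restart OpenSearch.

# 3. Deploy a model previously registered from local files:
# curl -k -u admin:<admin-password> -X POST \
#   "https://localhost:9200/_plugins/_ml/models/<model_id>/_deploy"
```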
What is the expected behavior? The OpenSearch image should contain the necessary dependencies, or provide a way to add them at image build or run time.
When using opensearch operator:
- It's not possible to use custom init containers to copy files from another image to data/ml_cache/pytorch.
- It's not possible to bake the PyTorch files into the data/ directory at build time, as it is an empty volume mounted into the container.
Either ml_cache/pytorch needs to live outside of the data/ directory, or the OpenSearch image should include the necessary files.
What is your host/environment? Ubuntu 24.04
Do you have any screenshots? N/A
Do you have any additional context? It is critical for some of our customers to limit egress.
Hi guys, let me share another way to reproduce this issue:
- Turn off the internet connection
- Allow registering models from local files:
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_local_file": true,
    "plugins.ml_commons.allow_registering_model_via_url": true
  }
}
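Outside of Dev Tools, the same request can be sent with curl. The endpoint and settings are taken from the snippet above; the cluster URL and credentials are placeholders:

```shell
# Apply the two ml-commons settings via the cluster settings API.
# The cluster URL and credentials in the commented curl are placeholders.
SETTINGS='{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_local_file": true,
    "plugins.ml_commons.allow_registering_model_via_url": true
  }
}'
# curl -k -u admin:<admin-password> -X PUT "https://localhost:9200/_cluster/settings" \
#   -H "Content-Type: application/json" -d "$SETTINGS"
echo "$SETTINGS"
```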
- Use opensearch-py-ml to register the model:
from opensearchpy import OpenSearch
from opensearch_py_ml.ml_commons import MLCommonClient

client = OpenSearch(
    hosts=[CLUSTER_URL],
    http_auth=(username, password),
    verify_certs=False,
)
ml_client = MLCommonClient(client)

model_path = 'sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip'  # the path to your model
model_config_path = 'config.json'  # the path to your model config

model_id = ml_client.register_model(
    model_path=model_path,
    model_config_path=model_config_path,
    isVerbose=True,
)
@maxlepikhin I uploaded the local PyTorch dependencies to data/ml_cache/pytorch, and everything works fine.
If you need help, please let me know - [email protected]
@jngz-es @dhrubo-os @ylwu-amzn I believe this is not an issue, because storing the local PyTorch dependencies in the project isn't memory efficient, so please close it.
Thanks @Yerzhaisang for raising and investigating this issue. I agree with you that it is not efficient to download everything and cache it. I am closing the issue. Feel free to reopen it if there are any questions.
@jngz-es @dhrubo-os @ylwu-amzn @Yerzhaisang how are the torch/DJL binaries scanned for CVEs if they are downloaded at runtime?
For others who hit the same issue in containerized environments, the solution is to copy the required files into a custom OpenSearch image:
#!/usr/bin/env bash
set -e

# PyTorch version used by DJL, defined at:
# https://github.com/deepjavalibrary/djl/blob/41f75681aab8708c375e94f0a99ad7673a74f7ae/bom/build.gradle.kts#L135
PYTORCH_VERSION="1.13.1"
# DJL version used by ml-commons, defined at:
# https://github.com/opensearch-project/ml-commons/blob/5bb035e2f5edb8ea936faedb403d6414695463fe/ml-algorithms/build.gradle#L48
DJL_VERSION="0.31.1"

CACHE_DIR="./data/ml_cache/pytorch"
INDEX_FILE="${CACHE_DIR}/${PYTORCH_VERSION}.txt"
BASE_URL="https://publish.djl.ai/pytorch/${PYTORCH_VERSION}"

# Supported platforms and flavors
PLATFORMS=("linux-x86_64" "linux-aarch64")
FLAVORS=("cpu" "cpu-precxx11")

# Ensure the cache directory exists
mkdir -p "$CACHE_DIR"

# Download the index file if it does not exist
if [[ ! -f "$INDEX_FILE" ]]; then
  echo "Downloading index file..."
  curl -fsSL "${BASE_URL}/files.txt" -o "${INDEX_FILE}"
fi

# Decode URL-encoded filenames (fix %2B -> +, etc.)
decode_url() {
  printf '%b\n' "${1//%/\\x}"
}

# Download and extract all necessary files for each platform/flavor combination
for PLATFORM in "${PLATFORMS[@]}"; do
  for FLAVOR in "${FLAVORS[@]}"; do
    DEST_DIR="${CACHE_DIR}/${PYTORCH_VERSION}-${FLAVOR}-${PLATFORM}"
    mkdir -p "$DEST_DIR"

    # Download the DJL JNI library.
    # Example: https://publish.djl.ai/pytorch/1.13.1/jnilib/0.31.1/linux-x86_64/cpu/libdjl_torch.so
    JNI_URL="${BASE_URL}/jnilib/${DJL_VERSION}/${PLATFORM}/${FLAVOR}/libdjl_torch.so"
    DEST_FILE="${DEST_DIR}/${DJL_VERSION}-libdjl_torch.so"
    echo "Downloading ${JNI_URL} ..."
    set +e
    curl -fSL "$JNI_URL" -o "$DEST_FILE"
    CURL_EXIT_CODE=$?
    set -e
    if [[ $CURL_EXIT_CODE -ne 0 ]]; then
      # The cpu flavor and osx are not available; report an error and continue.
      echo "--- Failed to download ${JNI_URL}"
    fi

    # Download the PyTorch native binaries.
    echo "Downloading PyTorch native libraries for ${FLAVOR} on ${PLATFORM}..."
    while IFS= read -r line; do
      if [[ "$line" == "${FLAVOR}/${PLATFORM}/"* ]]; then
        FILE_NAME=$(basename "$line" .gz)
        DECODED_FILE_NAME=$(decode_url "$FILE_NAME")  # fix URL-encoded C++ filenames
        URL="${BASE_URL}/${line}"
        DEST_FILE="${DEST_DIR}/${DECODED_FILE_NAME}"
        echo "Downloading ${URL} -> ${DEST_FILE}..."
        curl -fsSL "${URL}" | gunzip -c > "${DEST_FILE}"
        chmod 644 "${DEST_FILE}"
      fi
    done < "$INDEX_FILE"
    echo "PyTorch native libraries downloaded to ${DEST_DIR}"
  done
done
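To turn the downloaded cache into a custom image, a minimal sketch could look like the following. The base image tag and install prefix are assumptions, and note the operator caveat above: with the Kubernetes operator, the data/ directory may still be masked by an empty volume.

```shell
# Hedged sketch: generate a Dockerfile that bakes the cache produced by the
# download script into a custom image. Base image tag and paths are assumptions.
cat > Dockerfile <<'EOF'
FROM opensearchproject/opensearch:2.19.1
# DJL checks data/ml_cache/pytorch before attempting any download.
COPY --chown=opensearch:opensearch data/ml_cache /usr/share/opensearch/data/ml_cache
EOF
# docker build -t opensearch-offline-pytorch:2.19.1 .
cat Dockerfile
```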
@maxlepikhin WhiteSource Security Check does its job very well.
Thanks @maxlepikhin, I'm glad to know that your problem is solved, and thanks for sharing the solution here as well. Could you please raise a PR to add documentation about this in the docs directory?
Regarding your question about CVE checks, I don't think there's any CVE check at runtime. However, we use the same torch version in opensearch-py-ml, which is scanned for CVE issues.
Please let me know if that answers your question. Thanks.
@dhrubo-os you are welcome. The question about CVEs was not to learn how the scans are done, but to point out that anybody who takes a dependency on the OpenSearch Docker image will scan it for vulnerabilities and will miss the DJL and PyTorch binaries downloaded at runtime.
Yeah, agreed, which is why we use the same torch version that was scanned in the py-ml repo. But if you find a better way to add this at compile time, please feel free to raise a PR.