
Cannot run program "/usr/bin/python3"...Too many levels of symbolic links

Open babeal opened this issue 1 year ago • 2 comments

🐛 Describe the bug

When running TorchServe, the process logs show `java.io.IOException: error=40, Too many levels of symbolic links`. This is caused by the creation of self-referential links in the /tmp/models/01c0...ef81 directories. I'm sure the fix is something simple, but I'm not exactly sure what it is. It happens when running the command below; the offending link looks like this:

lrwxrwxrwx  1 ubuntu ubuntu   21 Oct  4 21:36 llama-2-7b-neuronx-b1 -> llama-2-7b-neuronx-b1
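A quick way to confirm a link points back at itself is a sketch like the following, using only standard `find`/`readlink`/`basename`; the `MODEL_STORE` path is an example, not something TorchServe defines:

```shell
# Sketch: flag any symlink in the model store whose target is its own name
# (i.e. a self-referential link that can never resolve).
MODEL_STORE=.
for link in "$MODEL_STORE"/*; do
  [ -L "$link" ] || continue          # only inspect symlinks
  target=$(readlink "$link")          # raw link target, unresolved
  if [ "$(basename "$link")" = "$target" ]; then
    echo "self-referential: $link -> $target"
  fi
done
```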

Command

torchserve --ncs --start --model-store . --models llama-2-7b-neuronx-b1

Error logs

2023-10-04T21:39:28,020 [ERROR] W-9033-llama_2_7b_chat_hf_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
	at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:181) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:339) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.io.IOException: Cannot run program "/usr/bin/python3" (in directory "/tmp/models/a5f239bbcf3744c1ab9f97845f1e9ffb/llama-2-7b-chat-hf"): error=40, Too many levels of symbolic links
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1143) ~[?:?]
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1073) ~[?:?]
	at java.lang.Runtime.exec(Runtime.java:594) ~[?:?]
	at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:163) ~[model-server.jar:?]

Installation instructions

Installed TorchServe from nightly 2023.9.20; also tried the latest nightly.

Model Packaging

torch-model-archiver --model-name llama-2-13b-neuronx-b1 --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
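With `--archive-format no-archive`, the archiver writes the model as a plain directory containing a `MAR-INF/MANIFEST.json` rather than a `.mar` file, so one sanity check before serving is a sketch like this (the directory name is just the example from the command above):

```shell
# Sanity-check sketch: a no-archive model directory should contain
# MAR-INF/MANIFEST.json for TorchServe to register it properly.
if [ -f llama-2-13b-neuronx-b1/MAR-INF/MANIFEST.json ]; then
  echo "manifest present"
else
  echo "manifest missing"
fi
```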

config.properties

none

Versions

torch-model-archiver==0.8.2

Python version: 3.10 (64-bit runtime) Python executable: /usr/bin/python3

Versions of relevant python libraries:

numpy==1.21.6
psutil==5.9.5
requests==2.31.0
requests-unixsocket==0.3.0
sentencepiece==0.1.99
torch==1.13.1
torch-model-archiver==0.8.2
torch-neuronx==1.13.1.1.11.0
torch-xla==1.13.1+torchneuronb
torchserve-nightly==2023.9.20
torchvision==0.15.2
transformers==4.33.3
transformers-neuronx==0.7.84
wheel==0.37.1

**Warning: torchtext not present
**Warning: torchaudio not present

Java Version:

OS: Ubuntu 22.04.3 LTS GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: N/A CMake version: version 3.27.6

Repro instructions

see https://pytorch.org/blog/high-performance-llama/

Possible Solution

No response

babeal avatar Oct 04 '23 22:10 babeal

cc @namannandan

agunapal avatar Oct 04 '23 22:10 agunapal

So I created a folder at ~/model_store and then moved the model folder from ~/repos/llama-2-7b-neuronx-b1 to ~/model_store/llama-2-7b-neuronx-b1. The following command now works: torchserve --ncs --start --model-store ~/model_store --models llama-2-7b-neuronx-b1. The repos folder contained other files and folders unrelated to the torch project, but that's really the only difference. One other thing I observed: if I ran the command with --models all, it would register every folder as a possible model, whether or not it had a MAR-INF manifest. This suggests that extraneous files in the root of the --model-store path cause the issue, but I'm not sure.
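The workaround above boils down to the following steps (paths are the examples from this thread; the key point is that the model store directory contains nothing but model folders):

```shell
# Workaround sketch: give TorchServe a dedicated, otherwise-empty model store.
mkdir -p ~/model_store
mv ~/repos/llama-2-7b-neuronx-b1 ~/model_store/llama-2-7b-neuronx-b1
torchserve --ncs --start --model-store ~/model_store --models llama-2-7b-neuronx-b1
```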

babeal avatar Oct 05 '23 12:10 babeal