
Failure to run models on ovms binary due to Memory Fault

Open adarshan-intel opened this issue 11 months ago • 3 comments

Description

Please read https://github.com/gramineproject/gsc/issues/229 for context.

I installed OpenVINO Model Server (OVMS) on bare metal, following the instructions in the OpenVINO documentation.

Steps to Reproduce:

  1. Download precompiled package:

    wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_ubuntu22.tar.gz
    tar -xzvf ovms_ubuntu22.tar.gz
    
  2. Or build it yourself:

    # Clone the model server repository
    git clone https://github.com/openvinotoolkit/model_server
    cd model_server
    # Build docker images (the binary is one of the artifacts)
    make docker_build PYTHON_DISABLE=1 RUN_TESTS=0
    # Unpack the package
    tar -xzvf dist/ubuntu22/ovms.tar.gz
    
  3. Install required libraries:

    sudo apt update -y && sudo apt install -y libxml2 curl
    
  4. Set path to the libraries and add binary to the PATH:

    export LD_LIBRARY_PATH=${PWD}/ovms/lib
    export PATH=$PATH:${PWD}/ovms/bin
    
  5. Create a Makefile and a manifest.template file (both attached in the logs).

  6. Run the script runs.sh (attached in the logs).
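
Before moving on to steps 5 and 6, it can help to confirm that step 4 took effect in the current shell. The helper below is only a sketch: `check_ovms_env` is a hypothetical name, and it assumes the library directory is the first entry in LD_LIBRARY_PATH, as in the export above.

```shell
# Hypothetical helper (not part of OVMS or Gramine): returns 0 if the ovms
# binary is on PATH and the first LD_LIBRARY_PATH entry is a directory.
check_ovms_env() {
    command -v ovms >/dev/null 2>&1 || {
        echo "ovms not found on PATH" >&2
        return 1
    }
    # Assumption: the OVMS lib directory is the first LD_LIBRARY_PATH entry.
    [ -d "${LD_LIBRARY_PATH%%:*}" ] || {
        echo "library directory missing" >&2
        return 1
    }
    echo "ovms environment looks usable"
}
```

Calling `check_ovms_env` in the same shell that will later run runs.sh shows whether the two exports are visible there.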

Issue:

OVMS crashes with a memory fault. The logs show the following error:

(libos_signal.c:351:memfault_upcall) [P1:T1:ovms] debug: memory fault at 0x73e13ed194f8 (IP = 0x73e1266fed2d)

Notes:

  • I initially thought this was an issue with GSC, but it is also reproducible on core Gramine, so I am raising it here as suggested in https://github.com/gramineproject/gsc/issues/229#issuecomment-2576894564
  • Both gramine-direct and gramine-sgx exhibit this issue.
  • Logs are attached: log.zip

adarshan-intel • Jan 21 '25 05:01

I am able to reproduce the issue with Gramine v1.7.

anjalirai-intel • Jan 23 '25 04:01

It looks like I have a related issue on v1.7 when running a Java application in debug mode. It might be a false positive, though.

Excerpt from the log, starting at line 90771:

(libos_signal.c:351:memfault_upcall) [P2:T33:java] debug: memory fault at 0x00000008 (IP = 0xb4399dd6)
(libos_context.c:279:prepare_sigframe) [P2:T33:java] debug: Created sigframe for sig: 11 at 0xc8e2b190 (handler: 0xc9cb67b0, restorer: 0xca3fcf00)
(libos_parser.c:1658:buf_write_all) [P2:T33:java] trace: ---- rt_sigprocmask(UNBLOCK, [SIGILL,SIGTRAP,SIGBUS,SIGFPE,SIGSEGV,], NULL, 0x8) = 0x0

Full log attached: sgxrunlog.zip. I'm going to see how it runs on 1.8 and then on master.

nmwael • Jan 27 '25 09:01

I looked a bit into this issue. It seems that when OVMS starts up, the ModelManager attempts to log the plugin configuration by calling into OpenVINO to retrieve the list of available devices (see https://github.com/openvinotoolkit/model_server/blob/1888055a9033242f30f2b68fb17e57d7965cd8fd/src/modelmanager.cpp#L159-L164). However, this loads and initializes all available plugins, regardless of the target_device specified on the OVMS command line. For a reason not yet understood, this fails in the OpenVINO Intel GPU plugin and causes a segmentation fault:

(libos_rtld.c:1053:register_library) [P1:T1:ovms] debug: glibc register library /home/adarsh2404/adarsh_2404/openvino_bm/ovms/lib/libopenvino_intel_gpu_plugin.so loaded at 0x73e1264d4000
(libos_parser.c:1701:buf_write_all) [P1:T1:ovms] trace: ---- mprotect(0x73e127ea3000, 0x7a000, PROT_READ) ...
(libos_parser.c:1701:buf_write_all) [P1:T1:ovms] trace: ---- return from mprotect(...) = 0x0
(libos_signal.c:351:memfault_upcall) [P1:T1:ovms] debug: memory fault at 0x73e13ed194f8 (IP = 0x73e1266fed2d)
(libos_signal.c:58:sighandler_kill) [P1:T1:ovms] debug: killed by signal 11

So a simple workaround would be to prevent the OpenVINO Intel GPU plugin library from being auto-loaded, e.g., by renaming it (mv ovms/lib/libopenvino_intel_gpu_plugin.so ovms/lib/libopenvino_intel_gpu_plugin.so.bk). Could you please retry with this workaround?
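
To make the rename easy to undo, the workaround can be wrapped in two small shell functions. This is only a sketch: the function names are hypothetical, and the default directory `ovms/lib` assumes the layout from the reproduction steps.

```shell
# Name of the plugin library mentioned in the log above.
PLUGIN=libopenvino_intel_gpu_plugin.so

# Rename the GPU plugin so OpenVINO cannot auto-load it.
# $1: library directory (assumption: defaults to ovms/lib).
disable_gpu_plugin() {
    lib_dir="${1:-ovms/lib}"
    if [ -f "$lib_dir/$PLUGIN" ]; then
        mv "$lib_dir/$PLUGIN" "$lib_dir/$PLUGIN.bk"
    fi
}

# Undo the rename once the experiment is over.
restore_gpu_plugin() {
    lib_dir="${1:-ovms/lib}"
    if [ -f "$lib_dir/$PLUGIN.bk" ]; then
        mv "$lib_dir/$PLUGIN.bk" "$lib_dir/$PLUGIN"
    fi
}
```

Run `disable_gpu_plugin` from the directory that contains `ovms/`, rerun the Gramine test, then call `restore_gpu_plugin` to put the library back.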

We could reach out to the OVMS/OpenVINO team to ask how to skip this automatic device detection and configuration logging in OVMS when a specific device backend is explicitly selected. If needed, we could also investigate further to understand why the libopenvino_intel_gpu_plugin library fails during initialization/registration, but since we're on CPU/SGX, this seems to be a lower priority.

kailun-qin • Feb 12 '25 04:02
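
As a possible starting point for that investigation, the log excerpt above can be cross-checked with a few lines of shell. This is a sketch under assumptions: both function names are hypothetical, and the awk patterns assume the debug-log line format shown in the excerpt. The first function prints the most recently registered library before each memory fault. The second computes the offset of the faulting IP from a library load address.

```shell
# Print the most recently registered library before each memory fault.
# $1: path to a Gramine debug log (line format assumed from the excerpt).
last_lib_before_fault() {
    awk '
        /register_library/ { lib = $0; sub(/.*register library /, "", lib) }
        /memfault_upcall/  { print "fault after loading: " lib }
    ' "$1"
}

# Offset of an address relative to a load address.
# $1: library load address, $2: faulting instruction pointer.
fault_offset() {
    printf '0x%x\n' $(( $2 - $1 ))
}

# Values copied from the log excerpt above.
fault_offset 0x73e1264d4000 0x73e1266fed2d   # prints 0x22ad2d
```

The last registered library is not necessarily the one that faults, so the offset serves as a second check: 0x22ad2d is below the 0x19cf000 offset of the mprotect call in the same excerpt, which suggests the IP lies inside the plugin's mapping. If the library carries symbols, a tool such as addr2line might then map that offset to a function.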