AESM Error while executing Confidential PyTorch Example in Docker

Hi, I am trying to run the end-to-end confidential pytorch example from this tutorial. I was able to run the non-confidential part of the tutorial using gramine-sgx, but I am running into the following error when trying to run the confidential example:

root@xyz:pytorch-confidential# gramine-sgx ./pytorch pytorchexample.py
Gramine is starting. Parsing TOML manifest file, this may take some time...
error: Cannot connect to AESM service (tried sgx_aesm_socket_base and /var/run/aesmd/aesm.socket UNIX sockets).
Please check its status! (`service aesmd status` on Ubuntu)
error: load_enclave() failed with error: No such file or directory (ENOENT)

When I try to run service aesmd status I get the following output:

root@xyz:pytorch-confidential# service aesmd status
aesmd: unrecognized service

I followed the tutorial and I can see that the sgx-aesm-service package is installed. The Dockerfile I am using to run Gramine is:

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive
ENV LC_ALL=C.UTF-8 LANG=C.UTF-8

# Main Dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        wget \
        gnupg \
        ca-certificates \
        software-properties-common \
        libnss-mdns \
        libnss-myhostname \
        git \
        curl \
        linux-headers-5.15.0-52-generic \
        openssh-client \
        screen \
        && apt-get clean && rm -rf /var/lib/apt/lists/*

# Gramine Dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        autoconf \
        bison \
        gawk \
        nasm \
        ninja-build \
        pkg-config \
        python3 \
        python3-click \
        python3-jinja2 \
        python3-pip \
        python3-pyelftools 

RUN python3 -m pip install 'meson>=0.56' 'tomli>=1.1.0' 'tomli-w>=0.4.0'

# Intel SGX-related Dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        libprotobuf-c-dev \
        protobuf-c-compiler \
        protobuf-compiler \
        python3-cryptography \
        python3-protobuf

# Intel SGX SDK/PSW
RUN ["/bin/bash", "-c", "set -o pipefail && echo 'deb [trusted=yes arch=amd64] https://download.01.org/intel-sgx/sgx_repo/ubuntu focal main' | tee /etc/apt/sources.list.d/intel-sgx.list"]
RUN ["/bin/bash", "-c", "set -o pipefail && wget -qO - https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add -"]

RUN apt-get update && apt-get install -y --no-install-recommends \
        libsgx-epid \
        libsgx-quote-ex \
        libsgx-dcap-ql \
        libsgx-quote-ex-dev \
        libsgx-qe3-logic \
        sgx-aesm-service 

# DCAP 
RUN curl -fsSLo /usr/share/keyrings/intel-sgx-deb.asc https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key
RUN apt-get update && apt-get install -y --no-install-recommends \
        libsgx-dcap-ql-dev \
        libsgx-dcap-quote-verify-dev \
        libsgx-dcap-default-qpl \
        libsgx-dcap-default-qpl-dev \
        && apt-get clean && rm -rf /var/lib/apt/lists/*

# Build and Install Gramine
ENV HOMEDIR=/home
ENV GRAMINEDIR=${HOMEDIR}/gramine

WORKDIR ${GRAMINEDIR}
RUN git clone https://github.com/gramineproject/gramine.git ${GRAMINEDIR} 

RUN meson setup build/ --buildtype=release -Ddirect=enabled -Dsgx=enabled -Ddcap=enabled 
RUN ninja -C build/ 
RUN ninja -C build/ install 
RUN gramine-sgx-gen-private-key

# Install PyTorch
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

The manifest template (edited as shown in the tutorial):

# SPDX-License-Identifier: LGPL-3.0-or-later

# PyTorch manifest template

loader.entrypoint = "file:{{ gramine.libos }}"
libos.entrypoint = "{{ entrypoint }}"

loader.log_level = "{{ log_level }}"

loader.env.LD_LIBRARY_PATH = "/lib:/usr/lib:{{ arch_libdir }}:/usr/{{ arch_libdir }}"
loader.env.HOME = "{{ env.HOME }}"

# Restrict the maximum number of threads to prevent insufficient memory
# issue, observed on CentOS/RHEL.
loader.env.OMP_NUM_THREADS = "8"

loader.insecure__use_cmdline_argv = true

fs.mounts = [
  { path = "{{ entrypoint }}", uri = "file:{{ entrypoint }}" },
  { path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
  { path = "/usr/lib", uri = "file:/usr/lib" },
  { path = "{{ arch_libdir }}", uri = "file:{{ arch_libdir }}" },
  { path = "/usr/{{ arch_libdir }}", uri = "file:/usr/{{ arch_libdir }}" },
{% for path in python.get_sys_path(entrypoint) %}
  { path = "{{ path }}", uri = "file:{{ path }}" },
{% endfor %}

  { type = "tmpfs", path = "/tmp" },

  { path = "/classes.txt", uri = "file:classes.txt", type = "encrypted" },
  { path = "/input.jpg", uri = "file:input.jpg", type = "encrypted" },
  { path = "/alexnet-pretrained.pt", uri = "file:alexnet-pretrained.pt", type = "encrypted" },
  
  { path = "/result.txt", uri = "file:result.txt", type = "encrypted" },
]

sgx.enclave_size = "4G"
sgx.max_threads = 32
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}

sgx.trusted_files = [
  "file:{{ entrypoint }}",
  "file:{{ gramine.libos }}",
  "file:{{ gramine.runtimedir() }}/",
  "file:/usr/lib/",
  "file:{{ arch_libdir }}/",
  "file:/usr/{{ arch_libdir }}/",
{% for path in python.get_sys_path(entrypoint) %}
  "file:{{ path }}{{ '/' if path.is_dir() else '' }}",
{% endfor %}

  "file:pytorchexample.py",

]

sgx.allowed_files = [
  "file:ssl/ca.crt",
]

sys.enable_extra_runtime_domain_names_conf = true

sgx.remote_attestation = "dcap"

loader.env.LD_PRELOAD = "libsecret_prov_attest.so"
loader.env.SECRET_PROVISION_CONSTRUCTOR = "1"
loader.env.SECRET_PROVISION_SET_KEY = "default"
loader.env.SECRET_PROVISION_CA_CHAIN_PATH = "ssl/ca.crt"
loader.env.SECRET_PROVISION_SERVERS = "localhost:4433"

# Gramine optionally provides patched OpenMP runtime library that runs faster inside SGX enclaves
# (add `-Dlibgomp=enabled` when configuring the build). Uncomment the line below to use the patched
# library. PyTorch's SGX perf overhead decreases on some workloads from 25% to 8% with this patched
# library. Note that we need to preload the library because PyTorch's distribution renames
# libgomp.so to smth like libgomp-7c85b1e2.so.1, so it's not just a matter of searching in the
# Gramine's Runtime path first, but a matter of intercepting OpenMP functions.
# loader.env.LD_PRELOAD = "/lib/libgomp.so.1"

I launch the provisioning server before I run the gramine commands and I can see it running in the background using the top command.

I am unsure why the service command cannot find the aesmd service. I can see that the container does indeed contain the following files:

/lib/systemd/system/aesmd.service
/etc/aesmd.conf
/opt/intel/sgx-aesm-service/aesm/aesm_service

The aesmd.conf file looks like this:

#Line with comments only

	  #empty line with comment
#proxy type    = direct #direct type means no proxy used
#proxy type    = default #system default proxy
#proxy type    = manual #aesm proxy should be specified for manual proxy type
#aesm proxy    = http://proxy_url:proxy_port
#whitelist url = http://sample_while_list_url/
#default quoting type = ecdsa_256
#default quoting type = epid_linkable
#default quoting type = epid_unlinkable
#qpl log level = error
#qpl log level = info

Have I done something wrong in the installation process, or is something extra required to make this work within a Docker container?

I appreciate any help you can provide.

Best, Asim.

Nov 15 '23 20:11 asim29

@asim29 Thanks for the question!

I think you'll also need to install the plugins of the AESM service (e.g., libsgx-aesm-launch-plugin, pls see the minimal Dockerfile to install Gramine and all required dependencies as a reference). Pls note that there's also a minimal script to restart the SGX-specific aesmd service.
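
For reference, here is a minimal sketch of what such a restart helper could look like in a container without an init system. It assumes the binary location listed earlier in this thread (/opt/intel/sgx-aesm-service/aesm/aesm_service) and that killall (from the psmisc package) is available. The names restart_aesm and AESM_DIR are made up for illustration, and this is not necessarily identical to the script in the Gramine repository.

```shell
#!/bin/sh
# Sketch only: AESM_DIR and restart_aesm are illustrative names, not part of any package.
AESM_DIR="${AESM_DIR:-/opt/intel/sgx-aesm-service/aesm}"

restart_aesm() {
    # Stop a previously started daemon, if any (killall comes from psmisc).
    killall -q aesm_service 2>/dev/null || true
    # Make sure the directory for the AESM UNIX socket exists.
    mkdir -p /var/run/aesmd
    # Run from the install directory with AESM_PATH and LD_LIBRARY_PATH pointing
    # at it; this mirrors (as an assumption) what the packaged systemd unit sets up.
    ( cd "$AESM_DIR" && AESM_PATH="$AESM_DIR" LD_LIBRARY_PATH="$AESM_DIR" ./aesm_service )
}
```

Calling restart_aesm once after the container starts, before running gramine-sgx, should be enough in this setup.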

Nov 16 '23 03:11 kailun-qin

+1 to what @kailun-qin said.

Also, to double-check whether the AESMD service is actually running, you can check for the existence of this file: /var/run/aesmd/aesm.socket. If this file doesn't exist, the AESMD service was not started.
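
A quick way to script that check, as a sketch (the helper name aesm_socket_status is made up for illustration; the default path is the one mentioned above):

```shell
# Sketch only: aesm_socket_status is an illustrative helper name.
aesm_socket_status() {
    sock="${1:-/var/run/aesmd/aesm.socket}"
    # -S is true only if the path exists and is a UNIX socket.
    if [ -S "$sock" ]; then
        echo "present"
    else
        echo "missing"
    fi
}
```

Running aesm_socket_status with no argument checks the default path; "missing" means the AESMD service has not created its socket.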

Nov 16 '23 06:11 dimakuv

Thank you for the response!

I included the plugins into my Dockerfile, and added the following lines:

# Install AESM Plugins
RUN apt-get update && apt-get install -y --no-install-recommends \
        libsgx-aesm-launch-plugin \
        libsgx-aesm-epid-plugin \
        libsgx-aesm-quote-ex-plugin \
        libsgx-aesm-ecdsa-plugin \
        libsgx-dcap-quote-verify \
        psmisc && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*

I also made sure to restart the AESMD service, and the service itself seems to be working (i.e., the /var/run/aesmd/aesm.socket file exists, and I can see it running when I use top). However, I'm getting a new error when I try to run gramine-sgx now:

Gramine is starting. Parsing TOML manifest file, this may take some time...
error: AESM service returned error 38; this may indicate that infrastructure for the DCAP attestation requested by Gramine is missing on this machine
error: load_enclave() failed with error: Operation not permitted (EPERM)

I've installed the libsgx-dcap-quote-verify-dev package (it's in the Dockerfile under #DCAP). I've also set the -Ddcap=enabled option while building with meson. Is there anything else I'm missing?

Nov 16 '23 19:11 asim29

@asim29 Have you installed the PCCS service? See https://www.intel.com/content/www/us/en/developer/articles/guide/intel-software-guard-extensions-data-center-attestation-primitives-quick-install-guide.html

Context: You need some service that constructs the Intel certificate chain for DCAP SGX Quotes. You can either install the PCCS service, or if you run on Microsoft Azure Confidential Computing VMs with SGX enabled, then it should be already set up to use Microsoft's own service.
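
To make the moving parts concrete: the quote provider library installed in the Dockerfile above (libsgx-dcap-default-qpl) has to be told where that service lives. Below is a minimal sketch, assuming a recent DCAP release that reads a JSON file at /etc/sgx_default_qcnl.conf, a PCCS listening on localhost:8081, and the v4 API path. The helper name write_qcnl_conf is made up for illustration, and the exact keys and API version should be checked against the installed DCAP release.

```shell
# Sketch only: write_qcnl_conf is an illustrative helper name.
write_qcnl_conf() {
    conf="${1:-/etc/sgx_default_qcnl.conf}"
    # use_secure_cert=false lets the library accept a self-signed PCCS
    # certificate; keep it true when the PCCS has a properly signed one.
    cat > "$conf" <<'EOF'
{
  "pccs_url": "https://localhost:8081/sgx/certification/v4/",
  "use_secure_cert": false
}
EOF
}
```

Passing a path as the first argument writes the file somewhere else, which is handy for inspecting the result before touching /etc.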

Nov 17 '23 08:11 dimakuv

Hi @dimakuv,

When I try to install the PCCS service, I get the following error:

Installing PCCS service ... failed.
Unsupported platform - neither systemctl nor initctl was found.
dpkg: error processing package sgx-dcap-pccs (--configure):
 installed sgx-dcap-pccs package post-installation script subprocess returned error exit status 5
Processing triggers for libc-bin (2.31-0ubuntu9.12) ...
Errors were encountered while processing:
 sgx-dcap-pccs
E: Sub-process /usr/bin/dpkg returned an error code (1)

I might need privileged access to the host machine I am running Docker on. A bit more context: I am trying to run this in a container on a Rootless Docker installation since I do not have root access to the host machine on which SGX is installed, and I do not have access to the system-wide Docker installation either.

I have been trying to figure out how to run systemctl within a Docker container but haven't been able to do that; I get the error described in this StackOverflow post when I try to run systemctl. It seems this isn't recommended.

Does the PCCS service need to be installed on the host machine? Is installing it in a Docker container running on a Rootless Docker installation possible?

Nov 23 '23 16:11 asim29

> Does the PCCS service need to be installed on the host machine? Is installing it in a Docker container running on a Rootless Docker installation possible?

@asim29 It can be installed and run w/ a Docker container, pls take https://github.com/intel/SGXDataCenterAttestationPrimitives/tree/master/QuoteGeneration/pccs/container as a reference.

Nov 24 '23 01:11 kailun-qin

Hi! Thank you for the response.

I have been trying to install PCCS within a Docker container using the Dockerfile as a reference, but I get an error with the make command on line 30 of the Dockerfile.

When I try to build the referenced image as it is (the first step in the readme), I get the same error:

Step 9/23 : RUN make
 ---> Running in d2caae537458
make[1]: Entering directory '/SGXDataCenterAttestationPrimitives/tools/PCKCertSelection/PCKCertSelectionLib'
../../../QuoteGeneration/buildenv.mk:71: /opt/intel/sgxsdk/buildenv.mk: No such file or directory
make[1]: *** No rule to make target '/opt/intel/sgxsdk/buildenv.mk'.  Stop.
make[1]: Leaving directory '/SGXDataCenterAttestationPrimitives/tools/PCKCertSelection/PCKCertSelectionLib'
make: *** [Makefile:78: PCKCertSelectionLib] Error 2
The command '/bin/sh -c make' returned a non-zero code: 2

Nov 29 '23 19:11 asim29

@asim29 Ah, this is a known issue. Pls kindly retry w/ the latest master branch of DCAP (as we just merged the fix: https://github.com/intel/SGXDataCenterAttestationPrimitives/commit/b2b7eba4c058a903826cacc94ba92b58a4e51803).

Nov 30 '23 06:11 kailun-qin

Thank you @kailun-qin

I managed to install the PCCS server, and it seems to be working. The output of the command curl -kv https://localhost:8081 inside my Docker container is:

*   Trying 127.0.0.1:8081...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8081 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=CA; ST=Ontario; L=Waterloo; O=University of Waterloo; OU=School of Computer Science; CN=Asim; [email protected]
*  start date: Nov 28 18:15:11 2023 GMT
*  expire date: Nov 27 18:15:11 2024 GMT
*  issuer: C=CA; ST=Ontario; L=Waterloo; O=University of Waterloo; OU=School of Computer Science; CN=Asim; [email protected]
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
> GET / HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.68.0
> Accept: */*
> 
2023-11-30 20:10:30.642 [info]: Client Request-ID : 1f2f84dc9f9d4e9780d83f76397d677c
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
2023-11-30 20:10:30.647 [info]: 127.0.0.1 - - [30/Nov/2023:20:10:30 +0000] "GET / HTTP/1.1" 404 139 "-" "curl/7.68.0"

* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< X-Powered-By: Express
< Request-ID: 1f2f84dc9f9d4e9780d83f76397d677c
< Content-Security-Policy: default-src 'none'
< X-Content-Type-Options: nosniff
< Content-Type: text/html; charset=utf-8
< Content-Length: 139
< Date: Thu, 30 Nov 2023 20:10:30 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
< 
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /</pre>
</body>
</html>
* Connection #0 to host localhost left intact

However, the AESM service error I indicated earlier is still there. When I run the command gramine-sgx ./pytorch pytorchexample.py I get the following error:

Gramine is starting. Parsing TOML manifest file, this may take some time...
error: AESM service returned error 38; this may indicate that infrastructure for the DCAP attestation requested by Gramine is missing on this machine
error: load_enclave() failed with error: Operation not permitted (EPERM)

The only difference in how I installed the PCCS server in my own Dockerfile is that I installed it as the root user rather than creating a dedicated user for it (I did this for simplicity and am not yet sure of the implications). Could that have caused a problem?
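
A side note on the curl check above: the 404 for GET / only shows that the server answers over TLS, since the response body says it cannot serve the root path. A slightly more telling probe, as a sketch, is to request an actual API route. The route used here (rootcacrl under the v4 API) and the helper name pccs_probe_url are assumptions; the version should be adjusted to whatever the installed PCCS serves.

```shell
# Sketch only: pccs_probe_url is an illustrative helper name.
pccs_probe_url() {
    host="${1:-localhost}"
    port="${2:-8081}"
    # Assumed route: root CA CRL endpoint of the PCCS v4 API.
    printf 'https://%s:%s/sgx/certification/v4/rootcacrl\n' "$host" "$port"
}

# -k skips certificate verification (the certificate here is self-signed);
# -f makes curl exit non-zero on HTTP errors such as 404.
# curl -kf "$(pccs_probe_url)"
```

The curl line is left commented out so the snippet can be sourced without making a network request.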

Nov 30 '23 20:11 asim29