
Forked processes do not get updated file sizes when modified in a multi-process scenario

Open sibiantony opened this issue 2 years ago • 12 comments

Description of the problem

We have an issue with multi-process file system access. Tested with 1.3.1 and also with the master branch, and with both chroot and encrypted mounts.

The problem is as follows:

  1. The parent and child processes share a path to a filesystem object.
  2. The parent modifies the file after fork()ing the child.
  3. The child process does not get the updated size of the file.

Steps to reproduce

Original application has this scenario. This is a simplified C version.

gcc -o fs-ipc fs-ipc.c
./fs-ipc test-fs/test-file

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

unsigned long fsize_seek(const char *file) {
  FILE *f = fopen(file, "r");
  if (f == NULL)
    return 0;
  fseek(f, 0, SEEK_END);
  unsigned long len = (unsigned long)ftell(f);
  fclose(f);
  return len;
}

off_t fsize_stat(const char *filename) {
  struct stat st;
  if (stat(filename, &st) == 0)
    return st.st_size;
  return -1;
}

/* Loop waiting for file changes */
void loop_wait(const char *filename) {
  for (int i = 0; i < 5; i++) {
    printf("Child file size seek : %lu\n", fsize_seek(filename));
    printf("Child file size stat : %ld\n", (long)fsize_stat(filename));
    sleep(1);
  }
}

int main(int argc, char *argv[]) {
  if (argc != 2) {
    printf("No filename supplied\n");
    return 1;
  }
  const char *filename = argv[1];

  /* fork child process */
  pid_t pid = fork();
  if (pid != 0) {
    printf("Parent modifying file\n");
    /* parent process modifies the file */
    sleep(1);
    FILE *fp = fopen(filename, "a");
    if (fp == NULL) {
      printf("Error opening file from parent!\n");
      return 1;
    }
    fprintf(fp, "Hello Gramine!");
    fclose(fp);
    wait(NULL);
    printf("Child complete\n");
  } else {
    printf("Child process\n");
    loop_wait(filename);
  }
  return 0;
}

Manifest file:

loader.entrypoint = "file:{{ gramine.libos }}"
libos.entrypoint = "/app/fs-ipc"
loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}:/usr/lib:/usr/{{ arch_libdir }}"
loader.argv_src_file = "file:/app/app_args.txt"
loader.log_level = "debug"

sgx.enclave_size = "1G"
sgx.thread_num = 64
sgx.nonpie_binary = true

sgx.trusted_files = [
  "file:{{ gramine.libos }}",
  "file:{{ gramine.runtimedir() }}/",
  "file:{{ arch_libdir }}/",
  "file:/usr/",
  "file:/app/",
]

sgx.allowed_files = [
  "file:/etc/host.conf",
  "file:/etc/hosts",
  "file:/etc/nsswitch.conf",
  "file:/etc/resolv.conf",
  "file:/test-fs/",
]

fs.mounts = [
  { path = "/lib",                   uri = "file:{{ gramine.runtimedir() }}" },
  { path = "{{ arch_libdir }}",      uri = "file:{{ arch_libdir }}" },
  { path = "/usr/{{ arch_libdir }}", uri = "file:/usr/{{ arch_libdir }}" },
  { path = "/app/",                  uri = "file:/app" },
  { path = "/etc/resolv.conf",       uri = "file:/etc/resolv.conf" },
  { path = "/etc/nsswitch.conf",     uri = "file:/etc/nsswitch.conf" },
  { path = "/etc/host.conf",         uri = "file:/etc/host.conf" },
  { path = "/etc/hosts",             uri = "file:/etc/hosts" },
  { path = "/test-fs",               uri = "file:/test-fs" },
  { path = "/usr/",                  uri = "file:/usr/" },
]

We use a dockerized gramineos image (both the 1.3.1 and master branches). Dockerfile:

FROM gramineos:master
COPY ./fs-ipc /app/fs-ipc
COPY fs-ipc.manifest.template /app/
COPY ./entrypoint.sh /app/


COPY ./key.pem /root/.config/gramine/enclave-key.pem
WORKDIR /app
RUN gramine-argv-serializer "/app/fs-ipc" "/test-fs/test-file" >app_args.txt

RUN gramine-manifest -Dlog_level=error -Darch_libdir=/lib/x86_64-linux-gnu fs-ipc.manifest.template fs-ipc.manifest \
    && gramine-sgx-sign --manifest fs-ipc.manifest --output fs-ipc.manifest.sgx

ENTRYPOINT ["sh", "entrypoint.sh"]
entrypoint.sh:

#!/bin/sh

/aesmd.sh
gramine-sgx-get-token --output fs-ipc.token --sig fs-ipc.sig
gramine-sgx fs-ipc

Expected results

The child process should get the updated file size, and reading the file should return all of the updated bytes.

Actual results

The child process gets the original file size from before the parent's modification, and reading the file returns only the truncated (stale) content.

Gramine commit hash

e18bc05

sibiantony avatar Jan 19 '23 11:01 sibiantony

Thanks for the great summary of the issue!

One thing is missing though, before we find the root cause: do you supply files from /test-fs/ as inputs (argv[1]) to your application?

dimakuv avatar Jan 19 '23 11:01 dimakuv

Yes, I've updated the steps to reproduce. Also updated a Dockerfile for your convenience.

sibiantony avatar Jan 19 '23 11:01 sibiantony

Ok, I looked into this.

This is a complicated problem. Basically, Gramine has the inode concept for each file (regardless of whether the file is opened or not). And this inode is almost never freed in Gramine (I'm talking about chroot files only for now).

So the problem stems from the fact that once you opened some file (e.g. /test-fs/foo), then this file's metadata -- including size -- just stays inside Gramine, in a corresponding inode object. Even if you close all FDs to this file -- effectively your app "forgets" about this file -- this file's inode is still kept alive in Gramine, for possible future accesses.

Now the hard part: we could try to update the inode during the open() callback (== chroot_open()). However, there may be another file description (libos_handle in Gramine parlance) that was opened earlier and is currently attached to this inode (i.e., two opens of the same file). So if we just update inode->size based on the newly created handle, a simultaneous read() on another handle may segfault due to inconsistent state (because there is no lock that prevents such simultaneous operations).

We could introduce a special case: update inode on chroot_open() if there are no other handles that use this inode currently. Then it is safe to update inode->size, it seems. This special case seems reasonable to me (and solves this GitHub issue), however, I didn't find a simple way to check if there are no other users of this inode currently.

dimakuv avatar Jan 19 '23 12:01 dimakuv

Hi Dmitrii, as noticed from the test, a file newly created by any process is visible to siblings, children, and ancestors. So is this metadata of the inode structure kept in some kind of cache (dcache)? Could the same mechanism apply, with an eager fetch of the metadata upon opening the file, leaving it to the application to take care of the semantics?

ckaratzas avatar Jan 19 '23 12:01 ckaratzas

So these metadata of the inode structure are kept in some kind of cache (dcache)?

Yes.

...and leaving it to the application to take care of the semantics?

I'm unsure what you mean by this. Gramine can of course return errors to the app if Gramine notices that some flow is impossible/unsupported/seems buggy. But Gramine's own internal state must be kept consistent, and this is the problem that I currently see with what I described above.

I mean, all in all, the scenario that you have is reasonable and should be supported by Gramine. And I think it can be supported with some minor tweaks, just need to get the synchronization right...

@pwmarcz I know you haven't contributed to Gramine in a while, but maybe you could give your opinion on this problem?

dimakuv avatar Jan 19 '23 13:01 dimakuv

@dimakuv I guess what we could do is retrieve data from the host more often, effectively "refreshing" the inode. That would mean higher overhead but maybe not much higher.

And for production use I think the problem is worse: you probably do not want to access a file like that using sgx.allowed_files, because we cannot trust the host will not change it, which means you'd have to use encrypted files for this. But encrypted files do not support concurrent access at all.

pwmarcz avatar Jan 19 '23 13:01 pwmarcz

I guess what we could do is retrieve data from the host more often, effectively "refreshing" the inode. That would mean higher overhead but maybe not much higher.

@pwmarcz But isn't the main problem in the fact that we don't have good means of locking such "refreshing" of the inode, with respect to other concurrent operations such as reading/writing the file? From a quick look at the code, I didn't find any lock that could be used to keep the inode state consistent across several threads.

dimakuv avatar Jan 19 '23 14:01 dimakuv

@dimakuv @pwmarcz thanks for the quick response!

We could introduce a special case: update inode on chroot_open() if there are no other handles that use this inode currently. Then it is safe to update inode->size, it seems. This special case seems reasonable to me (and solves this GitHub issue), however, I didn't find a simple way to check if there are no other users of this inode currently.

The above C code was only an example. In the real-life scenario there are multiple reads from child processes, so a quick fix for this specific example might not help.

I guess what we could do is retrieve data from the host more often, effectively "refreshing" the inode. That would mean higher overhead but maybe not much higher.

Does this mechanism ensure a consistent size? Isn't it possible that reads and opens happen between refreshes?

sibiantony avatar Jan 19 '23 14:01 sibiantony

But isn't the main problem in the fact that we don't have good means of locking such "refreshing" of the inode, with respect to other concurrent operations such as reading/writing the file? From a quick look at the code, I didn't find any lock that could be used to keep the inode state consistent across several threads.

I don't remember very well, but an inode should have a lock, and we take that lock every time we read (or update) the file size.

pwmarcz avatar Jan 19 '23 14:01 pwmarcz

@pwmarcz @dimakuv So as I understand it, we have two problems to solve: (1) propagating mutations of the inode structure within one enclave (multiple threads in the same enclave applying changes), and (2) our scenario, where other processes (other enclaves) do not get the updated size when they just read the inode structure. Without an eager fetch each time, I don't see a way of solving the issue (short of some complex IPC sync mechanism). As for the in-enclave sync, there is a lock in the inode, as @pwmarcz mentioned.

ckaratzas avatar Jan 19 '23 14:01 ckaratzas

our case scenario that processes (other enclaves) do not get the updated size when just trying to read the inode structure.

I think that's the same scenario... If we solve the "eager fetch" of inode->size in one Gramine enclave, then it will automatically help in your two-Gramine-enclaves scenario.

dimakuv avatar Jan 19 '23 14:01 dimakuv

A similar issue happens in Gunicorn, see https://github.com/gramineproject/examples/pull/80.

Basically, Gunicorn has this logic:

  • the worker process is supposed to update the ctime of the shared file,
  • the parent process verifies that the worker process is alive by checking the ctime of this file every second.

Sharing of ctime file metadata between processes is currently not supported by Gramine.

(See also a similar issue https://github.com/gramineproject/gramine/issues/254.)

dimakuv avatar Jan 25 '24 10:01 dimakuv