Forked processes do not get updated file sizes when modified in a multi-process scenario
Description of the problem
We have an issue with multi-process filesystem access. Tested with 1.3.1 and also with the master branch, and with both chroot and encrypted mounts.
The problem is as follows:
- The parent and child process share a path to a filesystem object.
- The parent modifies the file after fork()ing the child.
- The child process does not see the updated size of the file.
Steps to reproduce
The original application has this scenario; below is a simplified C version.
gcc -o fs-ipc fs-ipc.c
./fs-ipc test-fs/test-file
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

/* File size via fseek()/ftell() on a stdio stream. */
unsigned long fsize_seek(const char* file) {
    FILE* f = fopen(file, "r");
    if (f == NULL)
        return 0;
    fseek(f, 0, SEEK_END);
    unsigned long len = (unsigned long)ftell(f);
    fclose(f);
    return len;
}

/* File size via stat(). */
off_t fsize_stat(const char* filename) {
    struct stat st;
    if (stat(filename, &st) == 0)
        return st.st_size;
    return -1;
}

/* Loop waiting for file changes, printing the size seen via both methods. */
void loop_wait(const char* filename) {
    for (int i = 0; i < 5; i++) {
        printf("Child file size seek : %lu\n", fsize_seek(filename));
        printf("Child file size stat : %ld\n", (long)fsize_stat(filename));
        sleep(1);
    }
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        printf("No filename supplied\n");
        return 1;
    }
    char* filename = argv[1];

    /* fork child process */
    pid_t pid = fork();
    if (pid != 0) {
        /* parent process modifies the file after a short delay */
        printf("Parent modifying file\n");
        sleep(1);
        FILE* fp = fopen(filename, "a");
        if (fp == NULL) {
            printf("Error opening file from parent!\n");
            return 1;
        }
        fprintf(fp, "Hello Gramine!");
        fclose(fp);
        wait(NULL);
        printf("Child Complete\n");
    } else {
        /* child process polls the file size */
        printf("Child process\n");
        loop_wait(filename);
    }
    return 0;
}
Manifest file:
loader.entrypoint = "file:{{ gramine.libos }}"
libos.entrypoint = "/app/fs-ipc"
loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}:/usr/lib:/usr/{{ arch_libdir }}"
loader.argv_src_file = "file:/app/app_args.txt"
loader.log_level = "debug"
sgx.enclave_size = "1G"
sgx.thread_num = 64
sgx.nonpie_binary = true
sgx.trusted_files = [
  "file:{{ gramine.libos }}",
  "file:{{ gramine.runtimedir() }}/",
  "file:{{ arch_libdir }}/",
  "file:/usr/",
  "file:/app/",
]
sgx.allowed_files = [
  "file:/etc/host.conf",
  "file:/etc/hosts",
  "file:/etc/nsswitch.conf",
  "file:/etc/resolv.conf",
  "file:/test-fs/",
]
fs.mounts = [
{ path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
{ path = "{{ arch_libdir }}", uri = "file:{{ arch_libdir }}" },
{ path = "/usr/{{ arch_libdir }}", uri = "file:/usr/{{ arch_libdir }}" },
{ path = "/app/", uri = "file:/app" },
{ path = "/etc/resolv.conf", uri = "file:/etc/resolv.conf" },
{ path = "/etc/nsswitch.conf", uri = "file:/etc/nsswitch.conf" },
{ path = "/etc/host.conf", uri = "file:/etc/host.conf" },
{ path = "/etc/hosts", uri = "file:/etc/hosts" },
{ path = "/test-fs", uri = "file:/test-fs" },
{ path = "/usr/", uri = "file:/usr/" },
]
We use a dockerized gramineos image (both the 1.3.1 and master branches):
FROM gramineos:master
COPY ./fs-ipc /app/fs-ipc
COPY fs-ipc.manifest.template /app/
COPY ./entrypoint.sh /app/
COPY ./key.pem /root/.config/gramine/enclave-key.pem
WORKDIR /app
RUN gramine-argv-serializer "/app/fs-ipc" "/test-fs/test-file" >app_args.txt
RUN gramine-manifest -Dlog_level=error -Darch_libdir=/lib/x86_64-linux-gnu fs-ipc.manifest.template fs-ipc.manifest \
&& gramine-sgx-sign --manifest fs-ipc.manifest --output fs-ipc.manifest.sgx
ENTRYPOINT ["sh", "entrypoint.sh"]
entrypoint.sh:
#!/bin/sh
/aesmd.sh
gramine-sgx-get-token --output fs-ipc.token --sig fs-ipc.sig
gramine-sgx fs-ipc
Expected results
The child process should see the updated file size. When the file is read, all updated bytes should be returned.
Actual results
The child process sees the original file size from before the parent's modification; reading the file returns only the original (truncated) content.
Gramine commit hash
e18bc05
Thanks for the great summary of the issue!
One thing is missing though, before we find the root cause: do you supply files from /test-fs/ as inputs (argv[1]) to your application?
Yes, I've updated the steps to reproduce. I also added a Dockerfile for your convenience.
Ok, I looked into this.
This is a complicated problem. Basically, Gramine has the inode concept for each file (regardless of whether the file is opened or not). And this inode is almost never freed in Gramine (I'm talking about chroot files only for now).
So the problem stems from the fact that once you opened some file (e.g. /test-fs/foo), this file's metadata -- including size -- just stays inside Gramine, in a corresponding inode object. Even if you close all FDs to this file -- effectively your app "forgets" about this file -- this file's inode is still kept alive in Gramine, for possible future accesses.
Now the hard part: we could try to update the inode during the open() callback (== chroot_open()). However, there may be another file description (libos_handle in Gramine parlance) that was opened and is currently attached to this inode (i.e., two opens of the same file). So if we just update inode->size based on the newly created handle, a simultaneous read() on another handle may segfault due to inconsistent state (because there is no lock that prevents such simultaneous operations).
We could introduce a special case: update the inode on chroot_open() if there are no other handles that use this inode currently. Then it is safe to update inode->size, it seems. This special case seems reasonable to me (and solves this GitHub issue); however, I didn't find a simple way to check if there are no other users of this inode currently.
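To sketch what I mean (all names below, the struct fields, lock()/unlock(), and query_host_file_size(), are made-up placeholders rather than the actual Gramine internals, and this assumes an inode-level lock exists, which is discussed further below):

struct libos_inode {
    lock_t   lock;         /* hypothetical per-inode lock */
    size_t   num_handles;  /* hypothetical count of attached handles */
    uint64_t size;         /* cached file size */
};

/* Refresh the cached size on open() only when no other handle is attached. */
static int maybe_refresh_inode_size(struct libos_inode* inode, const char* host_path) {
    lock(&inode->lock);
    if (inode->num_handles > 0) {
        /* Another handle is attached: updating inode->size here could race
         * with a concurrent read()/write() on that handle, so skip it. */
        unlock(&inode->lock);
        return 0;
    }
    uint64_t host_size;
    int ret = query_host_file_size(host_path, &host_size); /* hypothetical host query */
    if (ret == 0)
        inode->size = host_size;
    unlock(&inode->lock);
    return ret;
}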
Hi Dmitrii, as noticed from the test, a file newly created by any process is visible to siblings, children, or ancestors. So this metadata of the inode structure is kept in some kind of cache (dcache)? Can the same mechanism apply, with an eager fetch of the metadata upon opening the file, and leaving it to the application to take care of the semantics?
So this metadata of the inode structure is kept in some kind of cache (dcache)?
Yes.
...and leaving it to the application to take care of the semantics?
I'm unsure what you mean by this. Gramine can of course return errors to the app if Gramine notices that some flow is impossible/unsupported/seems buggy. But Gramine's own internal state must be kept consistent, and this is the problem that I currently see with what I described above.
I mean, all in all, the scenario that you have is reasonable and should be supported by Gramine. And I think it can be supported with some minor tweaks, just need to get the synchronization right...
@pwmarcz I know you haven't contributed to Gramine in a while, but maybe you could give your opinion on this problem?
@dimakuv I guess what we could do is retrieve data from the host more often, effectively "refreshing" the inode. That would mean higher overhead but maybe not much higher.
And for production use I think the problem is worse: you probably do not want to access a file like that using sgx.allowed_files, because we cannot trust that the host will not change it, which means you'd have to use encrypted files for this. But encrypted files do not support concurrent access at all.
I guess what we could do is retrieve data from the host more often, effectively "refreshing" the inode. That would mean higher overhead but maybe not much higher.
@pwmarcz But isn't the main problem in the fact that we don't have good means of locking such "refreshing" of the inode, with respect to other concurrent operations such as reading/writing the file? From a quick look at the code, I didn't find any lock that could be used to keep the inode state consistent across several threads.
@dimakuv @pwmarcz thanks for the quick response!
We could introduce a special case: update the inode on chroot_open() if there are no other handles that use this inode currently. Then it is safe to update inode->size, it seems. This special case seems reasonable to me (and solves this GitHub issue); however, I didn't find a simple way to check if there are no other users of this inode currently.
The above C code was only an example. In the real-life scenario there are multiple reads from child processes, so a quick fix for this specific example might not help.
I guess what we could do is retrieve data from the host more often, effectively "refreshing" the inode. That would mean higher overhead but maybe not much higher.
Does this mechanism ensure a consistent size? Isn't it possible that there are read and open calls in between refreshes?
But isn't the main problem in the fact that we don't have good means of locking such "refreshing" of the inode, with respect to other concurrent operations such as reading/writing the file? From a quick look at the code, I didn't find any lock that could be used to keep the inode state consistent across several threads.
I don't remember very well, but an inode should have a lock, and we take that lock every time we read (or update) the file size.
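If so, an eager refresh could reuse the same lock discipline. A rough illustration (again with made-up names, not the exact Gramine code):

/* Illustrative only: take the inode lock both when reading the cached size
 * and when refreshing it from the host, so the two operations cannot race. */
static uint64_t inode_get_size(struct libos_inode* inode) {
    lock(&inode->lock);
    uint64_t size = inode->size;
    unlock(&inode->lock);
    return size;
}

static void inode_set_size(struct libos_inode* inode, uint64_t host_size) {
    lock(&inode->lock);
    inode->size = host_size;
    unlock(&inode->lock);
}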
@pwmarcz @dimakuv So, as I understand it, we have two problems to solve. One is the propagation of mutations on the inode structure within an enclave (multiple threads in the same enclave applying changes); the other is our case scenario, where other processes (other enclaves) do not get the updated size when just trying to read the inode structure. So without an eager fetch each time, I don't see a way of solving the issue (short of some complex IPC sync mechanism). As for the in-enclave sync, there is a lock in the inode, as @pwmarcz mentioned.
our case scenario, where other processes (other enclaves) do not get the updated size when just trying to read the inode structure.
I think that's the same scenario... If we solve the "eager fetch" of inode->size in one Gramine enclave, then it will automatically help in your two-Gramine-enclaves scenario.
A similar issue happens in Gunicorn, see https://github.com/gramineproject/examples/pull/80.
Basically, Gunicorn has this logic (see the sketch below):
- the worker process is supposed to update the ctime of the shared file,
- the parent process verifies that the worker process is alive by checking the ctime of this file every second.
Sharing ctime file metadata between processes is currently not supported by Gramine.
(See also a similar issue https://github.com/gramineproject/gramine/issues/254.)
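For illustration, a minimal C sketch of that liveness pattern (Gunicorn itself does this from Python; the function names and timeout here are arbitrary):

#include <sys/stat.h>
#include <time.h>

/* Worker side: bump the shared file's ctime. A metadata-only change such as
 * chmod() with the unchanged mode still updates st_ctime on Linux. */
static int worker_heartbeat(const char* path) {
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return chmod(path, st.st_mode);
}

/* Parent side: consider the worker alive if ctime changed within the last
 * timeout_sec seconds. Under Gramine this check fails, because the ctime
 * update made by the worker process is not propagated to the parent. */
static int worker_alive(const char* path, time_t timeout_sec) {
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;
    return time(NULL) - st.st_ctime <= timeout_sec;
}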