Seg fault running Assise as local FS
Hi folks,
I am trying to set up Assise to run as a local file system but I'm having trouble getting it to run. I've been able to successfully build Assise, configure storage, run mkfs, and start up the KernFS/SharedFS process. I followed the instructions at https://github.com/hayley-leblanc/assise#running-as-a-local-filesystem to configure Assise to run as a single local file system. When I try to run a program from libfs/tests (I've been using mkdir_user but have tried a few others), KernFS appears to segfault. I spent some time trying to figure out where the crash occurs without much luck, although it appears to happen before mkdir_user's main function actually runs.
I did make some small changes to Assise, although I don't think they are the cause of the issue. I want to run Assise on a very small emulated PM device (128 MB would be best, a couple GB at most) so I had to reduce the number of inodes and the size of each LibFS's log in order to prevent asserts from failing.
I'm running Assise on a QEMU/KVM virtual machine with 4 cores, 8GB of RAM, and Linux kernel 5.1. I've tried running it on 128MB, 1GB, 2GB, and 3GB of emulated PM and get the segmentation fault on all of them.
I also tried disabling the DISTRIBUTED compilation flag, but ran into build issues; I can post more details about that if I need to remove this flag to get things to work.
Thanks in advance for your help!
cc'ing Waleed. Waleed, not sure if you get these messages automatically.
I've added myself as a watcher, so I should be getting notifications.
@hayley-leblanc: There's no need to disable the DISTRIBUTED flag, as it has been deprecated. The steps you followed in the README should be sufficient. Since you've modified the storage configuration, I'd first double-check that you rebuilt both LibFS and KernFS and reran mkfs.sh successfully.
If you already did that, I'll likely need more context to know what might be causing this. Can you rerun KernFS in gdb and share the stack trace? You will need to first recompile KernFS with the -g flag.
I double checked that I cleaned and rebuilt LibFS and KernFS, ran change_dev_size.py, re-ran mkfs.sh, etc. with the new configurations, but I'm still running into the issue. Here's the output from running KernFS in gdb:
Starting program: /usr/bin/numactl -N0 -m0 kernfs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
process 3005 is executing new program: /home/novavm/vmshare/assise/kernfs/tests/kernfs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 3072 MB
[New Thread 0x7fff371ff700 (LWP 3009)]
[New Thread 0x7fff369fe700 (LWP 3010)]
[New Thread 0x7fff361fd700 (LWP 3011)]
[New Thread 0x7fff359fc700 (LWP 3012)]
[New Thread 0x7fff351fb700 (LWP 3013)]
[New Thread 0x7fff349fa700 (LWP 3014)]
[New Thread 0x7fff341f9700 (LWP 3015)]
[New Thread 0x7fff339f8700 (LWP 3016)]
[New Thread 0x7fff331f7700 (LWP 3017)]
Reading root inode with inum: 1fetching node's IP address..
Process pid is 3005
ip address on interface 'lo' is 127.0.0.1
cluster settings:
--- node 0 - ip:127.0.0.1
[New Thread 0x7fff329f6700 (LWP 3020)]
MLFS cluster initialized
[Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
Adding connection with sockfd: 0
[New Thread 0x7fff321f5700 (LWP 3031)]
Adding connection with sockfd: 1
RECV <-- MSG_INIT [pid 0]
[New Thread 0x7fff319f4700 (LWP 3032)]
[add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 3025)
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7fff30e0f000
RECV <-- MSG_INIT [pid 2]
Adding connection with sockfd: 2
SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0]
start shmem_poll_loop for sockfd 0
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:1 of type:2 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1]
start shmem_poll_loop for sockfd 1
[New Thread 0x7fff30bff700 (LWP 3033)]
RECV <-- MSG_INIT [pid 1]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:2 of type:1 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2]
start shmem_poll_loop for sockfd 2
00000000000000000000000000000001
[New Thread 0x7fff2ffff700 (LWP 3034)]
[New Thread 0x7fff2f7fe700 (LWP 3035)]
Adding connection with sockfd: 3
[New Thread 0x7fff2effd700 (LWP 3048)]
Adding connection with sockfd: 4
RECV <-- MSG_INIT [pid 0]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:3 of type:0 and peer:0x7fff30e0f000
[New Thread 0x7fff2e7fc700 (LWP 3049)]
SEND --> MSG_SHM [paths: /shm_recv_3|/shm_send_3]
Adding connection with sockfd: 5
RECV <-- MSG_INIT [pid 2]
start shmem_poll_loop for sockfd 3
[New Thread 0x7fff2dbff700 (LWP 3050)]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:4 of type:2 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_4|/shm_send_4]
start shmem_poll_loop for sockfd 4
RECV <-- MSG_INIT [pid 1]
[add_peer_socket():98] Established connection with 127.0.0.1 on sock:5 of type:1 and peer:0x7fff30e0f000
SEND --> MSG_SHM [paths: /shm_recv_5|/shm_send_5]
start shmem_poll_loop for sockfd 5
00000000000000000000000000000011
[New Thread 0x7fff2cdff700 (LWP 3051)]
[New Thread 0x7fff2c5fe700 (LWP 3052)]
Thread 17 "kernfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff2effd700 (LWP 3048)]
0x00007ffff7f4b9dd in init_replication (remote_log_id=remote_log_id@entry=2, peer=0x7ffff746d0c0, begin=begin@entry=644609, size=size@entry=906753, addr=addr@entry=0, end=0x7fff2dc0f020) at ./global/mem.h:36
36 return calloc(1, size);
And the stack trace:
#0 0x00007ffff7f4b9dd in init_replication (
remote_log_id=remote_log_id@entry=2, peer=0x7ffff746d0c0,
begin=begin@entry=644609, size=size@entry=906753, addr=addr@entry=0,
end=0x7fff2dc0f020) at ./global/mem.h:36
#1 0x00007ffff7f4d24b in register_peer_log (peer=0x7fff30e0f000,
find_id=<optimized out>) at distributed/peer.c:271
#2 0x00007ffff7f57d31 in signal_callback (msg=0x7ffff789f008) at fs.c:2389
#3 0x00007ffff7b11e09 in shmem_poll_loop (sockfd=sockfd@entry=3)
at shmem_ch.c:106
#4 0x00007ffff7b121a6 in local_server_thread (arg=<optimized out>)
at shmem_ch.c:339
#5 0x00007ffff7d18609 in start_thread (arg=<optimized out>)
at pthread_create.c:477
#6 0x00007ffff7e54293 in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
It seems your segfault was due to an outdated mkdir_user test. It was calling init_fs() explicitly, which is not needed in Assise because LibFS calls this function automatically. I've introduced a patch that addresses this.
Please pull and rebuild LibFS, KernFS, and the tests directory. Let me know if you're still having issues.
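For readers hitting a similar problem: the KernFS log above shows two rounds of MSG_INIT connections (sockfd 0 to 2, then 3 to 5), which would be consistent with initialization running twice, once automatically and once from the explicit init_fs() call. The following is a minimal sketch of one generic way to make an init routine safe against a second call, using pthread_once. It is not the patch applied to Assise, and every name in it (fs_init_once, do_fs_init, fs_init_calls) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins; these are NOT Assise's real symbols. */
static pthread_once_t fs_once = PTHREAD_ONCE_INIT;
static int fs_init_count = 0; /* number of times the real setup ran */

/* One-time setup, e.g. connecting to the server and registering logs. */
static void do_fs_init(void)
{
    /* If this ever fires, the guard below was bypassed. */
    assert(fs_init_count == 0);
    fs_init_count++;
}

/* Safe to call from both an automatic library hook and a test's main():
 * pthread_once guarantees do_fs_init runs at most once per process. */
void fs_init_once(void)
{
    pthread_once(&fs_once, do_fs_init);
}

/* Exposed only so a caller can check how often setup actually ran. */
int fs_init_calls(void)
{
    return fs_init_count;
}
```

With a guard like this, a stale test that still calls the init function explicitly becomes a harmless no-op instead of registering the same process twice. Removing the redundant call from the test, as the patch does, is still the cleaner fix.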