Checkpoint feature tutorial does not work when run as a non-root user
Version of Apptainer
$ apptainer --version
apptainer version 1.3.3
Expected behavior
Expected to be able to reproduce the checkpointing example in the documentation, running all Apptainer commands with a non-privileged user.
Actual behavior
After running the apptainer checkpoint instance server command, the web server running in the instance crashes. Logs from the ~/.apptainer/instances/logs/{host_name}/{username}/server.err file:
127.0.0.1 - - [02/Sep/2024 10:28:27] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [02/Sep/2024 10:28:32] "GET / HTTP/1.1" 200 -
[2024-09-02T10:28:39.795, 41000, 41003, ERROR] at fileconnlist.cpp:428 in prepareShmList; REASON='JASSERT(fd != -1) failed'
(strerror((*__errno_location ()))) = Read-only file system
area.name = /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
python3.10: Terminating...
Backtrace:
1 jassert_internal::JAssert::~JAssert() in /.singularity.d/libs/libdmtcp.so 0x7f2515e572f1
2 dmtcp::FileConnList::prepareShmList() in /.singularity.d/libs/libdmtcp_ipc.so 0x7f25162c52de
3 dmtcp_FileConnList_EventHook(eDmtcpEvent, _DmtcpEventData_t*) in /.singularity.d/libs/libdmtcp_ipc.so 0x7f25162c68f7
4 dmtcp::PluginManager::eventHook(eDmtcpEvent, _DmtcpEventData_t*) in /.singularity.d/libs/libdmtcp.so 0x7f2515e26e57
5 dmtcp::DmtcpWorker::preCheckpoint() in /.singularity.d/libs/libdmtcp.so 0x7f2515e1dff4
6 in /.singularity.d/libs/libdmtcp.so 0x7f2515e2eab4
7 in /.singularity.d/libs/libdmtcp.so 0x7f2515e30c66
8 in /lib/x86_64-linux-gnu/libpthread.so.0 0x7f2515852fa3
9 clone in /lib/x86_64-linux-gnu/libc.so.6 0x7f25155f506f
Subsequent calls to apptainer checkpoint instance server produce the following output:
INFO: Using checkpoint "example-checkpoint"
Error, computation not in running state. Either a checkpoint is
currently happening or there are no connected processes.
When running the example as the root user, this error does not occur and I can reproduce the example, but the restart step does not work reliably (similar to the issue described here).
Steps to reproduce this behavior
Follow the instructions in the documentation, running every command as a non-root user.
DMTCP was built and installed from source at tag 3.0.0 of the GitHub repository.
What OS/distro are you running
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
How did you install Apptainer
wget https://github.com/apptainer/apptainer/releases/download/v1.3.3/apptainer_1.3.3_amd64.deb
sudo apt install -y ./apptainer_1.3.3_amd64.deb
Hello, looking to reproduce this. Did you build dmtcp with the --enable-static-libstdcxx flag?
Yes. This is how I built it:
#!/bin/bash
VERSION=3.0.0
apt install git gcc g++ make -y
apt install python3 -y
git clone https://github.com/dmtcp/dmtcp
cd dmtcp
git checkout $VERSION
./configure --enable-static-libstdcxx
make
make check # Optional
make install
echo /usr/local/lib/dmtcp > /etc/ld.so.conf.d/dmtcp.conf
ldconfig
Hmm, I cannot reproduce this issue. It looks like a permission problem, as shown in the trace:
[2024-09-02T10:28:39.795, 41000, 41003, ERROR] at fileconnlist.cpp:428 in prepareShmList; REASON='JASSERT(fd != -1) failed'
(strerror((*__errno_location ()))) = Read-only file system
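To make the failing path and error easier to spot in a long server.err, something like the following could be used. This is only a sketch: summarize_jassert is a hypothetical helper name, not part of Apptainer or DMTCP, and it assumes the log keeps the "area.name = ..." and "(strerror(...)) = ..." layout shown in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of Apptainer or DMTCP): print the file path and
# the errno text recorded with the last JASSERT failure in a log file.
# Assumes the "area.name = ..." and "(strerror(...)) = ..." line layout above.
summarize_jassert() {
    log_file="$1"
    # Path DMTCP was handling when the assertion fired (last occurrence wins).
    area=$(sed -n 's/^[[:space:]]*area\.name = //p' "$log_file" | tail -n 1)
    # Human-readable errno text reported alongside the assertion.
    reason=$(sed -n 's/^[[:space:]]*(strerror(.*)) = //p' "$log_file" | tail -n 1)
    printf '%s: %s\n' "$area" "$reason"
}
```

It would be called with the path to the instance's server.err file. For the log in this report, it points at the gconv-modules.cache file inside the container and the "Read-only file system" error.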
The documentation was updated in apptainer/apptainer-userdocs#300.