Slurm notifier plugin prevents use of Slurm job containers
When the Slurm notifier plugin is used with Slurm's job containers, jobs fail to start. As seen below, the failure is ultimately in a setns() system call. The error happens early during job initialization and is repeatable. I believe the failure is caused by threads created when the ldms libraries are loaded. Two places I see threads created in library constructors are:
- in cs_init() from ldms.c, which creates a delete_thread
- in zap.c, which registers a pthread_atfork handler that creates a zap_event_thread_proc
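For readers unfamiliar with the pattern, here is a minimal illustration (my own sketch, not the actual ldms/zap code) of a thread started from a shared-library constructor; any process that merely loads such a library is already multithreaded before it gets a chance to call setns():
/* Sketch only -- not the actual ldms/zap code. A thread started from a
 * library constructor runs as soon as the library is loaded. */
#include <pthread.h>
#include <unistd.h>

static pthread_t delete_thread; /* name borrowed from the description above */

static void *delete_proc(void *arg)
{
	while (1)
		sleep(1); /* placeholder for the real work */
	return NULL;
}

__attribute__((constructor))
static void lib_init(void)
{
	pthread_create(&delete_thread, NULL, delete_proc, NULL);
}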
$ salloc -N1 -p xfer -w gert02
salloc: Granted job allocation 2751
salloc: Waiting for resource configuration
salloc: Nodes gert02 are ready for job
srun: error: gert02: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=2751.interactive
slurmstepd: error: container_p_join: setns failed for /var/opt/cray/persistent/tmpfs/2751/.ns: Invalid argument
slurmstepd: error: container_g_join failed: 2751
salloc: Relinquishing job allocation 2751
salloc: Job allocation 2751 has been revoked.
I haven't tried it, but you should be able to reproduce this by calling ldms_stream_publish followed by setns from a standalone executable. (Only loading the libraries should do it, too.)
@eric-roman,
I'd like to try to reproduce it, but I need more info:
- What is a Slurm job container? Is it a Slurm feature that I need to configure in order to use it?
- What transport plugin are you using for slurm_notifier?
- What is the branch? (OVIS-4?)
Thanks!
@narategithub
- It's a Slurm feature that needs to be configured. We saw it when using the job_container/tmpfs plugin. Looking in our slurm.conf I see the line: JobContainerType=job_container/tmpfs. The configuration is described in more detail here. A trimmed-down version of the job_container.conf used above looks like this:
  $ cat /etc/slurm/job_container.conf
  NodeName=gert02 BasePath=/var/opt/cray/persistent/tmpfs
- We are using the sock transport for the slurm notifier.
- We're running the OVIS-4.3.7 tag with some site-specific plugins patched in.
Appreciate the quick response!
@eric-roman Thanks! I'll look into it. Last question: What is the version of Slurm you're using?
@tom95858 @eric-roman
FYI, the error is confirmed. I've tested with slurm 20.11.8 and OVIS-4.3.7 on the OGC cygnus cluster.
[salloc] (git-branch: --)
narate@cygnus-08 ~/slurm
$ srun bash
srun: error: cygnus-01: task 0: Exited with exit code 1
slurmstepd: error: container_p_join: setns failed for /slurm-tmp/5/.ns: Invalid argument
slurmstepd: error: container_g_join failed: 5
Without slurm_notifier in the plugstack.conf, I don't see this error. I'll dig more into it tomorrow.
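In case it helps with reproducing, the entry that loads the plugin in plugstack.conf is a standard SPANK line of the form below; the library path and plugin arguments are placeholders, not the actual configuration from either site:
optional /path/to/libslurm_notifier.so <plugin args>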
For reference we first found this problem using slurm 20.11.8
@eric-roman I think you're right. The multi-threading is the cause. According to the setns(2) man page:
A multithreaded process may not change user namespace with setns().
I also verified this with a very simple test program that creates an extra thread (that does nothing) before setns(). The setns() call fails with EINVAL. When I didn't create the extra thread, it joins the mount namespace successfully.
The current nature of ldms/zap needs those threads for application callbacks ... so I'm not quite sure how we go from here.
@tom95858 Do you think it is wise to implement a socket-only library that talks ldmsd_request/stream to ldmsd w/o using LDMS and zap to work around the issue? I believe slurm_notifier only uses ldmsd_stream send.
This isn't quite a fix, but I did try moving the ldms_xprt_close to the end of send_event in slurm_notifier. That seemed to kill off the callback threads.
However, the delete_thread threads created when libldms loads were still around. If I understood correctly, those delete_thread threads are related to managing metric sets, so delaying their creation until a metric set is created may be an option for those. (Or move the ldms_xprt functions needed for libldmsd_stream to a separate library?)
Hi Narate,
I think this may have to do with the fact that there is already an open connection on one of the threads that would become invalid after the setns() call.
A quick look at the code suggests that we never close the connections we create in send_event(). So these connections are left hanging around. When a subsequent setns() occurs, it can't place these open connections in the new namespace and setns() fails.
Please try destroying all transports at the end of send_event() and see if this changes things.
-- Thomas Tucker, President, Open Grid Computing, Inc.
Pfft, Eric is way ahead of me...
-- Thomas Tucker, President, Open Grid Computing, Inc.
But my speculation is that it's not the threads, it's the connection that owns resources that are not available in the new namespace...
-- Thomas Tucker, President, Open Grid Computing, Inc.
Sure. I'll try it. But closing the transport doesn't destroy the threads though (and it should not, or it is a bug 😅). The setns(fd, CLONE_NEWNS) seems to really not like multiple threads, as the following test program succeeded when pthread_create() was not called, and failed when pthread_create() was called:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <pthread.h>
pthread_t thr;
void cat(const char *path)
{
FILE *f = fopen(path, "r");
char buf[BUFSIZ] = "";
assert(f);
printf("-- cat %s --\n", path);
while (fgets(buf, sizeof(buf), f)) {
printf("%s", buf);
}
printf("-------------------\n");
fclose(f);
}
void *do_nothing(void *arg)
{
while (1) {
sleep(1);
}
return NULL;
}
int main(int argc, char **argv)
{
char buf[BUFSIZ] = "";
int ns_fd, rc;
#if 0
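/* Change "#if 0" to "#if 1" to create the extra thread; with the extra
 * thread running, the setns() call below fails with EINVAL. */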
rc = pthread_create(&thr, NULL, do_nothing, NULL);
assert(rc == 0); /* pthread_create */
#endif
printf("before setns\n");
cat("/tmp/PRIVATE");
ns_fd = open("ns", O_RDONLY);
assert(ns_fd >= 0);
rc = setns(ns_fd, CLONE_NEWNS);
assert(rc == 0); /* setns */
printf("after setns\n");
cat("/tmp/PRIVATE");
while (1) {
sleep(1);
}
return 0;
}
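For reference, a way to build and exercise a test like this (the file name, <PID>, and privilege assumptions are mine, not from the thread; /tmp/PRIVATE is expected to have different contents in the two mount namespaces):
$ cc -pthread -o setns-test setns-test.c
$ ln -s /proc/<PID>/ns/mnt ns    # <PID> = a process already in the target mount namespace
$ sudo ./setns-test              # joining another mount namespace generally requires root (CAP_SYS_ADMIN)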
I noticed that we also have ldms_xprt_term(int sec) that terminates zap threads. I'll try it out as well.
FYI, I tried the following:
diff --git a/ldms/src/sampler/spank/slurm_notifier.c b/ldms/src/sampler/spank/slurm_notifier.c
index d0d196d..cd4ac9d 100644
--- a/ldms/src/sampler/spank/slurm_notifier.c
+++ b/ldms/src/sampler/spank/slurm_notifier.c
@@ -594,6 +594,7 @@ static int send_event(int argc, char *argv[], jbuf_t jb)
}
pthread_mutex_unlock(&client->wait_lock);
}
+ ldms_xprt_term(0);
return 0;
}
It didn't work; as @eric-roman pointed out, we still have other threads hanging around. I'll try killing off those threads in ldms_xprt_term() and see if it will work.
Where does the above program fail? The call to setns()?
-- Thomas Tucker, President, Open Grid Computing, Inc.
Correct, with errno being EINVAL.
I would like to mention that it does succeed if I comment out pthread_create().
The setns() system call allows the calling thread to move into different namespaces. The fd argument is one of the following:
• a file descriptor referring to one of the magic links in a /proc/[pid]/ns/ directory (or a bind mount to such a link);
• a PID file descriptor (see pidfd_open(2), https://man7.org/linux/man-pages/man2/pidfd_open.2.html).
The nstype argument is interpreted differently in each case.
Could you please try the same using one of the supported fd types per the above? The one you provided is to a file in the current working directory.
Thanks, Tom
The file is a soft link to /proc/<PID>/ns/mnt of the other process. The other process creates the soft link for convenience (or I have to specify the PID every time I run). I did try /proc/<PID>/ns/mnt directly and the result is the same.
Do we know which event is preceded by the call to setns()?
-- Thomas Tucker, President, Open Grid Computing, Inc.
Sorry, I mean precedes. Which event occurs before the call to setns() by slurm? The reason I'm asking is that we can defer these events and then deliver them after the setns() has occurred.
-- Thomas Tucker, President, Open Grid Computing, Inc.
Not sure... I'm digging in the slurm source tree.
You may be able to infer it by looking at the events that were delivered before the job failed.
-- Thomas Tucker, President, Open Grid Computing, Inc.
EINVAL: The caller is multithreaded and tried to join a new user namespace.
So I'm guessing that the setns() call is made after the job changes from root to the uid. So it seems we may have to use this deferral mechanism to deliver all events after the setns() has occurred.
-- Thomas Tucker, President, Open Grid Computing, Inc.
According to slurmd.log (debug2), setns() was called in the task process (forked from slurmstepd) while in privileged mode. I think it failed because slurmstepd had already loaded slurm_notifier before forking the task processes. Then, after the task process is forked, zap.c:init_atfork() and zap_sock.c:init_once() are called again in the task process (by pthread_atfork() registration) to create the zap event threads and the zap_sock io thread, respectively. Then setns() is called and fails due to the multi-threading. If we can move the re-initialization of zap and zap_sock to after the setns(), I think it would work. Maybe introduce ldms_after_fork() and zap_after_fork() functions so that the application can control when to do the re-initialization after fork (and remove the pthread_atfork() registration)?
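To make that concrete, here is a rough sketch (my own illustration; ldms_after_fork()/zap_after_fork() do not exist yet, and this is not the actual zap code) of the difference between the current pthread_atfork() child handler and an explicit after-fork call that the child would only invoke once setns() has succeeded:
#include <pthread.h>
#include <unistd.h>

static pthread_t event_thread;

static void *event_proc(void *arg)
{
	while (1)
		pause(); /* stand-in for the zap event loop */
	return NULL;
}

/* Current scheme: the child handler runs immediately after every fork(),
 * so the task process is multithreaded before slurm calls setns(). */
static void atfork_child(void)
{
	pthread_create(&event_thread, NULL, event_proc, NULL);
}

__attribute__((constructor))
static void lib_init(void)
{
	pthread_atfork(NULL, NULL, atfork_child);
}

/* Proposed scheme (hypothetical name): drop the pthread_atfork()
 * registration and let the application re-create the threads itself,
 * after it has changed namespaces:
 *     setns(ns_fd, CLONE_NEWNS);  // still single-threaded, succeeds
 *     zap_after_fork();           // threads come back only now
 */
void zap_after_fork(void)
{
	pthread_create(&event_thread, NULL, event_proc, NULL);
}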
Log snippet about the task:
[2021-08-25T09:09:02.856] [15.0] starting 1 tasks
[2021-08-25T09:09:02.857] [15.0] task 0 (12064) started 2021-08-25T09:09:02
[2021-08-25T09:09:02.858] [15.0] debug: task/none: task_p_pre_launch_priv: task_p_pre_launch_priv: StepId=15.0
[2021-08-25T09:09:02.858] [15.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
[2021-08-25T09:09:02.858] [15.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory
[2021-08-25T09:09:02.858] [15.0] debug: Sending launch resp rc=0
[2021-08-25T09:09:02.858] [15.0] debug: job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2021-08-25T09:09:02.859] [15.0] error: container_p_join: setns failed for /slurm-tmp/15/.ns: Invalid argument
[2021-08-25T09:09:02.859] [15.0] error: container_g_join failed: 15
[2021-08-25T09:09:02.860] [15.0] debug2: (12064) __del__:940 Unloading slurm_notifier
[2021-08-25T09:09:02.861] [15.0] task 0 (12064) exited with exit code 1.
Job full log as a reference:
[2021-08-25T09:09:02.827] [15.0] debug2: setup for a launch_task
[2021-08-25T09:09:02.827] [15.0] debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2021-08-25T09:09:02.827] [15.0] debug2: profile signaling type Task
[2021-08-25T09:09:02.827] [15.0] debug: Message thread started pid = 12050
[2021-08-25T09:09:02.827] [15.0] debug2: hwloc_topology_init
[2021-08-25T09:09:02.828] [15.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurm/d/hwloc_topo_whole.xml) found
[2021-08-25T09:09:02.828] [15.0] debug: CPUs:4 Boards:1 Sockets:1 CoresPerSocket:4 ThreadsPerCore:1
[2021-08-25T09:09:02.828] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2021-08-25T09:09:02.830] [15.0] debug: task/none: init: task NONE plugin loaded
[2021-08-25T09:09:02.830] [15.0] cred/munge: init: Munge credential signature plugin loaded
[2021-08-25T09:09:02.830] [15.0] debug: job_container/tmpfs: init: job_container tmpfs plugin loaded
[2021-08-25T09:09:02.830] [15.0] debug: mpi type = pmix
[2021-08-25T09:09:02.832] [15.0] debug: mpi/pmix_v1: init: PMIx plugin loaded
[2021-08-25T09:09:02.832] [15.0] debug2: Before call to spank_init()
[2021-08-25T09:09:02.832] [15.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2021-08-25T09:09:02.834] [15.0] debug2: (12050) __init__:933 Loading slurm_notifier
[2021-08-25T09:09:02.834] [15.0] debug: spank: /etc/slurm/plugstack.conf:3: Loaded plugin libslurm_notifier.so
[2021-08-25T09:09:02.834] [15.0] debug2: (12050) __context_str:659 make_step_init_data: cygnus-01.ogc.int(slurmstepd),15,0
[2021-08-25T09:09:02.834] [15.0] debug2: (12050) make_step_init_data:722 SUBSCRIBER_DATA '{}'.
[2021-08-25T09:09:02.835] [15.0] debug2: (12050) slurm_spank_init:853 {"schema":"slurm_step_data","event":"step_init","timestamp":1629900542,"context":"cygnus-01.ogc.int(slurmstepd),15,0","data":{"subscriber_data":{},"job_name":"interactive","job_user":"narate","job_id":15,"nodeid":0,"step_id":0,"alloc_mb":0,"ncpus":4,"nnodes":1,"local_tasks":1,"total_tasks":1 }}
[2021-08-25T09:09:02.835] [15.0] debug2: (12050) setup_clients:442 timeout (null) io_timeout 5
[2021-08-25T09:09:02.835] [15.0] debug2: spank: libslurm_notifier.so: init = 0
[2021-08-25T09:09:02.835] [15.0] debug2: After call to spank_init()
[2021-08-25T09:09:02.835] [15.0] debug: mpi type = pmix
[2021-08-25T09:09:02.835] [15.0] debug: mpi/pmix_v1: p_mpi_hook_slurmstepd_prefork: (null) [0]: mpi_pmix.c:160: start
[2021-08-25T09:09:02.835] [15.0] debug: mpi/pmix_v1: pmixp_info_nspace_usock: setup sockets
[2021-08-25T09:09:02.850] [15.0] debug: mpi/pmix_v1: pmixp_libpmix_job_set: cygnus-01 [0]: pmixp_client.c:702: task initialization
[2021-08-25T09:09:02.851] [15.0] debug: mpi/pmix_v1: _agent_thread: cygnus-01 [0]: pmixp_agent.c:274: Start agent thread
[2021-08-25T09:09:02.851] [15.0] debug: mpi/pmix_v1: pmixp_agent_start: cygnus-01 [0]: pmixp_agent.c:426: agent thread started: tid = 47842848020224
[2021-08-25T09:09:02.851] [15.0] debug: mpi/pmix_v1: pmixp_agent_start: cygnus-01 [0]: pmixp_agent.c:431: timer thread started: tid = 47843047573248
[2021-08-25T09:09:02.851] [15.0] debug: mpi/pmix_v1: _pmix_timer_thread: cygnus-01 [0]: pmixp_agent.c:312: Start timer thread
[2021-08-25T09:09:02.851] [15.0] debug2: hwloc_topology_load
[2021-08-25T09:09:02.854] [15.0] debug2: hwloc_topology_export_xml
[2021-08-25T09:09:02.854] [15.0] debug2: Entering _setup_normal_io
[2021-08-25T09:09:02.855] [15.0] debug2: io_init_msg_write_to_fd: entering
[2021-08-25T09:09:02.855] [15.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 0
[2021-08-25T09:09:02.855] [15.0] debug2: io_init_msg_write_to_fd: leaving
[2021-08-25T09:09:02.855] [15.0] debug2: Leaving _setup_normal_io
[2021-08-25T09:09:02.855] [15.0] debug: IO handler started pid=12050
[2021-08-25T09:09:02.856] [15.0] debug level is 'error'
[2021-08-25T09:09:02.856] [15.0] starting 1 tasks
[2021-08-25T09:09:02.857] [15.0] task 0 (12064) started 2021-08-25T09:09:02
[2021-08-25T09:09:02.858] [15.0] debug: task/none: task_p_pre_launch_priv: task_p_pre_launch_priv: StepId=15.0
[2021-08-25T09:09:02.858] [15.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
[2021-08-25T09:09:02.858] [15.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory
[2021-08-25T09:09:02.858] [15.0] debug: Sending launch resp rc=0
[2021-08-25T09:09:02.858] [15.0] debug: job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2021-08-25T09:09:02.859] [15.0] error: container_p_join: setns failed for /slurm-tmp/15/.ns: Invalid argument
[2021-08-25T09:09:02.859] [15.0] error: container_g_join failed: 15
[2021-08-25T09:09:02.860] [15.0] debug2: (12064) __del__:940 Unloading slurm_notifier
[2021-08-25T09:09:02.861] [15.0] task 0 (12064) exited with exit code 1.
[2021-08-25T09:09:02.862] [15.0] debug2: (12050) __context_str:659 make_task_exit_data: cygnus-01.ogc.int(slurmstepd),15,0
[2021-08-25T09:09:02.862] [15.0] debug2: (12050) slurm_spank_task_exit:896 {"schema":"slurm_task_data","event":"task_exit","timestamp":1629900542,"context":"cygnus-01.ogc.int(slurmstepd),15,0","data":{"job_id":15,"step_id":0,"task_id":0,"task_global_id":0,"task_pid":12064,"nodeid":0,"task_exit_status":256 }}
[2021-08-25T09:09:02.862] [15.0] debug2: (12050) setup_clients:442 timeout (null) io_timeout 5
[2021-08-25T09:09:02.862] [15.0] debug2: spank: libslurm_notifier.so: task_exit = 0
[2021-08-25T09:09:02.862] [15.0] debug: task/none: task_p_post_term: task_p_post_term: StepId=15.0, task 0
[2021-08-25T09:09:02.862] [15.0] debug2: step_terminate_monitor will run for 60 secs
[2021-08-25T09:09:02.862] [15.0] debug: step_terminate_monitor_stop signaling condition
[2021-08-25T09:09:02.862] [15.0] debug2: step_terminate_monitor is stopping
[2021-08-25T09:09:02.862] [15.0] debug2: switch/none: switch_p_job_postfini: Sending SIGKILL to pgid 12050
[2021-08-25T09:09:02.863] [15.0] debug: Waiting for IO
[2021-08-25T09:09:02.863] [15.0] debug: Closing debug channel
[2021-08-25T09:09:02.863] [15.0] debug: IO handler exited, rc=0
[2021-08-25T09:09:02.863] [15.0] debug2: Aggregated 1 task exit messages
[2021-08-25T09:09:02.864] [15.0] debug2: Before call to spank_fini()
[2021-08-25T09:09:02.864] [15.0] debug2: (12050) __context_str:659 make_step_exit_data: cygnus-01.ogc.int(slurmstepd),15,0
[2021-08-25T09:09:02.864] [15.0] debug2: (12050) slurm_spank_exit:912 {"schema":"slurm_step_data","event":"step_exit","timestamp":1629900542,"context":"cygnus-01.ogc.int(slurmstepd),15,0","data":{"job_id":15,"nodeid":0,"step_id":0 }}
[2021-08-25T09:09:02.864] [15.0] debug2: (12050) setup_clients:442 timeout (null) io_timeout 5
[2021-08-25T09:09:02.864] [15.0] debug2: spank: libslurm_notifier.so: exit = 0
[2021-08-25T09:09:02.864] [15.0] debug2: (12050) __del__:940 Unloading slurm_notifier
[2021-08-25T09:09:02.864] [15.0] debug2: After call to spank_fini()
[2021-08-25T09:09:02.864] [15.0] debug2: Rank 0 has no children slurmstepd
[2021-08-25T09:09:02.864] [15.0] debug2: _one_step_complete_msg: first=0, last=0
[2021-08-25T09:09:02.866] [15.0] debug2: false, shutdown
[2021-08-25T09:09:02.866] [15.0] debug: Message thread exited
[2021-08-25T09:09:02.866] [15.0] debug: mpi/pmix_v1: fini: cygnus-01 [0]: mpi_pmix.c:148: cygnus-01: call fini()
[2021-08-25T09:09:02.866] [15.0] debug: mpi/pmix_v1: _conn_readable: cygnus-01 [0]: pmixp_agent.c:101: false, shutdown
[2021-08-25T09:09:02.866] [15.0] debug: mpi/pmix_v1: _conn_readable: cygnus-01 [0]: pmixp_agent.c:101: false, shutdown
[2021-08-25T09:09:02.866] [15.0] debug: mpi/pmix_v1: _conn_readable: cygnus-01 [0]: pmixp_agent.c:101: false, shutdown
[2021-08-25T09:09:02.866] [15.0] debug: mpi/pmix_v1: _agent_thread: cygnus-01 [0]: pmixp_agent.c:302: agent thread exit
[2021-08-25T09:09:02.867] [15.0] debug: mpi/pmix_v1: pmixp_stepd_finalize: cygnus-01 [0]: pmixp_server.c:478: Remove PMIx plugin usock
[2021-08-25T09:09:02.867] [15.0] debug: mpi/pmix_v1: pmixp_info_nspace_usock: setup sockets
[2021-08-25T09:09:02.867] [15.0] done with job
@tom95858 @eric-roman please see #780. I've tested that patch with job_container.
Cool. I took a quick look at the MR and it sounds like you've got something that will work. I expect to give it a try by the end of the week.
I did want to ask: in that MR, do you need atomic operations to manage delete_thread_initialized, or is a suitable lock held while it is accessed?
@eric-roman that's a good point. Thanks for pointing that out. I should put a lock around the initialization, since it could be racy when the new process creates endpoints from many threads.
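A minimal sketch of the idea (hypothetical, not the actual code in #780): guard the lazy initialization with a mutex so concurrent endpoint creation in the new process cannot race. The names delete_thread_initialized and start_delete_thread() just mirror the discussion above.
#include <pthread.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int delete_thread_initialized = 0;

static void start_delete_thread(void); /* assumed to be defined elsewhere */

static void ensure_delete_thread(void)
{
	pthread_mutex_lock(&init_lock);
	if (!delete_thread_initialized) {
		start_delete_thread();
		delete_thread_initialized = 1;
	}
	pthread_mutex_unlock(&init_lock);
}
pthread_once() would work just as well for this.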
@eric-roman an independent system service to identify new PIDs is included in #758, which is decoupled from slurm (it follows kernel netlink events instead). It would be interesting to know if it works in your environment independent of containers. It generates messages to ldmsd intended to be compatible with those generated by the slurm plugin. There is a related linux_proc_sampler plugin also intended to be compatible with the slurm plugin messages.
@eric-roman is this resolved? Could you please confirm?
Last I tested, this issue had been fixed. OK to close.