
Multiple servers start up and execute as independent single-task jobs

Open adammoody opened this issue 4 years ago • 20 comments

Hit a corner case that took a while to debug. Since I like to run under totalview, I still prefer to launch the servers by hand. That looks something like this hacky thing:

procs=$SLURM_NNODES
srun -n $procs -N $procs touch /var/tmp/unifyfs.conf
srun -n $procs -N $procs mkdir /dev/shm/unifyfs

export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf
export UNIFYFS_SERVER_LOCAL_EXTENTS=0
export UNIFYFS_SHAREDFS_DIR=/p/lustre2/user1
export UNIFYFS_DAEMONIZE=off

totalview srun -a -n $procs -N $procs `pwd`/bin/unifyfsd

However, in this case, I apparently built without PMI support, and I had forgotten to create a servers hostfile. In that situation each server initializes glb_pmi_rank = 0 and glb_pmi_size = 1. During the run, each server thinks it is server rank 0, and so it generates key/value pairs using that information.

Then I ran a job with two nodes, one server per node, and one client per node. Each client wrote one extent, and the extents do not overlap. Each server inserted one key/value pair into MDHIM while both assumed they were server rank=0. When reading the data back out, a server ends up reading back both keys from MDHIM successfully, but then it thinks both extents are local, since delegator_rank=0 in both key/value pairs. This results in a file that is the correct size, but the read returns "corrupted" data.
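
For context, a rough sketch of why that produces corrupted reads (the struct and helper below are illustrative only, not the actual UnifyFS definitions): each extent's key/value pair records the rank of the server that owns the data, so when every server reports rank 0, a reader cannot tell a remote extent from one of its own.

    #include <stddef.h>

    /* Illustrative sketch only -- not the actual UnifyFS structures or names. */
    struct extent_meta {
        int delegator_rank;   /* rank of the server that owns this extent's data */
        size_t length;        /* extent length in bytes */
        size_t log_offset;    /* offset within the owning server's local log */
    };

    /* With every server reporting glb_pmi_rank == 0, this test is true for
     * every extent, so data owned by the other server is fetched from the
     * wrong (local) log and the read returns "corrupted" bytes. */
    static int extent_is_local(const struct extent_meta* e, int my_rank)
    {
        return e->delegator_rank == my_rank;
    }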

Not sure what to do about it yet, but I wanted to document the case to capture the details. Maybe we should make the server hostfile a required setting when PMI isn't being used?
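
One possible shape for that guard, as a minimal sketch (the function and variable names below are hypothetical, not existing UnifyFS code): if the build has no PMI support and no server hostfile was given, fail fast instead of silently falling back to rank 0 / size 1.

    #include <stdio.h>

    /* Hypothetical startup check -- names are illustrative, not UnifyFS APIs. */
    static int resolve_server_rank(int have_pmi, const char* hostfile_path)
    {
        if (have_pmi) {
            return 0;  /* glb_pmi_rank / glb_pmi_size already set via PMI/PMIx */
        }
        if (hostfile_path == NULL) {
            fprintf(stderr, "built without PMI support: "
                    "UNIFYFS_SERVER_HOSTFILE must be set\n");
            return -1; /* refuse to start rather than default to rank 0, size 1 */
        }
        /* otherwise derive glb_pmi_rank and glb_pmi_size from the hostfile */
        return 0;
    }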

Also, what happens if PMI is enabled but then one sets a hostfile?

I did verify things worked again when I created the servers hostfile, which I did with this other hack:

export UNIFYFS_SERVER_HOSTFILE=/p/lustre2/user1/unifyfs_server_hosts
rm -f $UNIFYFS_SERVER_HOSTFILE
echo $SLURM_NNODES > $UNIFYFS_SERVER_HOSTFILE
srun -n $SLURM_NNODES -N $SLURM_NNODES /bin/hostname >> $UNIFYFS_SERVER_HOSTFILE

adammoody avatar Jun 29 '20 19:06 adammoody

I am not sure if my problem is related. In my case (an x86 testbed cluster with Slurm), glb_pmi_rank and glb_pmi_size seem to be set correctly (the server_pid file is successfully published), but each mdhim instance is initialized with mdhim_rank=0 and mdhim_comm_size=1. I do not see this on Summit/Summitdev.

sandrain avatar Jul 01 '20 14:07 sandrain

@sandrain , I'm guessing that might happen if the MPI launch is not working as expected. The server processes don't see the correct environment for the MPI ranks to connect to each other. What is being used to launch the job?

adammoody avatar Jul 01 '20 16:07 adammoody

@adammoody I use unifyfs start to launch servers within a Slurm cluster that I configured. I've been launching them like that for a while, but I've only recently started seeing this problem. I probably lack an understanding of how mdhim is initialized; how does MPI_COMM_WORLD work in mdhim when we launch the server without MPI?

sandrain avatar Jul 01 '20 16:07 sandrain

@sandrain When an MPI program is launched standalone (i.e., not by an MPI launcher), the program will see rank=0 and a COMM_WORLD of size 1. Is your SLURM properly configured to launch MPI jobs? The unifyfs utility assumes that srun is configured properly to run MPI apps.

MichaelBrim avatar Jul 01 '20 17:07 MichaelBrim

If you are using openmpi on your cluster, you need to have configured it to have slurm compatibility. See https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

MichaelBrim avatar Jul 01 '20 18:07 MichaelBrim

@MichaelBrim I think my Slurm launches the job correctly (I am using MPICH 3.2):

(unifyfs-batch) [root@rage17 rage]$ env | grep SLURM
SLURM_NODELIST=rage[17-22]
SLURM_JOB_NAME=job.sh
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=6
SLURM_JOBID=36
SLURM_TASKS_PER_NODE=32(x6)
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_JOB_ID=36
SLURM_SUBMIT_DIR=/autofs/nccs-svm1_techint/home/hs2/projects/UnifyFS/__run/rage
SLURM_JOB_NODELIST=rage[17-22]
SLURM_JOB_CPUS_PER_NODE=32(x6)
SLURM_CLUSTER_NAME=rage
SLURM_SUBMIT_HOST=rage17
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=6
(unifyfs-batch) [root@rage17 rage]$ srun hostname
rage17
rage20
rage22
rage18
rage19
rage21
(unifyfs-batch) [root@rage17 rage]$ mpirun -ppn 2 hostname | sort
rage17
rage17
rage18
rage18
rage19
rage19
rage20
rage20
rage21
rage21
rage22
rage22

What I am not sure about (a purely technical question) is how mdhim initializes the MPI communicator when unifyfsd is running without MPI (!UNIFYFSD_USE_MPI).

https://github.com/LLNL/UnifyFS/blob/4998aabf1450f925bffc2b4f901645a67322bbf4/server/src/unifyfs_metadata.c#L93

In the meta_init_store function, who is responsible for providing a valid MPI_COMM_WORLD for mdhim to initialize with when UNIFYFSD_USE_MPI is not defined?

    int rc, ratio;
    MPI_Comm comm = MPI_COMM_WORLD;
    size_t path_len;
    long svr_ratio, range_sz;
    struct stat ss;
    char db_path[UNIFYFS_MAX_FILENAME] = {0};

    if (cfg == NULL) {
        return -1;
    }

    /** setting up options **/
    /* ... (db_opts setup elided from this excerpt) ... */

    md = mdhimInit(&comm, db_opts);

sandrain avatar Jul 01 '20 18:07 sandrain

Right, the other thing to consider is to tell SLURM to use the correct startup protocol for MPI. You can change this with an srun --mpi option, as in srun --mpi=pmi2. You can set the default in /etc/slurm/slurm.conf with a MpiDefault=pmi2 entry.

adammoody avatar Jul 01 '20 18:07 adammoody

MDHIM initializes MPI itself if needed: https://github.com/LLNL/UnifyFS/blob/4ac866bd8befb8fecd357bddf7817836be76efda/meta/src/mdhim.c#L104

adammoody avatar Jul 01 '20 18:07 adammoody

MDHIM will initialize MPI if it has not been done yet; see meta/src/mdhim.c:98. MPI_COMM_WORLD is a predefined global communicator that becomes usable once MPI_Init() has run.
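
Roughly paraphrased, that check looks something like the sketch below (a paraphrase for clarity, not the exact mdhim source):

    #include <mpi.h>

    /* Sketch of mdhimInit()'s behavior: reuse MPI if the caller already
     * initialized it, otherwise initialize it here. */
    static int mdhim_ensure_mpi(void)
    {
        int flag = 0;
        int provided = 0;

        if (MPI_Initialized(&flag) != MPI_SUCCESS) {
            return -1;
        }
        if (!flag) {
            /* the caller did not initialize MPI, so mdhim does it itself */
            if (MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided)
                != MPI_SUCCESS) {
                return -1;
            }
        }
        return 0;
    }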

MichaelBrim avatar Jul 01 '20 18:07 MichaelBrim

Based on the code, we could pass NULL to mdhimInit() for the communicator. It will use MPI_COMM_WORLD (after init) if no communicator is provided.
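
A minimal sketch of what that would look like in meta_init_store (paraphrasing the excerpt above; not a tested patch):

    /* Pass NULL so mdhim falls back to MPI_COMM_WORLD after it has
     * initialized MPI itself, instead of receiving a communicator from a
     * process that never called MPI_Init. */
    md = mdhimInit(NULL, db_opts);

That would keep the MPI setup entirely inside mdhim on builds where UNIFYFSD_USE_MPI is not defined.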

MichaelBrim avatar Jul 01 '20 18:07 MichaelBrim

@sandrain , to check that your srun and mpich are talking to each other, you can run a simple MPI job where each process prints its rank.

adammoody avatar Jul 01 '20 18:07 adammoody

Here is the test (to check whether Slurm is configured correctly):

(unifyfs-batch) [root@rage17 test]$ cat mpi.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int size;
    int len;
    char processor[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(NULL, NULL);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor, &len);

    printf("[%02d/%02d:%s] hello!\n", rank, size, processor);

    MPI_Finalize();
}
(unifyfs-batch) [root@rage17 test]$ mpicc -o mpitest mpi.c                                     
(unifyfs-batch) [root@rage17 test]$ which mpicc
/usr/lib64/mpich-3.2/bin/mpicc
(unifyfs-batch) [root@rage17 test]$ which mpirun
/usr/lib64/mpich-3.2/bin/mpirun
(unifyfs-batch) [root@rage17 test]$ mpirun -ppn 2 ./mpitest | sort
[00/12:rage17] hello!
[01/12:rage17] hello!
[02/12:rage18] hello!
[03/12:rage18] hello!
[04/12:rage19] hello!
[05/12:rage19] hello!
[06/12:rage20] hello!
[07/12:rage20] hello!
[08/12:rage21] hello!
[09/12:rage21] hello!
[10/12:rage22] hello!
[11/12:rage22] hello!
(unifyfs-batch) [root@rage17 test]$ srun --ntasks=12 --ntasks-per-node=2 ./mpitest | sort
[00/12:rage17] hello!
[01/12:rage17] hello!
[02/12:rage18] hello!
[03/12:rage18] hello!
[04/12:rage19] hello!
[05/12:rage19] hello!
[06/12:rage20] hello!
[07/12:rage20] hello!
[08/12:rage21] hello!
[09/12:rage21] hello!
[10/12:rage22] hello!
[11/12:rage22] hello!
(unifyfs-batch) [root@rage17 test]$ env | grep SLURM
SLURM_NODELIST=rage[17-22]
SLURM_JOB_NAME=job.sh
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=6
SLURM_JOBID=37
SLURM_TASKS_PER_NODE=32(x6)
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_JOB_ID=37
SLURM_SUBMIT_DIR=/autofs/nccs-svm1_techint/home/hs2/projects/UnifyFS/__run/rage
SLURM_JOB_NODELIST=rage[17-22]
SLURM_JOB_CPUS_PER_NODE=32(x6)
SLURM_CLUSTER_NAME=rage
SLURM_SUBMIT_HOST=rage17
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=6

Slurm is configured with MpiDefault=pmi2. Do you see anything suspicious?

sandrain avatar Jul 01 '20 18:07 sandrain

Ok, it looks like slurm and mpich are good to go then.

adammoody avatar Jul 01 '20 19:07 adammoody

Based on the code, we could pass NULL to mdhimInit() for the communicator. It will use MPI_COMM_WORLD (after init) if no communicator is provided.

@MichaelBrim's suggestion should work. We probably should not be passing a communicator to mdhim anyway if we have not initialized MPI from within UnifyFS.

adammoody avatar Jul 01 '20 19:07 adammoody

I haven't figured it out yet, but here is what I've found so far:

Under my Slurm environment, I can launch the servers manually (without using the unifyfs start utility), but only with mpirun and by manually creating a hostfile (like @adammoody does). Without the hostfile, I observe the same problem: each server thinks it is the only server, with rank 0.

Also, I cannot launch the server with srun at all; it dies while executing mdhimInit. This happens even when I launch only a single server:

(unifyfs-batch) [root@rage17 rage]$ srun -N 1  --ntasks=1 --ntasks-per-node=1 /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/bin/unifyfsd -S /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/unifyfsd -H $UNIFYFS_SERVER_HOSTFILE && gdb --pid=$(pidof unifyfsd)
Attaching to process 17709
[New LWP 17713]
[New LWP 17717]
[New LWP 17718]
[New LWP 17719]
[New LWP 17720]
[New LWP 17721]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb599cd6de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 leveldb-1.12.0-11.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgfortran-4.8.5-39.el7.x86_64 libquadmath-4.8.5-39.el7.x86_64 libselinux-2.5-14.1
(gdb) bt
#0  0x00007fb599cd6de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fb59a151452 in pool_pop_timedwait ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#2  0x00007fb59a1525bd in sched_run ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#3  0x00007fb59a1483c5 in ABTI_xstream_schedule ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#4  0x00007fb59a14f87b in ABTD_thread_func_wrapper_sched ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#5  0x00007fb59a14fdb1 in make_fcontext ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#6  0x0000000000000000 in ?? ()
(gdb) b mdhimInit
Breakpoint 1 at 0x44c1de: file ../../../../../meta/src/mdhim.c, line 74.
(gdb) c
Continuing.
[New Thread 0x7fb58effd700 (LWP 17733)]

Thread 1 "unifyfsd" hit Breakpoint 1, mdhimInit (appComm=0x7fff9e974bec, opts=0x2292870)
    at ../../../../../meta/src/mdhim.c:74
74              int ret = 0;
(gdb) n
80              if (!opts) {
(gdb) 
94              ret = mlog_open((char *)"mdhim", 0,
(gdb) 
98              if ((ret = MPI_Initialized(&flag)) != MPI_SUCCESS) {
(gdb) 
102             if (!flag) {
(gdb) 
103             fprintf(stderr, "initializing mpi\n");
(gdb) 
105                     ret = MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
(gdb) 

Thread 1 "unifyfsd" received signal SIGPIPE, Broken pipe.
0x00007fb599cd96fd in write () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fb599cd96fd in write () from /lib64/libpthread.so.0
#1  0x00007fb59b147210 in PMIU_writeline () from /usr/lib64/mpich-3.2/lib/libmpi.so.12
#2  0x00007fb59b148034 in PMII_getmaxes.constprop.3 ()
   from /usr/lib64/mpich-3.2/lib/libmpi.so.12
#3  0x00007fb59b148485 in PMI_Init () from /usr/lib64/mpich-3.2/lib/libmpi.so.12
#4  0x00007fb59b10605a in MPID_Init () from /usr/lib64/mpich-3.2/lib/libmpi.so.12
#5  0x00007fb59b06a3ec in MPIR_Init_thread () from /usr/lib64/mpich-3.2/lib/libmpi.so.12
#6  0x00007fb59b06a736 in PMPI_Init_thread () from /usr/lib64/mpich-3.2/lib/libmpi.so.12
#7  0x000000000044c346 in mdhimInit (appComm=0x7fff9e974bec, opts=0x2292870)
    at ../../../../../meta/src/mdhim.c:105
#8  0x0000000000410a40 in meta_init_store (cfg=0x6720a0 <server_cfg>)
    at ../../../../../server/src/unifyfs_metadata.c:150
#9  0x00000000004078bd in main (argc=5, argv=0x7fff9e975288)
    at ../../../../../server/src/unifyfs_server.c:380
(gdb) 

This happens regardless of the communicator we pass to mdhim, i.e., the uninitialized MPI_COMM_WORLD or NULL. I will look into it more tomorrow.

sandrain avatar Jul 01 '20 21:07 sandrain

Kind of looks like SLURM might be disconnecting from the process during the PMI exchange. You mentioned you're using PMI2 with SLURM. Maybe double-check that you built UnifyFS to also use PMI2. If UnifyFS is trying to talk PMIx on the wire but SLURM is expecting PMI2, that could lead to this kind of behavior.

adammoody avatar Jul 01 '20 21:07 adammoody

SLURM might also support --mpi=pmix, but I have less experience with that.

adammoody avatar Jul 01 '20 21:07 adammoody

Oh wait, I'm confusing myself. In this case, it would be mdhim going through MPICH with MPICH talking to SLURM. So forget the above part.

I just remembered another problem. SLURM's PMI2 does not expect a job to call it twice. If the UnifyFS servers are going through SLURM PMI2 to exchange addresses, and then mdhim is calling MPI_Init, which calls SLURM's PMI2 again, I think SLURM bails out.

In that case, you might have better luck switching everything to PMIx. Or avoid calling PMI from UnifyFS, so that MPI_Init works when mdhim calls it. Or fall back to bootstrapping the unifyfs servers with MPI, in which case mdhim will not try to initialize MPI again.
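
For that last option, a rough sketch of the idea (hypothetical, not existing UnifyFS code): if unifyfsd initializes MPI itself during bootstrap, mdhimInit's MPI_Initialized check finds MPI already up and never issues a second PMI handshake through SLURM.

    #include <mpi.h>

    /* Hypothetical bootstrap path: initialize MPI in unifyfsd before
     * meta_init_store()/mdhimInit() run, so mdhim sees MPI already
     * initialized and skips its own MPI_Init_thread(). */
    static int bootstrap_with_mpi(int* argc, char*** argv, int* rank, int* size)
    {
        int provided = 0;
        if (MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided)
            != MPI_SUCCESS) {
            return -1;
        }
        MPI_Comm_rank(MPI_COMM_WORLD, rank);  /* the glb_pmi_rank equivalent */
        MPI_Comm_size(MPI_COMM_WORLD, size);  /* the glb_pmi_size equivalent */
        return 0;
    }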

adammoody avatar Jul 01 '20 21:07 adammoody

I also notice in your stack trace above that MPICH is calling PMI_Init instead of PMI2_Init. However, if I remember correctly, SLURM's PMI2 plugin supports backwards compatibility, so it can also speak the PMI-1 protocol.

adammoody avatar Jul 01 '20 22:07 adammoody

@adammoody Thanks for the suggestions. I've had no luck with any approach so far. There is probably something going on with my testbed. One weird thing is that it was working fine and then this problem suddenly appeared. I will try to set up the environment again and see if the problem persists.

sandrain avatar Jul 02 '20 17:07 sandrain