UnifyFS
                                
                        ERROR: failed to write shared server hostfile
Hi.
I installed the latest develop version of unifyfs using spack.
When I try to start the unifyfs server daemon using the following command:
unifyfs start --share-dir=/home/[my user]/[some dir] &
I get the following error:
ERROR: no supported resource manager detected
I tried it both with the slurm that spack installed alongside unifyfs and with my default slurm installation.
The command I used to install unifyfs is:
spack install unifyfs +auto-mount +fortran +hdf5 +pmi ^openmpi +cxx_exceptions fabrics=verbs +legacylaunchers +pmi schedulers=slurm +thread_multiple +vt
and am using [email protected] and the [email protected] compiler.
Note that I installed unifyfs with an openmpi built with slurm and fabrics support.
Here is the list of loaded spack packages I used, with their variants:
spack find --loaded --variants
==> 44 installed packages
-- linux-centos7-x86_64 / [email protected] -----------------------------
[email protected]~debug~valgrind
bmi@develop
[email protected]+atomic+chrono~clanglibcpp~container~context~coroutine cxxstd=98 +date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout visibility=hidden +wave
[email protected]+shared
[email protected]~darwinssl~gssapi~libssh~libssh2~nghttp2
[email protected]+libbsd
[email protected] build_type=RelWithDebInfo +shared
[email protected]
[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz
[email protected]~libmount patches=c325997b72a205ad1638bb3e3ba0e5b73e3d32ce63b2d0d3282f3e3a2ff4663c tracing=none
[email protected] build_type=RelWithDebInfo ~test
[email protected]~cxx~debug~fortran~hl+mpi+pic+shared~szip~threadsafe
[email protected]~cairo~cuda~gl+libxml2~nvml+pci+shared
[email protected]
[email protected] build_type=RelWithDebInfo +shared
[email protected]
[email protected] fabrics=sockets,tcp,udp ~kdreg
[email protected]
[email protected]
[email protected] patches=b185b1ebaea7f8ae74d58c828eb9008cff7c21431b6041aa0de072cb797c77a8
[email protected]
[email protected]
[email protected]
[email protected]~python
[email protected]
[email protected]+shared
[email protected]+bmi+boostsys build_type=RelWithDebInfo ~cci~mpi+ofi patches=34fc95b3599c74a8cece6e873cfdc8bc0afe2dc0deabb6e2d11ea2a93f0cebf5 +selfforward+shared+sm~udreg+verbose
[email protected]
[email protected]~symlinks+termlib
[email protected]
[email protected]~cuda+cxx_exceptions fabrics=verbs ~java+legacylaunchers~memchecker+pmi schedulers=slurm ~sqlite3+thread_multiple+vt
[email protected]+systemcerts
[email protected]~jit+multibyte+utf
[email protected]+cpanm+shared+threads
[email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib
rdma-core@20 build_type=RelWithDebInfo
[email protected]
slurm@18-08-0-1~gtk~hdf5~hwloc~mariadb~pmix+readline
[email protected] build_type=RelWithDebInfo patches=c9cfecb1f7a623418590cf4e00ae7d308d1c3faeb15046c2e5090e38221da7cd +pic+shared
[email protected]~column_metadata+fts~functions~rtree
[email protected]
unifyfs@develop+auto-mount+fortran+hdf5+pmi~pmix
[email protected]
[email protected]+optimize+pic+shared
System information
| Type | Version/Name | 
|---|---|
| Operating System | Centos 7 | 
| OS Version | Linux 3.10.0-327.22.2.el7.x86_64 | 
| Architecture | x86-64 | 
| UnifyFS Version | develop | 
As always, your help is very much appreciated :)
Ok so it was my bad, I forgot to allocate resources using slurm.
This error message is a bit misleading. I think it would be more informative if unifyfs printed an error stating that there is no current allocation in such cases.
Anyway, now I get the following error:
ERROR: failed to write shared server hostfile
What should I do now? Is there any way to get more information from unifyfs?
@Mosseridan Are you running with multiple server nodes, or a single node? The server shared directory path needs to be an existing directory on a file system shared across all server nodes.
Hi @MichaelBrim, I tried running it on both a single node and multiple nodes. The server shared directory I used is under my home directory, which is accessible to all nodes in my allocation.
> This error message is a bit misleading. I think it would be more informative if unifyfs printed an error stating that there is no current allocation in such cases.
Agreed, I'll add it to my list to make these error messages clearer.
@Mosseridan, I recreated your environment as closely as I could and so far have only been able to replicate this error in one condition: when passing a relative path to --share-dir or using ~ to represent my home directory.
For example:
export UNIFYFS_DAEMONIZE=off
unifyfs start -d --share-dir=~/somedir &
Results in:
## options from the command line ##
cleanup:	0
consistency:	LAMINATED
mountpoint:	(null)
script:	(null)
share_dir:	~/somedir
server:	(null)
stage_in:	(null)
stage_out:	(null)
## job allocation (2 nodes) ##
<node1 name>
<node2 name>
ERROR: failed to write shared server hostfile
I get past this when using the full path for the --share-dir option.
I'll keep looking into this, but wanted to verify that you're not using a relative path here?
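The debug output above shows share_dir arriving as the literal string ~/somedir, so the launcher ends up building a hostfile path that starts with ~. As a rough sketch of how such input could be normalized before use, the helper below expands a leading ~ from $HOME and then resolves the result to an absolute path with realpath(). The name resolve_share_dir is hypothetical and is not part of UnifyFS; this is only an illustration under those assumptions.

```c
#define _XOPEN_SOURCE 700 /* for realpath() and PATH_MAX */
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT part of UnifyFS: turn a user-supplied share
 * directory into an absolute path. A leading "~" or "~/" is expanded from
 * $HOME, then realpath() resolves relative components. realpath() also
 * fails if the directory does not exist.
 * Returns 0 on success or a negative errno value on failure. */
static int resolve_share_dir(const char* input, char* out, size_t out_len)
{
    assert(input != NULL && out != NULL);

    char expanded[PATH_MAX];
    int n;
    if (input[0] == '~' && (input[1] == '/' || input[1] == '\0')) {
        const char* home = getenv("HOME");
        if (home == NULL) {
            return -ENOENT;
        }
        /* input + 1 skips the '~' and keeps the rest of the path */
        n = snprintf(expanded, sizeof(expanded), "%s%s", home, input + 1);
    } else {
        n = snprintf(expanded, sizeof(expanded), "%s", input);
    }
    if (n < 0 || (size_t)n >= sizeof(expanded)) {
        return -ENAMETOOLONG;
    }

    char resolved[PATH_MAX];
    if (realpath(expanded, resolved) == NULL) {
        return -errno;
    }
    if (strlen(resolved) >= out_len) {
        return -ENAMETOOLONG;
    }
    strcpy(out, resolved);
    return 0;
}
```

Calling resolve_share_dir("~/somedir", buf, sizeof(buf)) would either fill buf with an absolute path or return a negative errno value such as -ENOENT when the directory is missing, which is more specific than the generic hostfile error.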
> What should I do now? Is there any way to get more information from unifyfs?

The -d option prints out a little more information with unifyfs start and unifyfs terminate, but that is not very helpful here.
Also, setting the UNIFYFS_LOG_DIR=<path> and UNIFYFS_LOG_VERBOSITY=5 environment variables can provide more detailed logs for the servers. However, this again might not help here, as the servers will not have attempted to start yet when the ERROR: failed to write shared server hostfile error is reached.
Hi. Yes I am using the full path to the shared directory. I have tried running with the following commands:
export UNIFYFS_LOG_VERBOSITY=5
export UNIFYFS_LOG_DIR=/home/idanmos/unifyfs_test/test_dir/log
unifyfs start -d --share-dir=/home/idanmos/unifyfs_test/test_dir/share_dir &
but, as you suggested, no log output was emitted.
The stdout emitted was:
## options from the command line ##
cleanup:	0
consistency:	LAMINATED
mountpoint:	(null)
script:	(null)
share_dir:	/home/idanmos/unifyfs_test/test_dir/share_dir
server:	(null)
stage_in:	(null)
stage_out:	(null)
[idanmos@usersrv2 test_dir]$ 
## job allocation (2 nodes) ##
node035
node036
ERROR: failed to write shared server hostfile
Is there any more useful information I could give you?
@Mosseridan There are really only two cases that should produce this error.
- The share directory is not provided (which doesn't apply since you do pass it)
- We fail to fopen("/path/to/share_dir/unifyfsd.hosts") for write access. This should only fail if either /path/to/share_dir does not exist, or you don't have write permission for the directory (which is doubtful for your home directory).
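The second case can be checked independently of UnifyFS, for example on each node of the allocation. The following is a minimal sketch, assuming a hypothetical helper named check_share_dir (not a UnifyFS function), that tests whether a path exists, is a directory, and is writable and searchable by the calling user, and reports the reason if not.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical pre-flight check, NOT part of UnifyFS.
 * Returns 0 if `dir` is an existing directory that the caller may write to
 * and search; otherwise prints the reason and returns a negative errno. */
static int check_share_dir(const char* dir)
{
    assert(dir != NULL);

    struct stat sb;
    if (stat(dir, &sb) != 0) {
        int err = errno; /* save errno before calling fprintf() */
        fprintf(stderr, "stat('%s') failed: %s\n", dir, strerror(err));
        return -err;
    }
    if (!S_ISDIR(sb.st_mode)) {
        fprintf(stderr, "'%s' exists but is not a directory\n", dir);
        return -ENOTDIR;
    }
    /* W_OK: may create files in dir; X_OK: may traverse into dir */
    if (access(dir, W_OK | X_OK) != 0) {
        int err = errno;
        fprintf(stderr, "no write/search access to '%s': %s\n",
                dir, strerror(err));
        return -err;
    }
    return 0;
}
```

A return value of 0 only means the basic checks passed. It does not rule out every cause of an fopen() failure (a full file system or an exceeded quota, for instance), so the strerror(errno) text from the failing fopen() itself remains the most direct evidence.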
This is what I figured. But the directory does exist and my user has write privileges for it. This is weird.
@Mosseridan, for some more clues on debugging this, can you please add the fprintf() call to the write_hostfile() function in util/unifyfs/src/unifyfs-rm.c as shown below, rebuild, and run again:
static int write_hostfile(unifyfs_resource_t* resource,
                          unifyfs_args_t* args)
{
    int ret = 0;
    size_t i;
    FILE* fp = NULL;
    char hostfile[UNIFYFS_MAX_FILENAME];
    if (NULL == args->share_dir) {
        return -EINVAL;
    }
    snprintf(hostfile, sizeof(hostfile), "%s/unifyfsd.hosts",
             args->share_dir);
    fp = fopen(hostfile, "w");
    if (!fp) {
        int err = errno; /* save errno; fprintf() may overwrite it */
        fprintf(stderr,
                "ERROR: failed to write shared server hostfile: '%s' (%s)\n",
                hostfile, strerror(err));
        return -err;
    }
This should print the path name it is trying to open as well as the error message text.