
hwloc version with openmpi

jbalma opened this issue 5 years ago • 10 comments

Using the default spack environment and installation as described here: https://lbann.readthedocs.io/en/latest/building_lbann.html

$ cd lbann/spack_environments/users/llnl_lc/x86_64_cuda/
$ spack install
$ spack env loads
$ source ./loads
$ spack install lbann +gpu +nccl

==> Error: An unsatisfiable version constraint has been detected for spec:

hwloc@2.0.2%…~cairo~cuda~gl+libxml2~nvml+pci+shared arch=linux-rhel7-ivybridge

while trying to concretize the partial spec:

openmpi@…%…~cuda+cxx_exceptions fabrics=none ~java~legacylaunchers~memchecker~pmi schedulers=none ~sqlite3~thread_multiple+vt arch=linux-rhel7-ivybridge
    ^numactl
        ^autoconf
            ^m4@…%…+sigsegv arch=linux-rhel7-ivybridge
                ^libsigsegv
            ^perl@…%…+cpanm+shared+threads arch=linux-rhel7-ivybridge
                ^gdbm@…%… arch=linux-rhel7-ivybridge
                    ^readline@…%… arch=linux-rhel7-ivybridge
                        ^ncurses@…%…~symlinks~termlib arch=linux-rhel7-ivybridge
                            ^pkgconf@…%… arch=linux-rhel7-ivybridge
        ^automake
        ^libtool

openmpi requires hwloc version :1.999, but spec asked for 2.0.2

To load this environment, type: source ./loads

Sounds like this is an issue with openmpi, maybe? Is there any way to change the spec to use mpich or mvapich instead?

Is there any workaround to get the default spec working without building from source?

jbalma avatar Feb 06 '20 19:02 jbalma

I have noticed this, too. That has to be a Spack bug because Open-MPI doesn't have a version 1.999 in its package file (spack edit openmpi).
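
For a quick sanity check of which versions Spack actually knows about, the standard commands are:

$ spack versions hwloc    # list all hwloc versions known to Spack
$ spack edit openmpi      # open the Open-MPI recipe to inspect its constraints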

We have an in-progress PR to overhaul the Spack stuff, #1414. It's still being tested but it might do better here; I'm not sure.

Unless you're using the OSX externals file, I don't see where Open-MPI comes into the spec. Can you attach/post the complete concretization?

benson31 avatar Feb 06 '20 19:02 benson31

We have a PR on Spack to fix this: https://github.com/spack/spack/pull/14760/files

bvanessen avatar Feb 06 '20 20:02 bvanessen

Thanks @benson31 @bvanessen, I was able to resolve the issue.

Long story short: I'm new to Spack and needed to follow the directions from https://spack.readthedocs.io/en/latest/getting_started.html#spack-on-cray.

After setting up these two files to use modules instead of paths, things started working:

~/.spack/cray/packages.yaml
~/.spack/cray/compilers.yaml
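
For anyone else hitting this on a Cray, here's a rough sketch of the shape of those two files. The module names, versions, and arch string below are made-up placeholders, and the exact schema varies between Spack releases, so check the Spack-on-Cray docs for your version:

# ~/.spack/cray/packages.yaml
packages:
  mpich:
    modules:
      mpich@7.7.6 arch=cray-cnl7-ivybridge: cray-mpich/7.7.6
    buildable: false

# ~/.spack/cray/compilers.yaml
compilers:
- compiler:
    spec: gcc@8.3.0
    modules: [PrgEnv-gnu, gcc/8.3.0]
    operating_system: cnl
    paths:
      cc: cc
      cxx: CC
      f77: ftn
      fc: ftn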

I'm in the process of building up a repo to make a reproducible set of build scripts for Cray XC and vanilla cluster environments. Will update here when that's ready.

Thanks for the help.

-Jake

jbalma avatar Feb 07 '20 23:02 jbalma

I'm getting this issue with hwloc instead:

https://github.com/spack/spack/issues/7938#issuecomment-584387960

den-run-ai avatar Feb 10 '20 22:02 den-run-ai

Oh, so the problem is that the version recommended by Spack (1.999) does not actually exist :(

So this seems to work: spack install lbann ^hwloc@…
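
As an aside, you can dry-run the concretization to test a pin like that before committing to a build; for example, with a real 1.x release such as 1.11.11:

$ spack spec lbann ^hwloc@1.11.11    # prints the concretized spec without installing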

den-run-ai avatar Feb 10 '20 22:02 den-run-ai

I made an error in my above comment -- there is no version 1.999 of hwloc, not in Spack-land, not IRL. You can see the list of known hwloc versions here, and the erroneous line of the Open-MPI Spack recipe here.

AFAIK, there's no reason to prefer OMPI over MPICH or MVAPICH2. You might try tweaking things to use those instead, or just spack edit openmpi and remove the offending line.
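
For example (untested sketch; assuming the MVAPICH2 recipe concretizes cleanly on your system):

$ spack install lbann +gpu +nccl ^mvapich2    # build against MVAPICH2 instead of Open-MPI
$ spack edit openmpi                          # or hand-edit the recipe and drop the hwloc@:1.999 line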

benson31 avatar Feb 10 '20 22:02 benson31

@benson31 hwloc 1.999 is also mentioned in the error message

den-run-ai avatar Feb 10 '20 22:02 den-run-ai

So this error message is auto-generated by Spack:

==> Cannot find version 1.999 in url_list
==> Error: FetchError: All fetchers failed for spack-stage-hwloc-1.999-cej27e2n5nthw3m7gjisqbnmqjmgk6sb

What it's telling you is that 1.999 is not a valid version of hwloc. Then this message:

openmpi requires hwloc version :1.999, but spec asked for 2.1.0

is also from Spack, and that's because of depends_on('hwloc@:1.999') here. If neither of those, which error message do you mean?
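
For reference, the offending constraint in the recipe has roughly this shape (illustrative; the exact line and any when= conditions may differ):

# var/spack/repos/builtin/packages/openmpi/package.py (illustrative fragment)
depends_on('hwloc@:1.999')    # "@:1.999" means any version up to 1.999; the
                              # concretizer can pick the literal 1.999, which
                              # does not exist upstream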

benson31 avatar Feb 10 '20 22:02 benson31

@jbalma Were you trying to install on Cori or Cori-GPU? Also, can you try the updated instructions to see if they simplify life?

bvanessen avatar Feb 11 '20 22:02 bvanessen

@bvanessen - I'm trying to build on an internal XC. It should map to Cori (but probably not Cori-GPU; I'm working on another general cluster script that should work there). I'll try getting through the build there with the updated instructions and post the results here.

jbalma avatar Feb 12 '20 17:02 jbalma