TCLB
[internal] New execution interface and development plan
As things are getting hard to maintain, I propose a change of the overall approach to configuration and execution of TCLB.
These are notes on a possible direction of development:
- Integrate TCLB_cluster into the main repo
- Make a common wrapper script for running all calculations
- The model selection would be based on something like <CLBConfig model="...">
- It would run both native and slurm
- It would run code coupling (e.g. ESYS/TCLB) based on best practices
- It could generate a batch script (without submitting it to the queue)
- Make machine-specific alterations to the configuration in a mechanism similar to the one used in TCLB_cluster
- Allow for easier execution of tests on different machines/architectures
- Make configuration more persistent (now you have to remember the ./configure options that you used on a specific machine; cmake has a similar mechanism and can be considered as a solution)
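To make the idea concrete, here is a minimal sketch of what such a common wrapper could do. The name `tclb_run`, the `--batch-only` flag, and the batch-script layout are all assumptions for illustration, not an existing TCLB interface:

```shell
#!/bin/bash
# Hypothetical sketch of the proposed common wrapper.
# The function name, flags and batch-script contents are assumptions.

gen_batch() {   # print a slurm batch script for model $1 and case file $2
    cat <<EOF
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
mpirun CLB/$1/main $2
EOF
}

tclb_run() {    # tclb_run [--batch-only] <model> <case.xml>
    local batch_only=no
    if [ "$1" = "--batch-only" ]; then batch_only=yes; shift; fi
    local model="$1" case_xml="$2"
    if [ "$batch_only" = yes ]; then
        gen_batch "$model" "$case_xml"           # generate without submitting
    elif command -v sbatch >/dev/null 2>&1; then
        gen_batch "$model" "$case_xml" | sbatch  # slurm machine: submit to queue
    else
        mpirun "CLB/$model/main" "$case_xml"     # native run
    fi
}
```

Something like `tclb_run --batch-only d2q9 example/flow/2d/karman.xml` would then print the generated batch script for inspection.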
The plan would be to:
- Close V6.3 including DEM and all nice proposed features (#186 #198)
- Begin V6.4 or V7.0 with #157 and maybe #198 + #125
- Make the wrappers for configuration and execution, and do #124 at the same time.
I would vote for the outer layer to be done in python/bash/ruby, as those are designed for it. It would also make integration into something "bigger" easier.
I've successfully created a docker.io image with a test suite for TCLB: https://cloud.docker.com/repository/registry-1.docker.io/mdzik/tclb_testenv
so docker pull mdzik/tclb_testenv should work
I have made some docker images here: https://hub.docker.com/r/cfdgo/tclb If we want to make docker an option, we should do it properly. It's not easy, as in docker there is no multiple inheritance. This means that one cannot make something that inherits from both (for example) the cuda image and the R image. I did a configuration which works with the official nvidia cuda image some time ago, but I don't know if it is very useful. It doesn't work with GPU on windows, and to work with GPU on linux you have to have a non-standard installation of docker.
And on a side note, none of the clusters work with docker. So I think docker is generally good, but only for testing. But testing GPU will still be a problem.
The dockerfiles are at feature/docker branch.
Multiple inheritance is (I think) handled by docker-compose. Using docker images might be useful if one intends to do something inside Amazon AWS, but I only intend to create an environment for local test evaluation, hence the travis inheritance. I would like to finally pull the csf model, but tests are holding me back ;)
(nvidia-docker is workable with AWS)
I've done some digging
One of the "proper" ways to distribute could be a singularity image (https://sylabs.io/singularity/). It is somewhat close to docker, but designed with scientific software in mind.
- it is supported on Prometheus; it seems that it supports nvidia/cuda
- it is supported (and encouraged) at the ICM
- it could be nice to have an integrated TCLB+*MPI+FlyingBalls package, with sorted-out dependencies and a preconfigured env inside, built elsewhere
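As a sketch, such a package could be described by a Singularity definition file along these lines. The base image, package list, paths, and build commands here are assumptions, not a tested recipe:

```
Bootstrap: docker
From: ubuntu:18.04

%post
    # assumed dependencies: toolchain, MPI and R (package names are illustrative)
    apt-get update && apt-get install -y \
        build-essential git openmpi-bin libopenmpi-dev r-base
    git clone https://github.com/CFD-GO/TCLB.git /opt/TCLB
    cd /opt/TCLB && make configure && ./configure && make d2q9

%runscript
    exec /opt/TCLB/CLB/d2q9/main "$@"
```

With a `%runscript` like this, `singularity run tclb.sif case.xml` would map directly onto the compiled model binary.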
questions:
- GPU performance (I could check that on K20@Cyfronet and V100@ICM)
- MPMD performance ?
- how does it help? :)
- other?
@mdzik Could you make a proof-of-concept Singularity container with the TCLB code? Let's say: OpenMPI+GCC, no fancy stuff. Let's start without CUDA and progress from there.
I'll try
The goal is to have (as far as I can tell about the syntax):
$ module load singularity
$ singularity pull somehost/tclb.sif
$ singularity run tclb-gcc.sif d2q9 ~/test/karman.xml
Nice summary: https://tin6150.github.io/psg/blogger_container_hpc.html
Hey, I've got this:
singularity pull library://mdzik/tclb/tclb:latest
singularity verify ./tclb_latest.sif
Then, to use the built-in TCLB:
singularity exec --nv ./tclb_latest.sif /opt/TCLB_gpu/TCLB/CLB/d2q9/main /opt/TCLB_gpu/TCLB/example/flow/2d/karman.xml
Or, to use the in-container shell, with all R tools etc., to build a local TCLB:
singularity shell --nv ./tclb_latest.sif
How do we proceed? :)
@mdzik Looks more and more viable. Did you test it on Prometheus? Could these images be built on Travis-CI and published?
What's the route to test it on my computer?
Ad 1: not yet, but singularity is there, so it should be OK. Ad 2: yes, but it might need some scripting (some of them are available from singularity).
There are doubts whether singularity handles the high-performance Open MPI point-to-point messaging module in the right way.
@mdzik please check how it runs without singularity.
singularity run-wrapper: https://github.com/CFD-GO/TCLB_cluster/pull/6
$ p/run d2q9 example/flow/2d/karman.xml
$ cat slurm-6800.out
###### Nodes: #######
rysy-n1.icm.edu.pl
###### Loading modules #######
###### --------------- #######
Executing command:
singularity exec --nv /home/ggruszcz/TCLB/tclb_latest.sif /home/ggruszcz/TCLB/CLB/d2q9/main example/flow/2d/karman.xml
###### --------------- #######
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[51802,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: rysy-n1
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
MPMD: TCLB: local:0/1 work:0/1 --- connected to:
[ ] #### : -------------------------------------------------------------------------
[ ] #### : - CLB version: v6.0-beta-1645-g39d9bf8 -
[ ] #### : - Model: d2q9 -
[ ] #### : -------------------------------------------------------------------------
[ ] #### : Setting output path to: karman
[ 0] ---- : Selecting device 0/1
[ 0] warning ! No "Units" element in config file
[ ] ==== : Mesh size in config file: 1024x100x1
[ ] ---- : Global lattice size: 1024x100x1
[ ] ==== : Max region size: 102400. Mesh size 102400. Overhead: 0%
[ ] ---- : Local lattice size: 1024x100x1
[ ] ---- : Threads | Action
[ ] ---- : 32x16 | Primal , NoGlobals , BaseIteration
[ ] ---- : 32x16 | Tangent , NoGlobals , BaseIteration
[ ] ---- : 32x16 | Optimize , NoGlobals , BaseIteration
[ ] ---- : 32x16 | SteadyAdjoint , NoGlobals , BaseIteration
[ ] ---- : 32x16 | Primal , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | Tangent , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | Optimize , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | SteadyAdjoint , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | Primal , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | Tangent , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | Optimize , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | SteadyAdjoint , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | Primal , NoGlobals , BaseInit
[ ] ---- : 32x16 | Tangent , NoGlobals , BaseInit
[ ] ---- : 32x16 | Optimize , NoGlobals , BaseInit
[ ] ---- : 32x16 | SteadyAdjoint , NoGlobals , BaseInit
[ ] ---- : 32x16 | Primal , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | Tangent , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | Optimize , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | SteadyAdjoint , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | Primal , OnlyObjective , BaseInit
[ ] ---- : 32x16 | Tangent , OnlyObjective , BaseInit
[ ] ---- : 32x16 | Optimize , OnlyObjective , BaseInit
[ ] ---- : 32x16 | SteadyAdjoint , OnlyObjective , BaseInit
[ ] #### : [0] Cumulative allocation of 14853696 b (14.9 MB)
[ ] ---- : Creating geom size:102400
[ ] #### : Setting output path to: karman
[ ] #### : Setting output path to: output/karman
[ ] ---- : loading geometry ...
[ ] ---- : Setting number of zones to 3
[ ] ---- : Setting VelocityX in zone (-1) to 0.01 (0.010000)
[ ] ---- : Setting Viscosity to 0.02 (0.020000)
[ ] ---- : [0] Settings [viscosity] to 0.020000
[ ] ---- : [0] Settings [one over relaxation time] to 1.785714
[ ] ---- : [0] Settings [MRT Sx] to -0.785714
[ 0] WARNING ! Unknown setting Smag
[ 0] WARNING ! Unknown setting PressDiffInObj
[ 0] WARNING ! Unknown setting EOSScale
[ 0] WARNING ! Unknown setting Tension
[ 0] WARNING ! Unknown setting Coriolis
[ 0] WARNING ! Unknown setting SolidAlfa
[ 0] WARNING ! Unknown setting FluidAlfa
[ 0] WARNING ! Unknown setting InitTemperature
[ 0] WARNING ! Unknown setting InletTemperature
[ ] ---- : Initializing Lattice ...
[ ] 0.9 MLBUps 0.13 GB/s [====================] 0s
[ ] ---- : Setting callback VTK at 1000.000000 iterations
[ ] ---- : Adding VTK to the solver hands
[ ] ---- : Setting action Solve at 10000.000000 iterations
[ ] 473.4 MLBUps 69.11 GB/s [ ] 0s
[ ] ---- : 596.9 MLBUps 87.15 GB/s [====================]
[ ] ---- : 1000 it writing vtk
[ ] ---- : 299.8 MLBUps 43.78 GB/s [====================]
[ ] ---- : 2000 it writing vtk
[ ] ---- : 335.2 MLBUps 48.94 GB/s [====================]
[ ] ---- : 3000 it writing vtk
[ ] ---- : 318.2 MLBUps 46.45 GB/s [====================]
[ ] ---- : 4000 it writing vtk
[ ] ---- : 467.0 MLBUps 68.18 GB/s [====================]
[ ] ---- : 5000 it writing vtk
[ ] ---- : 499.3 MLBUps 72.89 GB/s [====================]
[ ] ---- : 6000 it writing vtk
[ ] ---- : 501.6 MLBUps 73.23 GB/s [====================]
[ ] ---- : 7000 it writing vtk
[ ] ---- : 498.5 MLBUps 72.78 GB/s [====================]
[ ] ---- : 8000 it writing vtk
[ ] ---- : 503.3 MLBUps 73.48 GB/s [====================]
[ ] ---- : 9000 it writing vtk
[ ] ---- : 500.5 MLBUps 73.08 GB/s [====================]
[ ] ---- : 10000 it writing vtk
[ ] ---- : Total duration: 2.323629 s = 0.038727 min = 0.000645 h
@mdzik @ggruszczynski we should probably talk about this.
I created a repo: CFD-GO/TCLB_docker, which uses travis-ci to build docker images and upload them to dockerhub.
You can easily pull docker images to singularity.
It turns out that for mpi to work properly in singularity you need to match the host and container versions of openmpi. That is why the repo builds images for different versions.
The images include only the environment for TCLB, not the actual code. The idea is that you pull them with singularity or docker, and then inside of them clone TCLB (your fork/branch/etc) and compile.
Important: the MPI versions have to match, and you have to run mpirun first and singularity second. Otherwise you will not be using the host MPI library.
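As a sketch, the difference between the two orderings looks like this (paths and process counts are illustrative; the first variant uses the host MPI to spawn one container per rank):

```
# host MPI launches the ranks; each rank runs inside its own container (correct)
mpirun -np 8 singularity exec --nv tclb.sif CLB/d2q9/main karman.xml

# container MPI launches the ranks inside a single container (fails across nodes)
singularity exec --nv tclb.sif mpirun -np 8 CLB/d2q9/main karman.xml
```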
The test results I've got now are weird. The container version has bad performance across multiple nodes, but before some small changes in the compilation process the speed was alright, so it's inconclusive for now.
I'll try to run more tests on more clusters today and tomorrow.
For CUDA you need the --nv flag:
mpirun singularity exec --nv tclb.sif TCLB/CLB/d2q9/main file.xml
@ggruszczynski - this is an infiniband failure; there must be some version mismatch or ABI incompatibility.
@llaniewski The Docker image should contain the apt-get part; conversion from Docker to Singularity is not 100% failsafe.
# Rysy - hardware info:
# CPU type: Intel Skylake
# GPU type: NVIDIA Volta
# No of nodes: 6
# No of cores per node: 36
# No of GPUs per node: 4
# CPU Memory per node: 380 GB
Have a look at the "Executing command:" lines below.
CASE I
ggruszcz@rysy ~/TCLB $ p/run d2q9 example/flow/2d/karman.xml 4
Trying to run example/flow/2d/karman.xml with d2q9 model on 4 (mpi)processes/gpus. Job details:
CORES=NODESxTASKS_PER_NODExCORES_PER_TASK: 4 = 1 x 4 x 1
TOTAL_CPU_MEMORY=MEMORY_PER_CORExCORES: 20gb = 5gb x 4
Submitted batch job 6863
ggruszcz@rysy ~/TCLB $ cat slurm-6836.out
###### Nodes: #######
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
###### Loading modules #######
###### --------------- #######
Executing command:
singularity exec --nv /home/ggruszcz/TCLB/tclb_latest.sif mpirun /home/ggruszcz/TCLB/CLB/d2q9/main example/flow/2d/karman.xml
###### --------------- #######
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[49546,1],3]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: rysy-n1
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
MPMD: TCLB: local:0/4 work:0/4 --- connected to:
MPMD: TCLB: local:1/4 work:1/4 --- connected to:
MPMD: TCLB: local:2/4 work:2/4 --- connected to:
MPMD: TCLB: local:3/4 work:3/4 --- connected to:
[ ] #### : -------------------------------------------------------------------------
[ ] #### : - CLB version: v6.0-beta-1645-g39d9bf8 -
[ ] #### : - Model: d2q9 -
[ ] #### : -------------------------------------------------------------------------
[ ] #### : Setting output path to: karman
[ 1] ---- : Selecting device 1/4
[ 2] ---- : Selecting device 2/4
[ 0] ---- : Selecting device 0/4
[ 3] ---- : Selecting device 3/4
[ 3] warning ! No "Units" element in config file
[ 0] warning ! No "Units" element in config file
[ ] ==== : Mesh size in config file: 1024x100x1
[ ] ---- : Global lattice size: 1024x100x1
[ 2] warning ! No "Units" element in config file
[ 1] warning ! No "Units" element in config file
[ ] ==== : Max region size: 25600. Mesh size 102400. Overhead: 0%
[ ] ---- : Local lattice size: 1024x25x1
[ ] ---- : Threads | Action
[ ] ---- : 32x16 | Primal , NoGlobals , BaseIteration
[ ] ---- : 32x16 | Tangent , NoGlobals , BaseIteration
[ ] ---- : 32x16 | Optimize , NoGlobals , BaseIteration
[ ] ---- : 32x16 | SteadyAdjoint , NoGlobals , BaseIteration
[ ] ---- : 32x16 | Primal , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | Tangent , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | Optimize , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | SteadyAdjoint , IntegrateGlobals , BaseIteration
[ ] ---- : 32x16 | Primal , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | Tangent , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | Optimize , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | SteadyAdjoint , OnlyObjective , BaseIteration
[ ] ---- : 32x16 | Primal , NoGlobals , BaseInit
[ ] ---- : 32x16 | Tangent , NoGlobals , BaseInit
[ ] ---- : 32x16 | Optimize , NoGlobals , BaseInit
[ ] ---- : 32x16 | SteadyAdjoint , NoGlobals , BaseInit
[ ] ---- : 32x16 | Primal , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | Tangent , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | Optimize , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | SteadyAdjoint , IntegrateGlobals , BaseInit
[ ] ---- : 32x16 | Primal , OnlyObjective , BaseInit
[ ] ---- : 32x16 | Tangent , OnlyObjective , BaseInit
[ ] ---- : 32x16 | Optimize , OnlyObjective , BaseInit
[ ] ---- : 32x16 | SteadyAdjoint , OnlyObjective , BaseInit
[ ] #### : [0] Cumulative allocation of 3787328 b (3.8 MB)
[ ] ---- : Creating geom size:25600
[ ] #### : Setting output path to: karman
[ ] #### : Setting output path to: output/karman
[ ] ---- : loading geometry ...
[ ] ---- : Setting number of zones to 3
[ ] ---- : Setting VelocityX in zone (-1) to 0.01 (0.010000)
[ ] ---- : Setting Viscosity to 0.02 (0.020000)
[ ] ---- : [0] Settings [viscosity] to 0.020000
[ ] ---- : [0] Settings [one over relaxation time] to 1.785714
[ ] ---- : [0] Settings [MRT Sx] to -0.785714
[ 0] WARNING ! Unknown setting Smag
[ 0] WARNING ! Unknown setting PressDiffInObj
[ 0] WARNING ! Unknown setting EOSScale
[ 0] WARNING ! Unknown setting Tension
[ 0] WARNING ! Unknown setting Coriolis
[ 0] WARNING ! Unknown setting SolidAlfa
[ 0] WARNING ! Unknown setting FluidAlfa
[ 0] WARNING ! Unknown setting InitTemperature
[ 0] WARNING ! Unknown setting InletTemperature
[ ] ---- : Initializing Lattice ...
[ 1] WARNING ! Unknown setting Smag
[ 1] WARNING ! Unknown setting PressDiffInObj
[ 1] WARNING ! Unknown setting EOSScale
[ 1] WARNING ! Unknown setting Tension
[ 1] WARNING ! Unknown setting Coriolis
[ 1] WARNING ! Unknown setting SolidAlfa
[ 1] WARNING ! Unknown setting FluidAlfa
[ 1] WARNING ! Unknown setting InitTemperature
[ 1] WARNING ! Unknown setting InletTemperature
[ 2] WARNING ! Unknown setting Smag
[ 2] WARNING ! Unknown setting PressDiffInObj
[ 2] WARNING ! Unknown setting EOSScale
[ 2] WARNING ! Unknown setting Tension
[ 2] WARNING ! Unknown setting Coriolis
[ 2] WARNING ! Unknown setting SolidAlfa
[ 2] WARNING ! Unknown setting FluidAlfa
[ 2] WARNING ! Unknown setting InitTemperature
[ 2] WARNING ! Unknown setting InletTemperature
[ 3] WARNING ! Unknown setting Smag
[ 3] WARNING ! Unknown setting PressDiffInObj
[ 3] WARNING ! Unknown setting EOSScale
[ 3] WARNING ! Unknown setting Tension
[ 3] WARNING ! Unknown setting Coriolis
[ 3] WARNING ! Unknown setting SolidAlfa
[ 3] WARNING ! Unknown setting FluidAlfa
[ 3] WARNING ! Unknown setting InitTemperature
[ 3] WARNING ! Unknown setting InletTemperature
[ ] ---- : Setting callback VTK at 1000.000000 iterations
[ ] ---- : Adding VTK to the solver hands
[ ] ---- : Setting action Solve at 10000.000000 iterations
[ ] ---- : 567.2 MLBUps 82.82 GB/s [====================]
[ ] ---- : 1000 it writing vtk
[ ] ---- : 459.4 MLBUps 67.08 GB/s [====================]
[ ] ---- : 2000 it writing vtk
[ ] ---- : 583.0 MLBUps 85.12 GB/s [====================]
[ ] ---- : 3000 it writing vtk
[ ] ---- : 577.5 MLBUps 84.32 GB/s [====================]
[ ] ---- : 4000 it writing vtk
[ ] ---- : 465.0 MLBUps 67.90 GB/s [====================]
[ ] ---- : 5000 it writing vtk
[ ] ---- : 586.0 MLBUps 85.55 GB/s [====================]
[ ] ---- : 6000 it writing vtk
[ ] ---- : 464.1 MLBUps 67.76 GB/s [====================]
[ ] ---- : 7000 it writing vtk
[ ] ---- : 584.5 MLBUps 85.34 GB/s [====================]
[ ] ---- : 8000 it writing vtk
[ ] ---- : 573.1 MLBUps 83.67 GB/s [====================]
[ ] ---- : 9000 it writing vtk
[ ] ---- : 587.1 MLBUps 85.71 GB/s [====================]
[ ] ---- : 10000 it writing vtk
[ ] ---- : Total duration: 2.141297 s = 0.035688 min = 0.000595 h
[rysy-n1.icm.edu.pl:56877] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[rysy-n1.icm.edu.pl:56877] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
CASE II
ggruszcz@rysy ~/TCLB $ p/run d2q9 example/flow/2d/karman.xml 8
Trying to run example/flow/2d/karman.xml with d2q9 model on 8 (mpi)processes/gpus. Job details:
CORES=NODESxTASKS_PER_NODExCORES_PER_TASK: 8 = 2 x 4 x 1
TOTAL_CPU_MEMORY=MEMORY_PER_CORExCORES: 40gb = 5gb x 8
Submitted batch job 6865
ggruszcz@rysy ~/TCLB $ cat slurm-6865.out
###### Nodes: #######
rysy-n1.icm.edu.pl
rysy-n2.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n2.icm.edu.pl
rysy-n2.icm.edu.pl
rysy-n2.icm.edu.pl
###### Loading modules #######
###### --------------- #######
Executing command:
singularity exec --nv /home/ggruszcz/TCLB/tclb_latest.sif mpirun /home/ggruszcz/TCLB/CLB/d2q9/main example/flow/2d/karman.xml
###### --------------- #######
[rysy-n1.icm.edu.pl:108073] [[47503,0],0] ORTE_ERROR_LOG: Not found in file plm_slurm_module.c at line 420
Conclusion: if you run singularity first and mpirun second (i.e. mpirun inside the singularity container), then it is not possible to run the job on more than 1 node.
The message below does not appear when the job runs on 1 process.
[rysy-n1.icm.edu.pl:56877] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[rysy-n1.icm.edu.pl:56877] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
btl:no-nics - no network interfaces (memory transport btl). @llaniewski Have you succeeded on Prometheus via IB?
@mdzik I'm currently testing on a different cluster. But it looks like the MPI in the container has to match exactly the MPI on the host, including OpenIB etc., which does not look very practical. I don't even know how to install all this stuff (OpenIB, pmi2, ucx) that the host mpi is compiled with.
There is something called ABI compatibility, so those dependencies could be dynamically loaded - I will investigate.