[internal] New execution interface and development plan

Open llaniewski opened this issue 6 years ago • 19 comments

As things are getting hard to maintain, I propose changing the overall approach to configuring and running TCLB.

These are notes on a possible direction of development:

  • Integrate TCLB_cluster into the main repo
  • Make a common wrapper script for running all calculations
    • The model selection would be based on something like <CLBConfig model="...">
    • It would support both native and Slurm execution
    • It would run the code coupling (e.g. ESYS/TCLB) based on best practices.
    • It could generate a batch script (without submitting it to the queue).
  • Make machine-specific alterations to the configuration through a mechanism similar to the one in TCLB_cluster
  • Allow for easier execution of tests on different machines/architectures
  • Make configuration more persistent (now you have to remember the ./configure options that you used on a specific machine); cmake has a similar mechanism and could be considered as a solution
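
A minimal sketch of what such a wrapper could look like, assuming the proposed model attribute on <CLBConfig> exists and that binaries live under CLB/<model>/main as elsewhere in this thread. The function names and the single #SBATCH line are hypothetical and only illustrate the idea of choosing the binary from the case file and rendering a batch script without submitting it.

```python
import xml.etree.ElementTree as ET


def read_model(config_xml):
    """Return the model attribute of the root <CLBConfig> element.

    The attribute is the one proposed in this issue; it is an assumption
    that case files would carry it.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "CLBConfig":
        raise ValueError("expected <CLBConfig>, got <%s>" % root.tag)
    model = root.get("model")
    if not model:
        raise ValueError("<CLBConfig> has no model attribute")
    return model


def native_command(model, case_path):
    """Build the command for a direct (native) run."""
    # CLB/<model>/main is the binary layout seen in the logs in this thread.
    return ["CLB/%s/main" % model, case_path]


def batch_script(model, case_path, tasks=1):
    """Render a Slurm batch script as text, without submitting it."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --ntasks=%d" % tasks,
        "mpirun " + " ".join(native_command(model, case_path)),
    ]
    return "\n".join(lines) + "\n"
```

Keeping the script generation separate from submission means the same text can be printed for inspection or piped to sbatch.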

The plan would be to:

  • Close V6.3 including DEM and all nice proposed features (#186 #198)
  • Begin V6.4 or V7.0 with #157 and maybe #198 + #125
  • Make the wrappers for configuration and execution and do #124 at the same time.

llaniewski avatar Sep 11 '19 07:09 llaniewski

I would vote for the outer layer to be done in python/bash/ruby, as those are designed for it. It would also make integration into something "bigger" easier?

I've successfully created a docker.io image with the test suite for TCLB: https://cloud.docker.com/repository/registry-1.docker.io/mdzik/tclb_testenv

so docker pull mdzik/tclb_testenv should work

mdzik avatar Sep 19 '19 10:09 mdzik

I made some docker images here: https://hub.docker.com/r/cfdgo/tclb. If we want to make docker an option, we should do it properly. It's not easy, as docker has no multiple inheritance. This means that one cannot make an image that inherits from both (for example) a cuda image and an R image. I made a configuration which works with the official nvidia cuda image some time ago, but I don't know if it is very useful. It doesn't work with GPU on windows, and to work with GPU on linux you need a non-standard installation of docker.

As a side note, none of the clusters work with docker. So I think docker is generally good, but only for testing. And testing on GPU will still be a problem.

llaniewski avatar Sep 19 '19 11:09 llaniewski

The dockerfiles are on the feature/docker branch.

llaniewski avatar Sep 19 '19 11:09 llaniewski

Multiple inheritance is (I think) handled by docker-compose. Docker images might be useful if one intends to do something inside Amazon AWS, but I only intend to create an environment for evaluating tests locally, hence the travis inheritance. I would like to finally pull the csf model, but tests are holding me back ;)

mdzik avatar Sep 19 '19 11:09 mdzik

(nvidia-docker is workable with AWS)

mdzik avatar Sep 19 '19 11:09 mdzik

I've done some digging

One of the "proper" ways to distribute could be a singularity image (https://sylabs.io/singularity/). It is fairly close to docker, but designed with scientific software in mind.

  • it is supported on Prometheus, and it seems to support nvidia/cuda
  • it is supported (and encouraged) at the ICM
  • it could be nice to have TCLB+*MPI+FlyingBalls integrated as a package, with dependencies sorted out, inside a preconfigured environment and built elsewhere

questions:

  • GPU performance (I could check that on K20@Cyfronet and V100@ICM)
  • MPMD performance ?
  • how does it help? :)
  • other?

mdzik avatar Nov 04 '19 10:11 mdzik

@mdzik Could you make a proof-of-concept Singularity container with TCLB code? Let's say: OpenMPI+GCC, no fancy stuff. Let's start without CUDA, and progress from that.

llaniewski avatar Nov 04 '19 11:11 llaniewski

I'll try

The goal is to have something like this (as far as I understand the syntax):

$ module load singularity
$ singularity pull somehost/tclb.sif
$ singularity run tclb-gcc.sif  d2q9 ~/test/karman.xml

mdzik avatar Nov 04 '19 11:11 mdzik

Nice summary: https://tin6150.github.io/psg/blogger_container_hpc.html

mdzik avatar Nov 19 '19 09:11 mdzik

Hey, I've got this:

singularity pull library://mdzik/tclb/tclb:latest
singularity verify ./tclb_latest.sif

then to use the built-in TCLB:

singularity exec --nv  ./tclb_latest.sif /opt/TCLB_gpu/TCLB/CLB/d2q9/main  /opt/TCLB_gpu/TCLB/example/flow/2d/karman.xml

or to use the in-container shell, with all Rtools etc., to build a local TCLB:

singularity shell --nv ./tclb_latest.sif

How do we proceed? :)

mdzik avatar Feb 18 '20 16:02 mdzik

@mdzik Looks more and more viable. Did you test it on prometheus? Could these images be built on Travis-CI and published?

What's the route to test it on my computer?

llaniewski avatar Feb 20 '20 03:02 llaniewski

Re 1 - not yet, but singularity is there, so it should be OK.
Re 2 - yes, but it might need some scripting (some scripts are available from singularity).

mdzik avatar Feb 20 '20 08:02 mdzik

There are doubts whether Singularity handles the high-performance Open MPI point-to-point messaging module correctly.

@mdzik please check how it runs without singularity.

singularity run-wrapper: https://github.com/CFD-GO/TCLB_cluster/pull/6

p/run d2q9 example/flow/2d/karman.xml
cat slurm-6800.out

###### Nodes:          #######
rysy-n1.icm.edu.pl
###### Loading modules #######
###### --------------- #######
Executing command:
singularity exec --nv /home/ggruszcz/TCLB/tclb_latest.sif /home/ggruszcz/TCLB/CLB/d2q9/main example/flow/2d/karman.xml

###### --------------- #######

libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[51802,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: rysy-n1

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
MPMD: TCLB: local:0/1 work:0/1 ---  connected to:
[  ]    #### : -------------------------------------------------------------------------
[  ]    #### : -  CLB version:   v6.0-beta-1645-g39d9bf8                               -
[  ]    #### : -        Model:                      d2q9                               -
[  ]    #### : -------------------------------------------------------------------------
[  ]    #### : Setting output path to: karman
[ 0]    ---- : Selecting device 0/1
[ 0] warning ! No "Units" element in config file
[  ]    ==== : Mesh size in config file: 1024x100x1
[  ]    ---- : Global lattice size: 1024x100x1
[  ]    ==== : Max region size: 102400. Mesh size 102400. Overhead:  0%
[  ]    ---- : Local lattice size: 1024x100x1
[  ]    ---- :   Threads  |      Action
[  ]    ---- :    32x16   | Primal , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | Tangent , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | Optimize , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | SteadyAdjoint , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | Primal , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | Tangent , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | Optimize , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | SteadyAdjoint , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | Primal , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | Tangent , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | Optimize , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | SteadyAdjoint , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | Primal , NoGlobals , BaseInit
[  ]    ---- :    32x16   | Tangent , NoGlobals , BaseInit
[  ]    ---- :    32x16   | Optimize , NoGlobals , BaseInit
[  ]    ---- :    32x16   | SteadyAdjoint , NoGlobals , BaseInit
[  ]    ---- :    32x16   | Primal , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | Tangent , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | Optimize , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | SteadyAdjoint , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | Primal , OnlyObjective , BaseInit
[  ]    ---- :    32x16   | Tangent , OnlyObjective , BaseInit
[  ]    ---- :    32x16   | Optimize , OnlyObjective , BaseInit
[  ]    ---- :    32x16   | SteadyAdjoint , OnlyObjective , BaseInit
[  ]    #### : [0] Cumulative allocation of 14853696 b (14.9 MB)
[  ]    ---- : Creating geom size:102400
[  ]    #### : Setting output path to: karman
[  ]    #### : Setting output path to: output/karman
[  ]    ---- : loading geometry ...
[  ]    ---- : Setting number of zones to 3
[  ]    ---- : Setting VelocityX in zone  (-1) to 0.01 (0.010000)
[  ]    ---- : Setting Viscosity to 0.02 (0.020000)
[  ]    ---- : [0] Settings [viscosity] to 0.020000
[  ]    ---- : [0] Settings [one over relaxation time] to 1.785714
[  ]    ---- : [0] Settings [MRT Sx] to -0.785714
[ 0] WARNING ! Unknown setting Smag
[ 0] WARNING ! Unknown setting PressDiffInObj
[ 0] WARNING ! Unknown setting EOSScale
[ 0] WARNING ! Unknown setting Tension
[ 0] WARNING ! Unknown setting Coriolis
[ 0] WARNING ! Unknown setting SolidAlfa
[ 0] WARNING ! Unknown setting FluidAlfa
[ 0] WARNING ! Unknown setting InitTemperature
[ 0] WARNING ! Unknown setting InletTemperature
[  ]    ---- : Initializing Lattice ...
[  ]      0.9 MLBUps      0.13 GB/s [====================]  0s
[  ]    ---- : Setting callback VTK at 1000.000000 iterations
[  ]    ---- : Adding VTK to the solver hands
[  ]    ---- : Setting action Solve at 10000.000000 iterations
[  ]    473.4 MLBUps     69.11 GB/s [                    ]  0s
[  ]    ---- :    596.9 MLBUps     87.15 GB/s [====================]
[  ]    ---- :     1000 it writing vtk
[  ]    ---- :    299.8 MLBUps     43.78 GB/s [====================]
[  ]    ---- :     2000 it writing vtk
[  ]    ---- :    335.2 MLBUps     48.94 GB/s [====================]
[  ]    ---- :     3000 it writing vtk
[  ]    ---- :    318.2 MLBUps     46.45 GB/s [====================]
[  ]    ---- :     4000 it writing vtk
[  ]    ---- :    467.0 MLBUps     68.18 GB/s [====================]
[  ]    ---- :     5000 it writing vtk
[  ]    ---- :    499.3 MLBUps     72.89 GB/s [====================]
[  ]    ---- :     6000 it writing vtk
[  ]    ---- :    501.6 MLBUps     73.23 GB/s [====================]
[  ]    ---- :     7000 it writing vtk
[  ]    ---- :    498.5 MLBUps     72.78 GB/s [====================]
[  ]    ---- :     8000 it writing vtk
[  ]    ---- :    503.3 MLBUps     73.48 GB/s [====================]
[  ]    ---- :     9000 it writing vtk
[  ]    ---- :    500.5 MLBUps     73.08 GB/s [====================]
[  ]    ---- :    10000 it writing vtk
[  ]    ---- : Total duration: 2.323629 s = 0.038727 min = 0.000645 h

ggruszczynski avatar Mar 18 '20 22:03 ggruszczynski

@mdzik @ggruszczynski we should probably talk on this subject.

I created a repo, CFD-GO/TCLB_docker, which uses travis-ci to build docker images and upload them to dockerhub.

You can easily pull docker images to singularity.

It turns out that for MPI to work properly in singularity, the host and container versions of openmpi need to match. That is why the repo builds images for different versions.
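
A minimal sketch of such a version check, assuming the first line of `mpirun --version` has the usual Open MPI form "mpirun (Open MPI) X.Y.Z". The function names are hypothetical, and the two strings would be captured by running the command once on the host and once inside the container. It compares version numbers only, not how each MPI was built.

```python
import re


def openmpi_version(version_output):
    """Extract (major, minor, patch) from `mpirun --version` text, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def versions_match(host_output, container_output):
    """True only if both outputs parse and report the same Open MPI version."""
    host = openmpi_version(host_output)
    container = openmpi_version(container_output)
    return host is not None and host == container
```

A wrapper could use the parsed host version to pick which of the prebuilt images to pull.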

The images include only the environment for TCLB, not the actual code. The idea is that you pull them with singularity or docker, then clone TCLB inside them (your fork/branch/etc.) and compile.

Important: the MPI versions have to match, and you have to call mpirun first and singularity second. Otherwise you will not be using the host MPI library.
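
The ordering can be sketched as a small command builder, in which the host mpirun is the outer command and starts one `singularity exec` per rank. The function name and arguments are hypothetical; `-np` and `--nv` are the standard Open MPI and Singularity options.

```python
def hybrid_command(image, binary, case, nprocs, gpu=True):
    """Build a launch line with host mpirun outside and singularity inside."""
    # mpirun comes first so that the host MPI library does the launching.
    cmd = ["mpirun", "-np", str(nprocs), "singularity", "exec"]
    if gpu:
        # --nv exposes the host NVIDIA driver and devices to the container.
        cmd.append("--nv")
    cmd += [image, binary, case]
    return cmd
```

For example, `" ".join(hybrid_command("tclb.sif", "TCLB/CLB/d2q9/main", "file.xml", 4))` gives a line of the form `mpirun -np 4 singularity exec --nv tclb.sif TCLB/CLB/d2q9/main file.xml`.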

The test results I've got so far are weird. The container version performs badly over multiple nodes, but before some small changes in the compilation process the speed was all right. So it's inconclusive for now.

I'll try to make more tests on more clusters today and tomorrow.

llaniewski avatar Mar 19 '20 06:03 llaniewski

For CUDA you need the --nv flag:

mpirun singularity exec --nv tclb.sif TCLB/CLB/d2q9/main file.xml

@ggruszczynski - this is an infiniband failure; there must be some version mismatch or ABI incompatibility.

@llaniewski The Docker image should contain the apt-get part; conversion from Docker to Singularity is not 100% failsafe.

mdzik avatar Mar 19 '20 08:03 mdzik

# Rysy - hardware info:
# CPU type: Intel Skylake
# GPU type: NVIDIA Volta
# No of nodes: 6
# No of cores per node: 36
# No of GPUs per node: 4
# CPU Memory per node: 380 GB

Have a look at the "Executing command:" lines:

CASE I

ggruszcz@rysy ~/TCLB $ p/run d2q9  example/flow/2d/karman.xml 4
Trying to run example/flow/2d/karman.xml with d2q9 model on 4 (mpi)processes/gpus. Job details:
        CORES=NODESxTASKS_PER_NODExCORES_PER_TASK: 4 = 1 x 4 x 1
        TOTAL_CPU_MEMORY=MEMORY_PER_CORExCORES: 20gb = 5gb x 4
Submitted batch job 6863

ggruszcz@rysy ~/TCLB $ cat slurm-6836.out
###### Nodes:          #######
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
###### Loading modules #######
###### --------------- #######
Executing command:
singularity exec --nv /home/ggruszcz/TCLB/tclb_latest.sif mpirun /home/ggruszcz/TCLB/CLB/d2q9/main example/flow/2d/karman.xml

###### --------------- #######

libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[49546,1],3]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: rysy-n1

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
MPMD: TCLB: local:0/4 work:0/4 ---  connected to:
MPMD: TCLB: local:1/4 work:1/4 ---  connected to:
MPMD: TCLB: local:2/4 work:2/4 ---  connected to:
MPMD: TCLB: local:3/4 work:3/4 ---  connected to:
[  ]    #### : -------------------------------------------------------------------------
[  ]    #### : -  CLB version:   v6.0-beta-1645-g39d9bf8                               -
[  ]    #### : -        Model:                      d2q9                               -
[  ]    #### : -------------------------------------------------------------------------
[  ]    #### : Setting output path to: karman
[ 1]    ---- : Selecting device 1/4
[ 2]    ---- : Selecting device 2/4
[ 0]    ---- : Selecting device 0/4
[ 3]    ---- : Selecting device 3/4
[ 3] warning ! No "Units" element in config file
[ 0] warning ! No "Units" element in config file
[  ]    ==== : Mesh size in config file: 1024x100x1
[  ]    ---- : Global lattice size: 1024x100x1
[ 2] warning ! No "Units" element in config file
[ 1] warning ! No "Units" element in config file
[  ]    ==== : Max region size: 25600. Mesh size 102400. Overhead:  0%
[  ]    ---- : Local lattice size: 1024x25x1
[  ]    ---- :   Threads  |      Action
[  ]    ---- :    32x16   | Primal , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | Tangent , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | Optimize , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | SteadyAdjoint , NoGlobals , BaseIteration
[  ]    ---- :    32x16   | Primal , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | Tangent , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | Optimize , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | SteadyAdjoint , IntegrateGlobals , BaseIteration
[  ]    ---- :    32x16   | Primal , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | Tangent , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | Optimize , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | SteadyAdjoint , OnlyObjective , BaseIteration
[  ]    ---- :    32x16   | Primal , NoGlobals , BaseInit
[  ]    ---- :    32x16   | Tangent , NoGlobals , BaseInit
[  ]    ---- :    32x16   | Optimize , NoGlobals , BaseInit
[  ]    ---- :    32x16   | SteadyAdjoint , NoGlobals , BaseInit
[  ]    ---- :    32x16   | Primal , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | Tangent , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | Optimize , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | SteadyAdjoint , IntegrateGlobals , BaseInit
[  ]    ---- :    32x16   | Primal , OnlyObjective , BaseInit
[  ]    ---- :    32x16   | Tangent , OnlyObjective , BaseInit
[  ]    ---- :    32x16   | Optimize , OnlyObjective , BaseInit
[  ]    ---- :    32x16   | SteadyAdjoint , OnlyObjective , BaseInit
[  ]    #### : [0] Cumulative allocation of 3787328 b (3.8 MB)
[  ]    ---- : Creating geom size:25600
[  ]    #### : Setting output path to: karman
[  ]    #### : Setting output path to: output/karman
[  ]    ---- : loading geometry ...
[  ]    ---- : Setting number of zones to 3
[  ]    ---- : Setting VelocityX in zone  (-1) to 0.01 (0.010000)
[  ]    ---- : Setting Viscosity to 0.02 (0.020000)
[  ]    ---- : [0] Settings [viscosity] to 0.020000
[  ]    ---- : [0] Settings [one over relaxation time] to 1.785714
[  ]    ---- : [0] Settings [MRT Sx] to -0.785714
[ 0] WARNING ! Unknown setting Smag
[ 0] WARNING ! Unknown setting PressDiffInObj
[ 0] WARNING ! Unknown setting EOSScale
[ 0] WARNING ! Unknown setting Tension
[ 0] WARNING ! Unknown setting Coriolis
[ 0] WARNING ! Unknown setting SolidAlfa
[ 0] WARNING ! Unknown setting FluidAlfa
[ 0] WARNING ! Unknown setting InitTemperature
[ 0] WARNING ! Unknown setting InletTemperature
[  ]    ---- : Initializing Lattice ...
[ 1] WARNING ! Unknown setting Smag
[ 1] WARNING ! Unknown setting PressDiffInObj
[ 1] WARNING ! Unknown setting EOSScale
[ 1] WARNING ! Unknown setting Tension
[ 1] WARNING ! Unknown setting Coriolis
[ 1] WARNING ! Unknown setting SolidAlfa
[ 1] WARNING ! Unknown setting FluidAlfa
[ 1] WARNING ! Unknown setting InitTemperature
[ 1] WARNING ! Unknown setting InletTemperature
[ 2] WARNING ! Unknown setting Smag
[ 2] WARNING ! Unknown setting PressDiffInObj
[ 2] WARNING ! Unknown setting EOSScale
[ 2] WARNING ! Unknown setting Tension
[ 2] WARNING ! Unknown setting Coriolis
[ 2] WARNING ! Unknown setting SolidAlfa
[ 2] WARNING ! Unknown setting FluidAlfa
[ 2] WARNING ! Unknown setting InitTemperature
[ 2] WARNING ! Unknown setting InletTemperature
[ 3] WARNING ! Unknown setting Smag
[ 3] WARNING ! Unknown setting PressDiffInObj
[ 3] WARNING ! Unknown setting EOSScale
[ 3] WARNING ! Unknown setting Tension
[ 3] WARNING ! Unknown setting Coriolis
[ 3] WARNING ! Unknown setting SolidAlfa
[ 3] WARNING ! Unknown setting FluidAlfa
[ 3] WARNING ! Unknown setting InitTemperature
[ 3] WARNING ! Unknown setting InletTemperature
[  ]    ---- : Setting callback VTK at 1000.000000 iterations
[  ]    ---- : Adding VTK to the solver hands
[  ]    ---- : Setting action Solve at 10000.000000 iterations
[  ]    ---- :    567.2 MLBUps     82.82 GB/s [====================]
[  ]    ---- :     1000 it writing vtk
[  ]    ---- :    459.4 MLBUps     67.08 GB/s [====================]
[  ]    ---- :     2000 it writing vtk
[  ]    ---- :    583.0 MLBUps     85.12 GB/s [====================]
[  ]    ---- :     3000 it writing vtk
[  ]    ---- :    577.5 MLBUps     84.32 GB/s [====================]
[  ]    ---- :     4000 it writing vtk
[  ]    ---- :    465.0 MLBUps     67.90 GB/s [====================]
[  ]    ---- :     5000 it writing vtk
[  ]    ---- :    586.0 MLBUps     85.55 GB/s [====================]
[  ]    ---- :     6000 it writing vtk
[  ]    ---- :    464.1 MLBUps     67.76 GB/s [====================]
[  ]    ---- :     7000 it writing vtk
[  ]    ---- :    584.5 MLBUps     85.34 GB/s [====================]
[  ]    ---- :     8000 it writing vtk
[  ]    ---- :    573.1 MLBUps     83.67 GB/s [====================]
[  ]    ---- :     9000 it writing vtk
[  ]    ---- :    587.1 MLBUps     85.71 GB/s [====================]
[  ]    ---- :    10000 it writing vtk
[  ]    ---- : Total duration: 2.141297 s = 0.035688 min = 0.000595 h
[rysy-n1.icm.edu.pl:56877] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[rysy-n1.icm.edu.pl:56877] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

CASE II

ggruszcz@rysy ~/TCLB $ p/run d2q9  example/flow/2d/karman.xml 8
Trying to run example/flow/2d/karman.xml with d2q9 model on 8 (mpi)processes/gpus. Job details:
        CORES=NODESxTASKS_PER_NODExCORES_PER_TASK: 8 = 2 x 4 x 1
        TOTAL_CPU_MEMORY=MEMORY_PER_CORExCORES: 40gb = 5gb x 8
Submitted batch job 6865

ggruszcz@rysy ~/TCLB $ cat slurm-6865.out
###### Nodes:          #######
rysy-n1.icm.edu.pl
rysy-n2.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n1.icm.edu.pl
rysy-n2.icm.edu.pl
rysy-n2.icm.edu.pl
rysy-n2.icm.edu.pl
###### Loading modules #######
###### --------------- #######
Executing command:
singularity exec --nv /home/ggruszcz/TCLB/tclb_latest.sif mpirun /home/ggruszcz/TCLB/CLB/d2q9/main example/flow/2d/karman.xml

###### --------------- #######

[rysy-n1.icm.edu.pl:108073] [[47503,0],0] ORTE_ERROR_LOG: Not found in file plm_slurm_module.c at line 420

Conclusion: if you run singularity first and mpirun second (i.e. mpirun inside the singularity container), it is not possible to run the job on more than 1 node.

The message below does not appear when the job runs on 1 process.

[rysy-n1.icm.edu.pl:56877] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[rysy-n1.icm.edu.pl:56877] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
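
The job details printed by p/run in the two cases above (4 = 1 x 4 x 1 with 20gb, 8 = 2 x 4 x 1 with 40gb) are consistent with the following arithmetic. This is only a sketch under assumptions: one MPI task per GPU, 4 GPUs per node as on Rysy, and 5 GB per core. The function name is hypothetical and the real p/run script may compute the layout differently.

```python
import math


def job_layout(nprocs, gpus_per_node=4, mem_per_core_gb=5, cores_per_task=1):
    """Return nodes, tasks per node, cores and memory for nprocs MPI tasks."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    # Fill nodes with at most one task per GPU.
    nodes = math.ceil(nprocs / gpus_per_node)
    # Spread tasks evenly; rounds up when nprocs is not a multiple of nodes.
    tasks_per_node = math.ceil(nprocs / nodes)
    cores = nodes * tasks_per_node * cores_per_task
    return {
        "nodes": nodes,
        "tasks_per_node": tasks_per_node,
        "cores": cores,
        "memory_gb": mem_per_core_gb * cores,
    }
```

With these defaults, any request for more than 4 processes spans at least two nodes, which is exactly the situation in which CASE II fails.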

ggruszczynski avatar Mar 19 '20 12:03 ggruszczynski

btl:no-nics - no network interfaces (memory transport btl)

@llaniewski Have you succeeded on prometheus via IB?

mdzik avatar Mar 19 '20 15:03 mdzik

@mdzik I'm currently testing on a different cluster. But it looks like the MPI in the container has to match the MPI on the host exactly, including OpenIB etc., which does not look very practical. I don't even know how to install all the stuff (OpenIB, pmi2, ucx) that the host MPI is compiled with.

llaniewski avatar Mar 20 '20 00:03 llaniewski

There is something called ABI compatibility, so those dependencies could be dynamically loaded - will investigate

mdzik avatar Mar 20 '20 18:03 mdzik