
[Question]: Examples with FOM or similar?

Open vsoch opened this issue 7 months ago • 19 comments

What do you want?

Hi! Do you have any examples or tutorials that could be run for a strong or weak scaling study, with some kind of FOM (even if it's just running time)? We are doing a study on 4 to 64 nodes and looking for apps / proxy apps / benchmarks / synthetic benchmarks that could be candidates. The only requirement is that I can build it into a container and run it across nodes. Thanks!

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

vsoch avatar May 05 '25 15:05 vsoch

We are actively working on efficient samurai parallelization and hope to have working examples in June. You can run cases with MPI, but the performance will probably not be there once you enable mesh adaptation. The biggest challenge with these methods is load balancing between subdomains, which has to happen regularly because of the dynamic adaptation. We are working on different approaches to efficient load balancing: space-filling curves and diffusion algorithms.
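To make the space-filling-curve idea concrete, here is a minimal illustrative sketch (not samurai's actual implementation; the function names are made up): cells are ordered along a Z-order (Morton) curve and split into contiguous chunks, which tends to keep spatially nearby cells on the same rank.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order (Morton) index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

def partition_by_sfc(cells, nranks):
    """Assign cells (a list of (x, y) indices) to ranks as contiguous
    chunks of the Morton ordering."""
    ordered = sorted(cells, key=lambda c: morton2d(*c))
    chunk = -(-len(ordered) // nranks)  # ceiling division
    return [ordered[i * chunk:(i + 1) * chunk] for i in range(nranks)]
```

Diffusion-based approaches instead shift cells between neighboring subdomains until the loads even out; both aim to rebalance cheaply after each adaptation step.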

You can try the advection2d.cpp case in the demos/FiniteVolume directory without mesh refinement (min_level == max_level). We use CLI11 to manage the various command-line options; you can print them by running the executable followed by -h. Once we have finished implementing the load balancing, we can repeat the experiment with mesh adaptation. What do you think?

In any case, we're interested in your approach and setting up the procedure could be useful once everything is ready.

I've seen that other projects have been contacted. Could you tell us a little more about the purpose of this study?

gouarin avatar May 06 '25 15:05 gouarin

I've seen that other projects have been contacted. Could you tell us a little more about the purpose of this study?

Sure! We are running a study on Google Cloud H3 instances. I don't want to call it a performance study, because it's far from a classical performance study, but we are running apps from sizes 4 up to 64, 128, or 256 nodes (depending on the scaling result) for strong or weak scaling. What we want to highlight is the deployment and portability of the apps: each is deployed via Helm charts with Flux (an HPC job scheduler, and the system scheduler on El Capitan), and you can see the set here:

https://github.com/converged-computing/flux-apps-helm/

So far I've done just under 30, and we are going to run out of credits at the end of the month, so I'm strategically trying to put together batches to run (for each app I need to understand it, containerize it, test it, and then deploy it at scale). I created and manage the RSEPedia, so I'm searching in there to find apps (which is how I found you).

Right now I'm on a small road trip, but I'll pick up work Thursday evening!

vsoch avatar May 07 '25 03:05 vsoch

I'm having a hard time building. I first tried installing dependencies directly with the system package manager, but found that my gcc didn't support C++17. The conan install didn't work either. So I tried setting up the development environment with mamba (and got farther there):

RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh && \
   chmod +x Miniforge3-Linux-x86_64.sh && \
   bash Miniforge3-Linux-x86_64.sh -p /opt/miniconda -b

ENV PATH=/opt/miniconda/bin:$PATH
RUN mamba install -y samurai && \
    mamba install -y cxx-compiler cmake make && \
    mamba install -y petsc pkg-config && \
    mamba install -y libboost-mpi libboost-devel libboost-headers 'hdf5=*=mpi*'
RUN git clone https://github.com/hpc-maths/samurai /opt/samurai && \
    cd /opt/samurai && \
    cmake . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_DEMOS=ON && \
    cmake --build ./build --config Release

But then for the second to last cmake command:

/opt/samurai/include/samurai/io/hdf5.hpp:765:45: error: no matching function for call to 'HighFive::Selection::write_raw(samurai::ScalarField<samurai::MRMesh<samurai::MRConfig<2> >, long unsigned int>::value_type*&, HighFive::AtomicType<long unsigned int>, HighFive::PropertyList<HighFive::PropertyType::DATASET_XFER>&)'
  765 |                         data_slice.write_raw(data_ptr, HighFive::AtomicType<typename Field::value_type>{}, xfer_props);

Do you have a Dockerfile already working, or can make a suggestion? Ideally I could build this alongside system software (and not need an isolated environment). Thanks!

vsoch avatar May 09 '25 02:05 vsoch

In the conda directory, we have the MPI environment needed for samurai. But it seems that you have already made the right choices for package installation. You also have to activate MPI support, which is not the default behavior. To do that, add the following option to the cmake command line: -DWITH_MPI=ON.

You can take a look at the CI, where the MPI support is tested: https://github.com/hpc-maths/samurai/blob/master/.github/workflows/ci.yml#L218-L306

We also have a spack package for samurai (https://packages.spack.io/package.html?name=samurai). It is not the latest version, but we will update it today. You can get a Dockerfile for free with spack, as explained here: https://spack.readthedocs.io/en/latest/containers.html
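For the spack route, a minimal spack.yaml sketch that spack containerize can turn into a Dockerfile (the image choices here are assumptions; see the containers documentation linked above for the exact options):

```yaml
spack:
  specs:
    - samurai
  container:
    format: docker
    images:
      os: ubuntu:22.04
      spack: develop
```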

Hope this helps!

gouarin avatar May 09 '25 05:05 gouarin

That built the software OK, but then compiling all the examples failed.

Image

Scrolling up to the top of my terminal, I don't see any red, so it's not clear what failed. Can you direct me to a specific build command (akin to one in the CI yaml you shared) that would be the best fit for the scaling study? Likely I can build one of the demos. I know you mentioned:

You can try the advection2d.cpp case in the demos/FiniteVolume directory without mesh refinement

Can you show me how to build and run that? Apologies for having to ask - I'm not a pro with cmake!

vsoch avatar May 09 '25 17:05 vsoch

Here is the full updated Dockerfile. The base image includes our HPC workload manager, Flux.

ARG base=ghcr.io/converged-computing/flux-openmpi:ubuntu2204
FROM ${base}
WORKDIR /opt

RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh && \
   chmod +x Miniforge3-Linux-x86_64.sh && \
   bash Miniforge3-Linux-x86_64.sh -p /opt/miniconda -b

ENV PATH=/opt/miniconda/bin:$PATH
COPY ./mpi-environment.yaml ./mpi-environment.yaml
RUN mamba env create --file ./mpi-environment.yaml && \
    mamba shell init --shell bash && \
    . ~/.bashrc && \
    mamba activate samurai-env && \
    mamba install -y mpich petsc pkg-config cxx-compiler
RUN git clone https://github.com/hpc-maths/samurai /opt/samurai && \
    cd /opt/samurai && \
    cmake . -Bbuild -GNinja \
       -DCMAKE_BUILD_TYPE=Release \
       -DWITH_MPI=ON \
       -DBUILD_DEMOS=ON \
       -DBUILD_TESTS=ON && \
       cmake --build ./build --config Release

The last command above is what failed.

vsoch avatar May 09 '25 17:05 vsoch

I realize that we're only compiling two examples with MPI:

https://github.com/hpc-maths/samurai/blob/master/.github/workflows/ci.yml#L272-L273

So I suggest starting with

cmake --build build --target finite-volume-advection-2d 

and confirm that it works. I know that other examples work, as we work on their load balancing. But since we haven't tried to compile all of samurai's executables, there may be some issues. Sorry about that. I will have a look and fix it.

gouarin avatar May 09 '25 18:05 gouarin

When I build that example, I get this error again:

Image

For the Dockerfile above, everything is the same except that the last line is changed to the one you provided. Do I have an issue with a dependency version or similar? It seems this call has the wrong number of arguments:

/opt/miniconda/envs/samurai-env/include/highfive/bits/H5Slice_traits.hpp:122:10: note:   template argument deduction/substitution failed:
/opt/samurai/include/samurai/io/hdf5.hpp:765:45: note:   candidate expects 2 arguments, 3 provided
  765 |                         data_slice.write_raw(data_ptr, HighFive::AtomicType<typename Field::value_type>{}, xfer_props);

vsoch avatar May 09 '25 18:05 vsoch

Could you run mamba list after activating the environment?

I will check the versions to see if anything is wrong.

gouarin avatar May 09 '25 18:05 gouarin

Good idea!

(samurai-env) root@2f26b57dd5c7:/opt/samurai# mamba list
List of packages in environment: "/opt/miniconda/envs/samurai-env"

  Name                      Version       Build                    Channel    
────────────────────────────────────────────────────────────────────────────────
  _libgcc_mutex             0.1           conda_forge              conda-forge
  _openmp_mutex             4.5           3_kmp_llvm               conda-forge
  _x86_64-microarch-level   4             2_x86_64_v4              conda-forge
  attr                      2.5.1         h166bdaf_1               conda-forge
  binutils                  2.43          h4852527_4               conda-forge
  binutils_impl_linux-64    2.43          h4bf12b8_4               conda-forge
  binutils_linux-64         2.43          h4852527_4               conda-forge
  bzip2                     1.0.8         h4bc722e_7               conda-forge
  c-ares                    1.34.5        hb9d3cd8_0               conda-forge
  c-compiler                1.9.0         h2b85faf_0               conda-forge
  ca-certificates           2025.4.26     hbd8a1cb_0               conda-forge
  cached-property           1.5.2         hd8ed1ab_1               conda-forge
  cached_property           1.5.2         pyha770c72_1             conda-forge
  cli11                     2.4.2         h5888daf_0               conda-forge
  cmake                     4.0.2         h74e3db0_0               conda-forge
  colorama                  0.4.6         pyhd8ed1ab_1             conda-forge
  cxx-compiler              1.9.0         h1a2810e_0               conda-forge
  cxxopts                   3.2.1         h74c10a1_1               conda-forge
  exceptiongroup            1.2.2         pyhd8ed1ab_1             conda-forge
  fftw                      3.3.10        mpi_mpich_hbcf76dd_10    conda-forge
  fmt                       11.1.4        h07f6e7f_1               conda-forge
  gcc                       13.3.0        h9576a4e_2               conda-forge
  gcc_impl_linux-64         13.3.0        h1e990d8_2               conda-forge
  gcc_linux-64              13.3.0        hc28eda2_10              conda-forge
  gxx                       13.3.0        h9576a4e_2               conda-forge
  gxx_impl_linux-64         13.3.0        hae580e1_2               conda-forge
  gxx_linux-64              13.3.0        h6834431_10              conda-forge
  h5py                      3.13.0        nompi_py313hfaf8fd4_101  conda-forge
  hdf5                      1.14.6        mpi_mpich_h7f58efa_1     conda-forge
  highfive                  2.3.1         h4bd325d_0               conda-forge
  hypre                     2.32.0        mpi_mpich_h2e71eac_1     conda-forge
  icu                       75.1          he02047a_0               conda-forge
  iniconfig                 2.0.0         pyhd8ed1ab_1             conda-forge
  kernel-headers_linux-64   3.10.0        he073ed8_18              conda-forge
  keyutils                  1.6.1         h166bdaf_0               conda-forge
  krb5                      1.21.3        h659f571_0               conda-forge
  ld_impl_linux-64          2.43          h712a8e2_4               conda-forge
  libaec                    1.1.3         h59595ed_0               conda-forge
  libamd                    3.3.3         haaf9dc3_7100102         conda-forge
  libblas                   3.9.0         31_h59b9bed_openblas     conda-forge
  libboost                  1.85.0        h0ccab89_4               conda-forge
  libboost-devel            1.85.0        h00ab1b0_4               conda-forge
  libboost-headers          1.85.0        ha770c72_4               conda-forge
  libboost-mpi              1.85.0        h750f1fb_3               conda-forge
  libbtf                    2.3.2         h32481e8_7100102         conda-forge
  libcamd                   3.3.3         h32481e8_7100102         conda-forge
  libcap                    2.75          h39aace5_0               conda-forge
  libcblas                  3.9.0         31_he106b2a_openblas     conda-forge
  libccolamd                3.3.4         h32481e8_7100102         conda-forge
  libcholmod                5.3.1         h59ddab4_7100102         conda-forge
  libcolamd                 3.3.4         h32481e8_7100102         conda-forge
  libcurl                   8.13.0        h332b0f4_0               conda-forge
  libedit                   3.1.20250104  pl5321h7949ede_0         conda-forge
  libev                     4.33          hd590300_2               conda-forge
  libevent                  2.1.12        hf998b51_1               conda-forge
  libexpat                  2.7.0         h5888daf_0               conda-forge
  libfabric                 2.1.0         ha770c72_1               conda-forge
  libfabric1                2.1.0         hf45584d_1               conda-forge
  libffi                    3.4.6         h2dba641_1               conda-forge
  libgcc                    15.1.0        h767d61c_2               conda-forge
  libgcc-devel_linux-64     13.3.0        hc03c837_102             conda-forge
  libgcc-ng                 15.1.0        h69a702a_2               conda-forge
  libgcrypt-lib             1.11.0        hb9d3cd8_2               conda-forge
  libgfortran               15.1.0        h69a702a_2               conda-forge
  libgfortran-ng            15.1.0        h69a702a_2               conda-forge
  libgfortran5              15.1.0        hcea5267_2               conda-forge
  libgomp                   15.1.0        h767d61c_2               conda-forge
  libgpg-error              1.55          h3f2d84a_0               conda-forge
  libhwloc                  2.11.2        default_h0d58e46_1001    conda-forge
  libiconv                  1.18          h4ce23a2_1               conda-forge
  libklu                    2.3.5         hf24d653_7100102         conda-forge
  liblapack                 3.9.0         31_h7ac8fdf_openblas     conda-forge
  liblzma                   5.8.1         hb9d3cd8_1               conda-forge
  liblzma-devel             5.8.1         hb9d3cd8_1               conda-forge
  libmpdec                  4.0.0         h4bc722e_0               conda-forge
  libnghttp2                1.64.0        h161d5f1_0               conda-forge
  libnl                     3.11.0        hb9d3cd8_0               conda-forge
  libopenblas               0.3.29        openmp_hd680484_0        conda-forge
  libpmix                   5.0.7         h658e747_0               conda-forge
  libptscotch               7.0.6         h4c3caac_1               conda-forge
  libsanitizer              13.3.0        he8ea267_2               conda-forge
  libscotch                 7.0.6         hea33c07_1               conda-forge
  libspqr                   4.3.4         h852d39f_7100102         conda-forge
  libsqlite                 3.49.2        hee588c1_0               conda-forge
  libssh2                   1.11.1        hcf80075_0               conda-forge
  libstdcxx                 15.1.0        h8f9b012_2               conda-forge
  libstdcxx-devel_linux-64  13.3.0        hc03c837_102             conda-forge
  libstdcxx-ng              15.1.0        h4852527_2               conda-forge
  libsuitesparseconfig      7.10.1        h92d6892_7100102         conda-forge
  libsystemd0               257.4         h4e0b6ca_1               conda-forge
  libudev1                  257.4         hbe16f8c_1               conda-forge
  libumfpack                6.3.5         heb53515_7100102         conda-forge
  libuuid                   2.38.1        h0b41bf4_0               conda-forge
  libuv                     1.50.0        hb9d3cd8_0               conda-forge
  libxml2                   2.13.8        h4bc477f_0               conda-forge
  libzlib                   1.3.1         hb9d3cd8_2               conda-forge
  llvm-openmp               20.1.4        h024ca30_0               conda-forge
  lz4-c                     1.10.0        h5888daf_1               conda-forge
  metis                     5.1.0         hd0bcaf9_1007            conda-forge
  mpi                       1.0.1         mpich                    conda-forge
  mpich                     4.3.0         h1a8bee6_100             conda-forge
  mumps-include             5.7.3         h23d43cc_10              conda-forge
  mumps-mpi                 5.7.3         h8c07e11_10              conda-forge
  ncurses                   6.5           h2d0b736_3               conda-forge
  ninja                     1.12.1        hff21bea_1               conda-forge
  numpy                     2.2.5         py313h17eae1a_0          conda-forge
  openssl                   3.5.0         h7b32b05_1               conda-forge
  packaging                 25.0          pyh29332c3_1             conda-forge
  parmetis                  4.0.3         hc7bef4e_1007            conda-forge
  petsc                     3.23.1        real_hf9cfe27_0          conda-forge
  pip                       25.1.1        pyh145f28c_0             conda-forge
  pkg-config                0.29.2        h4bc722e_1009            conda-forge
  pluggy                    1.5.0         pyhd8ed1ab_1             conda-forge
  pugixml                   1.15          h3f63f65_0               conda-forge
  pytest                    8.3.5         pyhd8ed1ab_0             conda-forge
  python                    3.13.3        hf636f53_101_cp313       conda-forge
  python_abi                3.13          7_cp313                  conda-forge
  rdma-core                 57.0          h5888daf_0               conda-forge
  readline                  8.2           h8c095d6_2               conda-forge
  rhash                     1.4.5         hb9d3cd8_0               conda-forge
  scalapack                 2.2.0         h7e29ba8_4               conda-forge
  superlu                   7.0.1         h8f6e6c4_0               conda-forge
  superlu_dist              9.1.0         h0804ebd_0               conda-forge
  sysroot_linux-64          2.17          h0157908_18              conda-forge
  tk                        8.6.13        noxft_h4845f30_101       conda-forge
  tomli                     2.2.1         pyhd8ed1ab_1             conda-forge
  tzdata                    2025b         h78e105d_0               conda-forge
  ucc                       1.3.0         had72a48_5               conda-forge
  ucx                       1.18.0        h1369271_4               conda-forge
  xtensor                   0.25.0        h00ab1b0_0               conda-forge
  xtl                       0.7.7         h00ab1b0_0               conda-forge
  xz                        5.8.1         hbcc6ac9_1               conda-forge
  xz-gpl-tools              5.8.1         hbcc6ac9_1               conda-forge
  xz-tools                  5.8.1         hb9d3cd8_1               conda-forge
  yaml                      0.2.5         h7f98852_2               conda-forge
  zstd                      1.5.7         hb8e6e7a_2               conda-forge

vsoch avatar May 09 '25 18:05 vsoch

The highfive version is too old; the correct version is 2.10. I don't know how this could have happened! The versions of the other dependencies look good.

gouarin avatar May 09 '25 19:05 gouarin

I tried updating, but it seems that somehow caused hdf5 to lose parallel support?

Image

That is the initial cmake command (which was working before). I'm not sure what to try next.

vsoch avatar May 11 '25 15:05 vsoch

I tried it, and here is what works for me:

  • mpi-environment.yaml
name: samurai-env
channels:
  - conda-forge
dependencies:
  - cmake
  - ninja
  - xtensor<0.26
  - highfive>=2.10
  - fmt
  - pugixml
  - cxxopts
  - cli11<2.5
  - pytest
  - h5py
  - openmpi
  - libboost-mpi<1.87 # MPI seems broken with 1.87 with the error : symbol not found in flat namespace '_PyBaseObject_Type'
  - libboost-devel
  - libboost-headers
  - hdf5=*=mpi*
  • Dockerfile
ARG base=ghcr.io/converged-computing/flux-openmpi:ubuntu2204
FROM ${base}
WORKDIR /opt

RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh && \
    chmod +x Miniforge3-Linux-x86_64.sh && \
    bash Miniforge3-Linux-x86_64.sh -p /opt/miniconda -b

ENV PATH=/opt/miniconda/bin:$PATH
COPY ./mpi-environment.yaml ./mpi-environment.yaml
RUN mamba env create --file ./mpi-environment.yaml
SHELL ["conda", "run", "-n", "samurai-env", "/bin/bash", "-c"]

RUN mamba install -y cxx-compiler
RUN git clone https://github.com/hpc-maths/samurai /opt/samurai && \
    cd /opt/samurai && \
    cmake . -Bbuild -GNinja \
    -DCMAKE_BUILD_TYPE=Release \
    -DWITH_MPI=ON \
    -DBUILD_DEMOS=ON && \
    cmake --build ./build --config Release --target finite-volume-advection-2d

Hope this helps!

gouarin avatar May 11 '25 19:05 gouarin

Thank you! I should be able to test again this week! I'm epically flailing with ebpf in containers at the moment. 😆

vsoch avatar May 12 '25 01:05 vsoch

OK - the container is built and the finite volume demo is working in a test environment!

Image

Here are the options I see:

Image

How should I run this starting at 4 nodes up to (likely) 64? What kind of scaling should I use, what parameters should I set (and how do I change them, if needed, aside from the nodes and tasks), and what is the FOM?

I'm off to bed but can run these tomorrow.

vsoch avatar May 22 '25 06:05 vsoch

Nice !

The first thing you can do is run the example without the adaptive mesh. To do this, set min_level=max_level. If you are doing a strong scaling measurement, you can start with max_level=min_level=14. For weak scaling, you need to increase the number of levels with the number of subdomains. The domain is divided in the y-direction, and the number of cells in each direction is $2^{level}$.
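To make the weak-scaling bookkeeping concrete: a uniform mesh at level $L$ has $2^L \times 2^L = 4^L$ cells, so each extra level multiplies the work by 4. A small helper (illustrative only; the function name and base values are made up) that picks the smallest level keeping cells-per-rank at least constant:

```python
def weak_scaling_level(base_level, base_ranks, ranks):
    """Smallest uniform level whose total cell count keeps cells-per-rank
    at least as large as in the base run. Total cells at level L are
    (2**L)**2 = 4**L, so each unit increase in level is 4x the work."""
    level = base_level
    cells = 4 ** base_level                  # total cells at the base level
    target = cells * (ranks // base_ranks)   # keep cells-per-rank constant
    while 4 ** level < target:
        level += 1
    return level
```

Since levels are integers, work can only grow in factors of 4, so exact weak scaling is only possible when the rank count also grows by factors of 4.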

gouarin avatar May 22 '25 18:05 gouarin

Do you have an example with strong scaling? Or do I just use min and max == 14 and give it more resources?

vsoch avatar May 22 '25 22:05 vsoch

I would basically do:

flux run -N4 -n352 ./finite-volume-advection-2d --min-level=14 --max-level=14
flux run -N8 -n704 ./finite-volume-advection-2d --min-level=14 --max-level=14
...
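That doubling pattern could be scripted; a small sketch (88 tasks per node is an assumption read off -N4 -n352, so adjust for the actual core count):

```python
def scaling_sweep(min_nodes=4, max_nodes=64, tasks_per_node=88, level=14):
    """Yield one flux run command per node count, doubling each step."""
    nodes = min_nodes
    while nodes <= max_nodes:
        yield (f"flux run -N{nodes} -n{nodes * tasks_per_node} "
               f"./finite-volume-advection-2d "
               f"--min-level={level} --max-level={level}")
        nodes *= 2

for cmd in scaling_sweep():
    print(cmd)
```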

Up to the largest size (likely 64). Is the FOM just the duration?

vsoch avatar May 22 '25 22:05 vsoch

Yes, I think that is enough. You should probably remove the I/O, though; otherwise you will write a big file and most of the time will be spent on that task.

To remove the I/O, you just have to comment the lines where you have the save function called in the main function.
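Once the durations are collected, speedup and parallel efficiency relative to the smallest run are the natural derived FOMs. A small helper for that bookkeeping (illustrative, not part of samurai; feed it real measurements):

```python
def strong_scaling_table(timings):
    """timings: {nodes: seconds} for a fixed problem size.
    Returns {nodes: (speedup, efficiency)} relative to the smallest run."""
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    table = {}
    for nodes in sorted(timings):
        speedup = base_time / timings[nodes]
        efficiency = speedup * base_nodes / nodes  # 1.0 == perfect scaling
        table[nodes] = (speedup, efficiency)
    return table
```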

gouarin avatar May 23 '25 04:05 gouarin