Detect over-spending of CPU

Open berland opened this issue 1 year ago • 7 comments

The forward model step runner (fm_dispatch) already reports memory usage back to the ERT application. It could similarly report CPU time consumption. Today the ERT GUI reports the wall-clock duration of a forward model step.

It is also possible to ask the OS for the CPU time used by a process and its descendants. When the ERT GUI receives this information, it can compare CPU time to wall-clock time to detect whether parallelization has been at play and, if so, compare the degree of parallelism to NUM_CPU. Typically, if NUM_CPU is 1 and a process has run for a significant time while using more than one core, this should be reported back to the user as a warning.
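
A minimal sketch of the comparison described above, assuming Python on a POSIX system. The helper names `run_and_measure` and `overspending_warning` and the `min_wall` and `tolerance` thresholds are hypothetical and not taken from ERT. Note that `os.times()` only accounts for child processes that have been waited for.

```python
import os
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd; return (wall_seconds, cpu_seconds) for it and its waited-for descendants."""
    before = os.times()
    wall_start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - wall_start
    after = os.times()
    # children_user/children_system accumulate CPU time of reaped child processes.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


def overspending_warning(wall, cpu, num_cpu, min_wall=1.0, tolerance=1.5):
    """Return a warning string if the average core usage exceeds NUM_CPU, else None.

    min_wall and tolerance are illustrative thresholds (assumptions), meant to
    ignore very short steps and small measurement noise.
    """
    if wall < min_wall:
        return None
    cores_used = cpu / wall
    if cores_used > num_cpu * tolerance:
        return (
            f"Step used on average {cores_used:.1f} cores, "
            f"but NUM_CPU is {num_cpu}"
        )
    return None


if __name__ == "__main__":
    wall, cpu = run_and_measure([sys.executable, "-c", "sum(range(10**6))"])
    print(overspending_warning(wall, cpu, num_cpu=1))
```

Separating the measurement from the decision keeps the threshold logic testable without launching any processes.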

berland avatar Sep 10 '24 12:09 berland

Proof of concept:

$ /usr/lib64/openmpi/bin/mpicxx -fopenmp -std=c++17 -o omp_mpi omp_mpi.c -lgomp
$ cat runme.sh 
/usr/lib64/openmpi/bin/mpirun -np 8 ./omp_mpi
$ time bash runme.sh
I'm thread 2 out of 10 on MPI process nr. 2 out of 8, while hardware_concurrency reports 10 processors
I'm thread 6 out of 10 on MPI process nr. 2 out of 8, while hardware_concurrency reports 10 processors
[...snip...]
I'm thread 4 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 5 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 6 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 1 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors

real	0m5.723s
user	0m31.502s
sys	0m11.343s
$ cat omp_mpi.c
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>
#include <time.h>

int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        const double ticks_per_sec = (double)CLOCKS_PER_SEC;
        clock_t start = clock();

        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
        << " out of " << nthreads
        << " on MPI process nr. " << rank
        << " out of " << nprocs
        << ", while hardware_concurrency reports " << cxx_procs
        << " processors\n";
        std::cout << omp_stream.str();
        volatile double dummy;
        while (1) {
            for (int i = 0; i < 1000; ++i) {
                dummy = i * 3.14159;
            }
            double elapsed = (double)(clock() - start) / ticks_per_sec;
            if (elapsed >= 200)
                break;
        }
    }
    MPI_Finalize();
    return 0;
}
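
As a worked example of the arithmetic, the `time` figures from the transcript above can be turned into an average core count. This is only a sketch: `parse_bash_time` is a hypothetical helper, and it assumes the `NmS.SSSs` format that bash's `time` keyword printed here.

```python
import re


def parse_bash_time(text):
    """Parse the real/user/sys lines printed by bash's time keyword into seconds."""
    pattern = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$", re.MULTILINE)
    return {
        name: int(minutes) * 60 + float(seconds)
        for name, minutes, seconds in pattern.findall(text)
    }


# Figures copied from the proof-of-concept run above.
sample = "real\t0m5.723s\nuser\t0m31.502s\nsys\t0m11.343s\n"

times = parse_bash_time(sample)
cpu = times["user"] + times["sys"]
ratio = cpu / times["real"]
print(f"{cpu:.3f} s CPU over {times['real']:.3f} s wall = {ratio:.2f} cores on average")
# prints: 42.845 s CPU over 5.723 s wall = 7.49 cores on average
```

A ratio this far above 1 is the kind of signal that could be compared against NUM_CPU.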

berland avatar Sep 10 '24 12:09 berland

Experimenting with OMP_NUM_THREADS and the -np option, it seems we can only detect the case when -np is more than 1.

berland avatar Sep 10 '24 13:09 berland

Tested some on a RHEL8 node. time seems to always give correct numbers. The weird thing is that mpirun with -np 1 or 2 restricts OpenMP from running on any other physical core. You can see it if you run htop in a different terminal.

$ time OMP_NUM_THREADS=10 /usr/lib64/openmpi/bin/mpirun -np 1 ./omp_mpi

This will show 1 process at 100% and the others at 10%, but you can also see in htop that only 1 core is utilized. time will show the same user and total time.

If you increase -np to 3 or higher then it will no longer pin the cores the program runs on.
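
The pinning described here explains why a CPU/wall ratio cannot reveal oversubscribed threads in this case: a process restricted to one core cannot accumulate more CPU time than wall-clock time. A small sketch, assuming Linux and Python, of how a process could inspect its own affinity (`allowed_cores` and `max_cpu_seconds` are hypothetical helper names):

```python
import os


def allowed_cores():
    """Number of cores this process may be scheduled on, or None if unavailable.

    os.sched_getaffinity exists only on some platforms (e.g. Linux).
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return None


def max_cpu_seconds(wall_seconds, cores):
    """Upper bound on CPU time obtainable in wall_seconds on the given number of cores."""
    return wall_seconds * cores


if __name__ == "__main__":
    cores = allowed_cores()
    print(f"allowed cores: {cores}")
    if cores is not None:
        print(f"max CPU time in 10 s wall: {max_cpu_seconds(10.0, cores)} s")
```

Run under mpirun with core binding, a check like this would be expected to report a single allowed core, in which case CPU time stays at or below wall time no matter how many threads are started.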

JHolba avatar Sep 11 '24 16:09 JHolba

Adding --bind-to core when using -np >= 3 will give the same behavior as -np 1 or 2.
Adding --bind-to none will give the same behavior as -np >= 3.
You can add --report-bindings to see if processes are bound to cores or not.

--bind-to core -cpus-per-proc 2 will bind 2 cores per process.
-np 2 --bind-to core -cpus-per-proc 2 will give 2 processes bound to 2 cores each.

JHolba avatar Sep 11 '24 16:09 JHolba

Blocked by https://github.com/equinor/ert/issues/10057

berland avatar Mar 04 '25 11:03 berland

The blocking issue was closed, @berland. What is the current status of this issue?

eivindjahren avatar Aug 25 '25 06:08 eivindjahren

The tick-box in the issue text is not checked off:

Display warning to user in the GUI

berland avatar Aug 25 '25 06:08 berland

🎉

oyvindeide avatar Jan 12 '26 13:01 oyvindeide