MMseqs2 Default threads number value should use sched

Hello,

default thread value are computed from Parameters::setDefaults https://github.com/soedinglab/MMseqs2/blob/b0e91c12850cf1cf84093c5a751a94679e1ab831/src/commons/Parameters.cpp#L2298 If you run mmseqs on a 96 cores machines with taskset -c 1 or a slurm allocation with a single core requested, mmseqs will use 96 threads on a single core which is unfortunate.

[...]
    threads = 1;
    compressed = WRITER_ASCII_MODE;
#ifdef OPENMP
    char * threadEnv = getenv("MMSEQS_NUM_THREADS");
    if (threadEnv != NULL) {
        threads = (int) Util::fast_atoi<unsigned int>(threadEnv);
    } else {
        #ifdef _SC_NPROCESSORS_ONLN
            threads = sysconf(_SC_NPROCESSORS_ONLN);
        #endif
        if(threads <= 1){
            threads = Util::omp_thread_count();
        }
    }

#endif

If you run mmseqs on a 96 cores machines with taskset -c 1 or a slurm allocation with a single core requested, mmseqs will use 96 threads on a single core which is unfortunate.

Here is a reproducer:

$ cat ccores.c 
#define _GNU_SOURCE  // required to use sched.h
#include <sched.h>
#include <stdio.h>
#include <sys/sysinfo.h>
#include <unistd.h>  // required to use sysconf
 
int main(void) {
    cpu_set_t cs;
    CPU_ZERO(&cs);
    sched_getaffinity(0, sizeof(cs), &cs);
 
    printf("using get_nprocs: %d\n", get_nprocs());
    printf("using sysconf(_SC_NPROCESSORS_ONLN): %d\n", sysconf(_SC_NPROCESSORS_ONLN));

    int n = 0;
    #pragma omp parallel reduction(+:n)
    n += 1;
    printf("using openmp: %d\n", n);
 
    int count = 0;
    for (int i = 0; i < sizeof(cs); i++) {
        if (CPU_ISSET(i, &cs)) { count++; }
    }
    printf("using sched_getaffinity: %d\n", count);


    return 0;
}

$ CFLAGS=-fopenmp make ccores
cc -fopenmp    ccores.c   -o ccores

From a machine with a single 64 cores cpu with HT activated:

$ ./ccores 
using get_nprocs: 128
using sysconf(_SC_NPROCESSORS_ONLN): 128
using openmp: 128
using sched_getaffinity: 128

Using taskset:

$ taskset -c 1 ./ccores 
using get_nprocs: 128
using sysconf(_SC_NPROCESSORS_ONLN): 128
using openmp: 1
using sched_getaffinity: 1

Same program submitted through slurm on a 96 cores server (HT off) with a single core requested (-c 1):

$ srun -c 1 ./ccores
srun: job 55388620 queued and waiting for resources
srun: job 55388620 has been allocated resources
using get_nprocs: 96
using sysconf(_SC_NPROCESSORS_ONLN): 96
using openmp: 1
using sched_getaffinity: 1

The problem with this code is that sysconf(_SC_NPROCESSORS_ONLN) corresponds to the hardware cores count and not the available core to the running process. The omp pragma is working correctly.

I think a more sane behavior for mmseqs would be to use sched_getaffinity as in the folly library from facebook for example (folly/folly/system/HardwareConcurrency.cpp)

#include <folly/system/HardwareConcurrency.h>

#include <thread>

#include <folly/portability/Sched.h>

namespace folly {

unsigned int hardware_concurrency() noexcept {
#if defined(__linux__) && !defined(__ANDROID__)
  cpu_set_t cpuset;
  if (!sched_getaffinity(0, sizeof(cpuset), &cpuset)) {
    auto count = CPU_COUNT(&cpuset);
    if (count != 0) {
      return count;
    }
  }
#endif

  return std::thread::hardware_concurrency();
}

} // namespace folly

This is similar to https://github.com/bshoshany/thread-pool/issues/161 for example.

Thank you for considering this change !

Jean-Baptiste

Jan 06 '25 15:01 jbd

You can't currently use affinity with MMseqs2. The issue AFAIK is that we use multi stage execution that doesn't play well with affinity.

When you call a MMseqs2 workflow (e.g. mmseqs search) it calls internally a shell script that does multiple calls back to itself to start individual modules (e.g. mmseqs prefilter). I think the issue is that the parent process has an affinity mask applied that results in child processes not working correctly.

Jan 07 '25 08:01 milot-mirdita

Hello, thank you for your answer.

I'm sorry if I got it wrong, but the suggestion was not to use affinity.

It is about using sched_getaffinity primitive to get the number of cores accessible to the process and child processes and not which one to use. This number will only be used as the default thread number, and nothing else.

To illustrate further, here is the search help output through slurm requesting only one core on a 96 cores machines:

$ srun -c 1 mmseqs search -h | grep threads
srun: job 55434337 queued and waiting for resources
srun: job 55434337 has been allocated resources
 --threads INT                    Number of CPU-cores used (all by default) [96]

It really should be [1] instead of [96] here (since I requested one core in the slurm srun command), because it would use 96 cores on a single core.

sysconf(_SC_NPROCESSORS_ONLN) will get the hardware core count Util::omp_thread_count() will get the number of accessible cores to the process

Inverting the test in the Parameters::setDefaults() function will do the trick as long as MMseqs2 has been compiled with openmp support. From my point of view, the best solution would be to explicitly use the CPU_ISSET loop or CPU_COUNT(&cs) from my initial examples.

I hope it makes more sense now.

Thank you !

Jan 07 '25 09:01 jbd

Ah, sorry for the misunderstanding. That is a good idea

Jan 07 '25 14:01 milot-mirdita

Default threads number value should use sched_getaffinity