
libgomp: Thread creation failed

Open davidfrantz opened this issue 4 years ago • 2 comments

Reported by @jakimowb via email.

The Level 2 ImproPhe submodule in force-higher-level occasionally throws this error:

________________________________________
Progress:                         26.00%
Time for I/C/O:           007%/092%/001%
ETA:             00y 00m 02d 21h 28m 59s
________________________________________
                   input compute  output
Processing unit:      27      26      25
Tile X-ID:            49      49      49
Tile Y-ID:            27      27      27
Chunk ID:             27      26      25
Threads:               8      22       4
Time (sec):          224    2747      33

libgomp: Thread creation failed: Resource temporarily unavailable
double free or corruption (!prev)
[1]    1148 abort (core dumped)  force-higher-level level2imp.prm.workaround/level2imp.prm

davidfrantz avatar Mar 25 '20 09:03 davidfrantz

There still is a general threading issue in force-higher-level.

It mostly surfaces when using the Level 2 ImproPhe submodule.

I guess it is related to the nested parallelism with OpenMP, wherein 3 teams are used to stream the data. The first team reads data from processing unit pu+1, the second team computes data in pu, and the third team outputs data from pu-1. The teams work simultaneously. Each team can have multiple sub-threads to do the work in parallel.

When doing the work sequentially, i.e. teams work sequentially, this issue does not appear.

I suspect that threads are not re-used and new ones are created instead, and that at some point the maximum number of threads allowed on the system is reached. But this is only a suspicion.

Related to this: the memory footprint of the process keeps growing - which it doesn't when processing sequentially. I wasn't able to track down the problem. Memchecking with valgrind didn't show any memory leak.
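The suspicion about thread exhaustion can be checked from a second shell while the job is running. The sketch below assumes a Linux system with /proc; the helper names `count_threads` and `max_processes` are made up for illustration and are not part of FORCE. "Resource temporarily unavailable" is the EAGAIN error that thread creation returns when a resource limit is hit, so comparing the live thread count against the limits over time should show whether the count really keeps climbing.

```shell
# Hypothetical helpers (not part of FORCE); Linux-only, they read /proc.

# Number of threads the process with the given PID currently has.
count_threads() {
    awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Soft limit on processes/threads that applies to that process (RLIMIT_NPROC).
max_processes() {
    awk '/^Max processes/ {print $3}' "/proc/$1/limits"
}

# Example with the current shell's PID; substitute the PID of the running
# force-higher-level process and repeat the call to watch for growth.
pid=$$
echo "threads: $(count_threads "$pid")  limit: $(max_processes "$pid")"
echo "system-wide thread limit: $(cat /proc/sys/kernel/threads-max)"
```

If the thread count stays near the sum of the configured team sizes (8 + 22 + 4 in the log above), threads are being re-used and the cause is probably elsewhere.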

davidfrantz avatar Mar 25 '20 09:03 davidfrantz

So how do I process the *.prm file sequentially? Do I need to change e.g.

NTHREAD_READ = 8
NTHREAD_COMPUTE = 22
NTHREAD_WRITE = 4

to

NTHREAD_READ = 1
NTHREAD_COMPUTE = 1
NTHREAD_WRITE = 1

or should I just avoid running force-higher-level with parallel, e.g.

`ls *.prm | parallel -j8 force-higher-level {}`

Please note that the error mentioned above occurred running force-higher-level with a single prm file.
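The first option can be tried without touching the original parameter file. The sketch below writes a copy with all three NTHREAD_* values set to 1; the function name and the output file name are made up for illustration, and nothing in this thread confirms that this is equivalent to the teams working sequentially.

```shell
# Hypothetical helper (not part of FORCE): copy a parameter file and set
# NTHREAD_READ, NTHREAD_COMPUTE and NTHREAD_WRITE to 1 in the copy.
set_single_thread() {
    sed -E 's/^(NTHREAD_(READ|COMPUTE|WRITE))[[:space:]]*=.*/\1 = 1/' "$1" > "$2"
}

# Usage (file names are placeholders):
#   set_single_thread level2imp.prm level2imp-1thread.prm
#   force-higher-level level2imp-1thread.prm
```

All other lines of the parameter file are copied unchanged.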

jakimowb avatar Mar 27 '20 16:03 jakimowb