[Bug] G_percent is not safe to be called from parallel code
Valgrind with default options enabled complains on SIGPIPE for G_percent when called from multiple threads. Shortened output as it is the same for all modules, kept one report per module.
val249826.log-==249826== Command: v.surf.rst input=elev_points3d layer=1 nprocs=4 tension=40.0 segmax=40 npmin=100 zscale=1.0 cvdev=cvdev_threads -c --o
val249826.log-==249826== Process terminating with default action of signal 13 (SIGPIPE)
val249826.log-==249826==    at 0x4BCB9EE: __syscall_cancel_arch (syscall_cancel.S:56)
val249826.log-==249826==    by 0x4BC0667: __internal_syscall_cancel (cancellation.c:49)
val249826.log-==249826==    by 0x4BC06AC: __syscall_cancel (cancellation.c:75)
val249826.log-==249826==    by 0x4C35935: write (write.c:26)
val249826.log-==249826==    by 0x4BBC5F4: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1182)
val249826.log-==249826==    by 0x4BBA8D1: new_do_write (fileops.c:450)
val249826.log-==249826==    by 0x4BBC7F8: _IO_new_file_xsputn (fileops.c:1256)
val249826.log-==249826==    by 0x4BBC7F8: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1198)
val249826.log-==249826==    by 0x4B8B1D1: __printf_buffer_flush_to_file (printf_buffer_to_file.c:59)
val249826.log-==249826==    by 0x4B8B28F: __printf_buffer_to_file_done (printf_buffer_to_file.c:120)
val249826.log-==249826==    by 0x4B95E0C: __vfprintf_internal (vfprintf-internal.c:1560)
val249826.log-==249826==    by 0x4B89E25: fprintf (fprintf.c:32)
val249826.log:==249826==    by 0x499A32C: G_percent (percent.c:88)
val46931.log-==46931== Command: r.gwflow phead=phead status=status hc_x=hydcond hc_y=hydcond s=poros recharge=recharge top=top bottom=bottom type=unconfined dtime=864000000000 maxit=25 error=1e-06 solver=cholesky output=gwresult budget=water_budget -f
val47445.log-==47445== Command: r.horizon elevation=elevation direction=50 start=0.0 end=360.0 distance=1.0 format=plain nprocs=1 output=test_horizon_output_from_elevation file=-
val50157.log-==50157== Command: r.neighbors input=elevation selection=test_neighbors_selection size=3 method=minimum,perc90,average,maximum,stddev weighting_function=none nprocs=4 memory=300 output=test_standard_options_selection_threaded_raster_minimum,test_standard_options_selection_threaded_raster_perc90,test_standard_options_selection_threaded_raster_average,test_standard_options_selection_threaded_raster_maximum,test_standard_options_selection_threaded_raster_stddev
val53541.log-==53541== Command: r.texture input=lsat7_2002_80 nprocs=1 size=3 distance=1 method=sa output=sa
val54633.log-==54633== Command: r.viewshed input=elevation coordinates=634720,216180 observer_elevation=500000 target_elevation=0.0 max_distance=-1.0 refraction_coeff=0.14286 memory=500 output=test_viewshed_from_elevation
Here are the raw values passed to the G_percent call in r.horizon: 0 2 11 1 8 15 13 4 9 5 10 6 3 14 12 7, which is also clearly out of order.
I cannot replicate this bug. I tried multiple times, but I kept getting percentages in the correct order.
@marisn Could you tell me how to get unordered results?
There is no need to replicate the result. Just do an analysis of the code:
#pragma omp parallel for schedule(static, 1) default(shared)
for (int j = hor_row_start; j < hor_row_end; j++) {
    G_percent(j - hor_row_start, hor_numrows - 1, 2);
}
As the for loop iterations are split among all threads, there is no guarantee that the G_percent calls will receive j in increasing order, since execution speed can vary between threads. The reported percentage will therefore not be consistent.
If you don't understand it, just try to run this code:
#include <stdio.h>
#include <omp.h>

int main()
{
#pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < 30; i++) {
        printf("Thread %d is running number %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
gcc -O0 -fopenmp -o parallel_print parallel_print.c
And this is before we talk about possible causes of SIGPIPE.
I totally understand your point.
It should look like this, IN THEORY. I could also fix it by making that part serial, IN THEORY.
My point is that it becomes more difficult for me to verify whether I have ACTUALLY addressed the issue in GRASS if I cannot replicate this error. Replicating it may not be strictly necessary, but it would make our lives easier. So, why not?
Could you tell me which valgrind tool and flags you used to examine the command?
There are two problems:
- when running parallel modules with valgrind integrated into our tests, there is a SIGPIPE. This could well be an artifact of the Python/Valgrind interplay, although it is strange that it manifests only for parallel modules. Unless the root cause is well understood, nothing can be done.
- newly parallelized modules call G_percent inside parallel code regions. This is easy and must be fixed. A quick fix would be to move the G_percent call out of the loop, but that would diminish its usability. The best solution is to start a separate progress-reporting thread and just pass a step-completed marker to that thread; the thread can then count completed steps and call G_percent.
@marisn Could you tell me how to get unordered results?
Just run mprime in the background to max out your CPU. G_percent uses a static variable to track its progress, so it is just a matter of time before you get e.g. (I removed the \b from the printf call):
Calculating map 1 of 1 (angle 190.00, raster map <rm_this_190>) 0% 3% 6% 9% 12% 15% 18% 21% 24% 27% 30% 33% 36% 39% 43% 47% 51% 54% 58% 62% 66% 69% 73% 77% 81% 84% 88% 92% 96% 99% 100% 79% 98%
The reason for such output is simple: normally the static struct member state->prev prevents unordered results (although its update is prone to a race condition). Thus if one thread reports that it has done row 10 but the next thread reports row 5, the counter doesn't go backwards. Still, when the counter reaches 100%, it gets reset to -1. If any threads are late in reporting their results, they are now back in the game: since -1 is less than 79 (in the output above), the counter doesn't appear to move backwards and the late value gets printed.
There is also a chance of a race condition if a thread is suspended between lines 74 and 75 of percent.c: it has checked that it may update state->prev but has not done so yet. If another thread updates prev in the meantime, that value will be overwritten as soon as the suspended thread resumes. Under the right conditions, prev may end up being overwritten with a smaller value than it already contained.