
valgrind reports a ton of 'Uninitialised byte(s) found during client check request'

Open marehr opened this issue 6 years ago • 2 comments

Look at ./inner_product_mpi

> mpirun -n 1 ./inner_product_mpi : -n 1 valgrind ./inner_product_mpi : -n 2 ./inner_product_mpi 
==28739== Memcheck, a memory error detector
==28739== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28739== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==28739== Command: ./inner_product_mpi
==28739== 
==28739== Uninitialised byte(s) found during client check request
==28739==    at 0x6814E31: ??? (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==28739==    by 0x533310E: PMPI_Allgather (in /usr/lib/openmpi/libmpi.so.40.10.0)
==28739==    by 0x152777: ham::net::communicator::communicator(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x152016: ham::offload::runtime::runtime(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x15FC68: ham::offload::ham_main(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x14801E: main (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==  Address 0x1ffefff5e6 is on thread 1's stack
==28739==  in frame #2, created by ham::net::communicator::communicator(int, char**) (???:)
==28739== 
Using target node 1 with hostname t470p
==28739== Uninitialised byte(s) found during client check request
==28739==    at 0x6814E31: ??? (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==28739==    by 0x5365F2C: PMPI_Send (in /usr/lib/openmpi/libmpi.so.40.10.0)
==28739==    by 0x150F7C: void ham::net::communicator::request::send_result<void>(void*, unsigned long) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x151462: ham::offload::detail::offload_result_msg<ham::new_buffer<double>, ham::msg::execution_policy_direct>::operator()() (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x150D96: ham::msg::execution_policy_direct<ham::offload::detail::offload_result_msg<ham::new_buffer<double>, ham::msg::execution_policy_direct> >::handler(void*) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x152B40: ham::msg::active_msg_base::operator()(void*) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x1521ED: ham::offload::runtime::run_receive() (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x15FCA5: ham::offload::ham_main(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x14801E: main (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==  Address 0x1ffefff6cc is on thread 1's stack
==28739==  in frame #3, created by ham::offload::detail::offload_result_msg<ham::new_buffer<double>, ham::msg::execution_policy_direct>::operator()() (???:)
==28739== 
Result: 1.78957e+08
==28739== 
==28739== HEAP SUMMARY:
==28739==     in use at exit: 40,241 bytes in 423 blocks
==28739==   total heap usage: 20,553 allocs, 20,130 frees, 4,370,387 bytes allocated
==28739== 
==28739== LEAK SUMMARY:
==28739==    definitely lost: 12,592 bytes in 153 blocks
==28739==    indirectly lost: 8,657 bytes in 215 blocks
==28739==      possibly lost: 0 bytes in 0 blocks
==28739==    still reachable: 18,992 bytes in 55 blocks
==28739==         suppressed: 0 bytes in 0 blocks
==28739== Rerun with --leak-check=full to see details of leaked memory
==28739== 
==28739== For counts of detected and suppressed errors, rerun with: -v
==28739== Use --track-origins=yes to see where uninitialised values come from
==28739== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)

If you look closely, it always happens within /usr/lib/openmpi/libopen-pal.so. I want to make sure that ham makes those MPI calls (PMPI_Allgather, PMPI_Send) correctly and that the reported problems really originate in /usr/lib/openmpi/libopen-pal.so.

I created a suppression file for valgrind that suppresses those warnings (see attachment).

==28925== Memcheck, a memory error detector
==28925== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28925== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==28925== Command: ./inner_product_mpi
==28925== 
Using target node 1 with hostname t470p
Result: 1.78957e+08
==28925== 
==28925== HEAP SUMMARY:
==28925==     in use at exit: 40,276 bytes in 423 blocks
==28925==   total heap usage: 20,553 allocs, 20,130 frees, 4,370,422 bytes allocated
==28925== 
==28925== LEAK SUMMARY:
==28925==    definitely lost: 12,592 bytes in 153 blocks
==28925==    indirectly lost: 8,657 bytes in 215 blocks
==28925==      possibly lost: 0 bytes in 0 blocks
==28925==    still reachable: 19,027 bytes in 55 blocks
==28925==         suppressed: 0 bytes in 0 blocks
==28925== Rerun with --leak-check=full to see details of leaked memory
==28925== 
==28925== For counts of detected and suppressed errors, rerun with: -v
==28925== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 2)
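
The attached suppression file is not reproduced in this thread. As a rough sketch only, an entry that would match the two reports above could look like the following. The entry name is made up, and `Memcheck:User` is assumed to be the kind Memcheck prints for client check request errors; running valgrind with `--gen-suppressions=all` is the more reliable way to obtain the exact text.

```
{
   openpal_client_check_uninitialised
   Memcheck:User
   obj:/usr/lib/openmpi/libopen-pal.so*
}
```

Such a file is passed with `valgrind --suppressions=<file> ./inner_product_mpi`. Matching on the object alone hides every client check report raised from libopen-pal, including any that point at a real bug in the caller, so a narrower entry (for example with `fun:PMPI_Allgather` as the next frame) may be preferable.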

marehr, May 13 '18 22:05

==23332== Syscall param process_vm_readv(lvec[...]) points to unaddressable byte(s)
==23332==    at 0x601235A: process_vm_readv (in /usr/lib/libc-2.27.so)
==23332==    by 0xE54EE93: mca_btl_vader_get_cma (in /usr/lib/openmpi/openmpi/mca_btl_vader.so)
==23332==    by 0xEF68E0F: mca_pml_ob1_recv_request_get_frag (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xEF692CB: mca_pml_ob1_recv_request_progress_rget (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xEF644F9: ??? (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xEF64773: ??? (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xE54D0BE: mca_btl_vader_poll_handle_frag (in /usr/lib/openmpi/openmpi/mca_btl_vader.so)
==23332==    by 0xE54D404: ??? (in /usr/lib/openmpi/openmpi/mca_btl_vader.so)
==23332==    by 0x67BE6BB: opal_progress (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==23332==    by 0x67C5295: ompi_sync_wait_mt (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==23332==    by 0x5320DFA: ompi_request_default_wait_all (in /usr/lib/openmpi/libmpi.so.40.10.0)
==23332==    by 0x536E5EE: PMPI_Waitall (in /usr/lib/openmpi/libmpi.so.40.10.0)
==23332==  Address 0xe9c9900 is 0 bytes inside a block of size 1,048,576 alloc'd
==23332==    at 0x4C2F246: memalign (vg_replace_malloc.c:857)
==23332==    by 0x4C2F361: posix_memalign (vg_replace_malloc.c:1020)
==23332==    by 0x1554B2: local_allocate(unsigned long) (benchmark_ham_offload.cpp:87)
==23332==    by 0x155B9C: ham_user_main(int, char**) (benchmark_ham_offload.cpp:181)
==23332==    by 0x17295B: ham::offload::runtime::run_main(int, char**) (runtime.cpp:38)
==23332==    by 0x176D88: ham::offload::ham_main(int, char**) (main.cpp:37)
==23332==    by 0x15554D: main (benchmark_ham_offload.cpp:101)
==23332== 

A different one, this time from benchmark_ham_offload.

marehr, May 25 '18 20:05

The uninitialised bytes probably result from a mismatch between the internal buffer size and the size of the buffers actually allocated (one page by default). It wouldn't make much sense to initialise the whole buffer just to make valgrind happy.
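
To make that trade-off concrete, here is a minimal sketch of how a partially filled fixed-size buffer ends up with uninitialised bytes. The `node_record` type and both helper functions are hypothetical and are not taken from the ham sources; the sketch only assumes that some fixed-size record is handed to MPI as raw bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-size record that is transferred as raw bytes, e.g. as
// sizeof(node_record) elements of MPI_BYTE. Not taken from the ham sources.
struct node_record {
    char name[256];
};

// Copies `text` (truncated if necessary) into `rec` and returns the number of
// bytes written, including the terminating NUL. Bytes beyond that are left
// untouched, so they are still uninitialised if `rec` was never cleared.
std::size_t fill_name(node_record& rec, const char* text) {
    assert(text != nullptr);
    std::size_t len = std::strlen(text);
    if (len >= sizeof(rec.name)) {
        len = sizeof(rec.name) - 1;
    }
    std::memcpy(rec.name, text, len);
    rec.name[len] = '\0';
    return len + 1;
}

// Clears the whole record first, so every byte is defined before sending.
// This costs a full write of the buffer on each use.
std::size_t fill_name_zeroed(node_record& rec, const char* text) {
    std::memset(&rec, 0, sizeof(rec));
    return fill_name(rec, text);
}
```

With a plain `node_record rec;` on the stack, `fill_name(rec, "t470p")` defines only 6 of the 256 bytes. If all 256 bytes are then sent, an MPI build that checks its send buffers through Memcheck client requests would be expected to report the rest. `fill_name_zeroed` avoids the report, but it pays for a full clear every time, which is the cost questioned above.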

It could also be internal Open MPI behaviour; if you can, try MPICH and see whether things change.

The unaddressable bytes reported for the process_vm_readv syscall seem to be a false positive or, if not, something outside my control.

I think that when dealing with network buffers and DMA accesses, plenty of these errors are to be expected.

noma, Jul 03 '18 12:07