lz4mt

Possible memorypool implementation problem

Open ghost opened this issue 11 years ago • 10 comments

I am having problems running lz4mt: when decompressing, it stalls at some point.

I ran it under three valgrind tools (memcheck, helgrind and DRD). All of them stop during decompression with errors:

-memcheck: complains about invalid reads and writes and ends with a "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 free'd.

-helgrind: complains about possible data races when reading or writing by a given thread

-DRD: same as helgrind but different description, complains about conflicting load/store by a given thread

These stalls definitely happen with lz4mtDecompress in multithreaded mode. I am not sure yet about single-thread mode. All three tools show this common line in their output:

==procID== by 0xADDRESS lz4mtDecompress::{lambda(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)#1}::operator()(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)

I've tested with both gcc 4.7.2 and gcc 4.8.1, locally on my laptop running archlinux with recent packages (linux 3.11.1), and remotely on a cluster running centos with older linux 2.6.18.

ghost avatar Sep 26 '13 10:09 ghost

Hi samuel, thanks for the report ! I've checked your problem.

Summary

  • I've partly reproduced your problem.

Questions

  • Could you describe the "stall" in a bit more detail?
    • E.g. segfault, silence with 0% CPU usage, eating all CPU cycles, etc.
  • What kind of data did you use?
  • Can you reproduce your problem with enwik8?
  • Could you show me your full valgrind log?

Result

  • I've reproduced
    • Possible data race error with valgrind --tool=helgrind
    • Conflict with valgrind --tool=drd
  • I could not reproduce
    • Stall / stop when decompressing
    • Invalid reads/writes error with valgrind --tool=memcheck

Here is a full result.

Todo

  • Investigate valgrind's errors.
  • Reproduce samuel's problem.

t-mat avatar Sep 26 '13 16:09 t-mat

Thanks for your fast reply Mr. Takayuki :)

Sorry for my incomplete report, I will try to get you all the information you asked. But for now I only have time for some answers:

  • Stall: Silence with 0%CPU usage
  • I am using binary and text data
  • I can, under a condition I forgot to mention: the output goes to null. I think this will be enough to reproduce it:
    for i in {1..100}; do ./lz4mt481_12Sep_omp -dy --lz4mt-thread=0 enwik8.lz4 null; done
    Wait a little bit and it will eventually happen.

ghost avatar Sep 26 '13 17:09 ghost

output is to null

It seems that null is the key to this problem. I could not reproduce the "stall", but I always got std::future_error with the following command:

$ ./lz4mt -d -y enwik8.linux.lz4.c0 null
terminate called after throwing an instance of 'std::future_error'
  what():  No associated state
Aborted (core dumped)
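
For reference, "No associated state" is the message libstdc++ attaches to std::future_errc::no_state. The sketch below is not lz4mt's code path; it only shows one standard-defined way to get that error (calling get_future() on a moved-from std::promise) and the valid() check that guards a future before get(). The function names triggersNoState and isSafeToGet are made up for illustration.

```cpp
#include <future>
#include <system_error>
#include <utility>

// Returns true if get_future() on a moved-from promise throws
// std::future_error carrying the no_state error code.
bool triggersNoState() {
    std::promise<int> source;
    std::promise<int> moved = std::move(source);  // source no longer has a shared state
    try {
        std::future<int> f = source.get_future(); // nothing left to attach to
        (void)f;
    } catch (const std::future_error& e) {
        return e.code() == std::make_error_code(std::future_errc::no_state);
    }
    return false;
}

// A future is only safe to call get() on while it refers to a shared state.
bool isSafeToGet(const std::future<int>& f) {
    return f.valid();
}
```

If the error is left uncaught, the process terminates the same way as in the log above.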

Sorry for my incomplete report,

No problem. It's a good report. This issue list is not for a QA/debug team, so a small report is a good starting point :smile:

Todo

  • Investigate null output problem.
  • Investigate valgrind's errors.
  • Reproduce @samalm321's "stall" problem.

t-mat avatar Sep 27 '13 02:09 t-mat

Here are 4 logs from valgrind, for enwik8 and a binary dataset I have named msg_bt.bin, for both helgrind and drd tools:

  • http://pastebin.com/zQs5QZ0g helgrind_enwik8.log
  • http://pastebin.com/nhLpKPUB helgrind_msgbt.log
  • http://pastebin.com/4m0R8FQf drd_enwik8.log
  • http://pastebin.com/zYSyNApJ drd_msgbt.log

Build environment:

$ uname -r
2.6.18-128.1.14.el5

$ gcc -v
Using built-in specs.
COLLECT_GCC=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2/gcc
COLLECT_LTO_WRAPPER=/home/cpd18777/gentoo_prefix/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /home/cpd18777/gentoo_prefix/var/tmp/portage/sys-devel/gcc-4.7.2-r1/work/gcc-4.7.2/configure --prefix=/home/cpd18777/gentoo_prefix/usr --bindir=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2 --includedir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include --datadir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2 --mandir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/man --infodir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/info --with-gxx-include-dir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --disable-multilib --with-multilib-list=m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/python --enable-checking=release --disable-libgcj --enable-libstdcxx-time --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --with-local-prefix=/home/cpd18777/gentoo_prefix/usr --enable-__cxa_atexit --enable-clocale=gnu --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.7.2-r1 p1.5, pie-0.5.5'
Thread model: posix
gcc version 4.7.2 (Gentoo 4.7.2-r1 p1.5, pie-0.5.5) 

Run environment:

$ uname -r
2.6.32-279.22.1.el6.x86_64

$ valgrind --version
valgrind-3.8.1 

I am using the most recent version of lz4mt, compiled with -O0 -g so valgrind can output more info.

The stall problem has me puzzled, because I can't always reproduce it. I think it only happens when lz4mt is executed many times in quick succession (as in that for loop example) and each execution takes very little time. enwik8 almost always produces that std::future_error, whereas with the other dataset, msg_bt.bin, the error almost never happens and the execution instead stalls with 0% CPU usage after some iterations.

ghost avatar Sep 27 '13 18:09 ghost

Thanks for the logs. I'm checking your report.

In 2a8ed67, I've resolved the std::future_error caused by null output.

In fb61bf3, I've resolved the "possibly lost" warning from valgrind --tool=memcheck.

  • This is a false positive. An instance of Opt still exists when exit() is called.
  • To prevent this warning, I've split the real function out from main() and its exit() calls.
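
The idea behind that split can be sketched roughly as follows. This is not the code from fb61bf3; Options is a hypothetical stand-in for Opt, and realMain is an illustrative name. The point is that error paths return from the inner function instead of calling exit(), so the object is destroyed before the process exits.

```cpp
#include <cstdlib>

// Hypothetical stand-in for the Opt object; counts live instances.
struct Options {
    static int live;
    Options() { ++live; }
    ~Options() { --live; }
};
int Options::live = 0;

// All real work happens here. Error paths return instead of calling exit(),
// so `opt` is always destroyed before control gets back to the caller.
int realMain(bool fail) {
    Options opt;
    if (fail) {
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```

With this shape, main() only needs to pass the result of realMain() on to exit() or return it, at which point no Options instance is alive any more.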

Todo

  • [x] Investigate null output problem.
  • [ ] Investigate valgrind --tool=helgrind.
  • [ ] Investigate valgrind --tool=drd.
  • [ ] Reproduce valgrind --tool=memcheck's warning
    • complains about invalid reads and writes and ends with a "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 free'd
  • [ ] Reproduce "stall" problem.

t-mat avatar Sep 28 '13 15:09 t-mat

MEMO TO ME

The GNU C++ Library 3. Using - Debugging Support - Data Race Hunting http://gcc.gnu.org/onlinedocs/libstdc++/manual/debug.html#debug.races

c++ - std::thread problems - Stack Overflow http://stackoverflow.com/questions/10618142/stdthread-problems

t-mat avatar Sep 28 '13 18:09 t-mat

MEMO

Bug 327881 - False Positive Warning on std::atomic_bool ( helgrind @ valgrind 3.9.0 ) https://bugs.kde.org/show_bug.cgi?id=327881

valgrind-variant https://code.google.com/p/valgrind-variant/source/browse/trunk/valgrind/drd/tests/std_thread.cpp?spec=svn129&r=129

// Test whether no race conditions are reported on std::thread. Note: since
// the implementation of std::thread uses the shared pointer implementation,
// that implementation has to be annotated in order to avoid false positives.

I still have not investigated this issue, but I believe there are both real problems and false positives. It seems that Valgrind (3.8.1 and 3.9.0) has some problems with std::atomic_* and std::shared_ptr.
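
To make the false-positive case concrete, here is a minimal sketch of a handoff that is well defined under the C++11 memory model because it synchronizes through std::atomic<bool>. The function name handoff is made up for illustration. A tool that does not treat the atomic store/load pair as synchronization could report the two plain accesses to payload as conflicting; whether Valgrind 3.8.1 or 3.9.0 flags this exact snippet has not been checked here.

```cpp
#include <atomic>
#include <thread>

// Publishes `payload` from a worker thread and reads it on the calling thread.
// The release store / acquire load pair on `ready` orders the two plain
// accesses, so they do not race.
int handoff() {
    int payload = 0;
    std::atomic<bool> ready(false);

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish
    });

    while (!ready.load(std::memory_order_acquire)) {   // wait for publication
        std::this_thread::yield();
    }
    int seen = payload;                                // plain read, ordered after the write
    producer.join();
    return seen;
}
```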

t-mat avatar Dec 07 '13 04:12 t-mat

MEMO

GCC Bugzilla - Bug 51504 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51504

See comment #2 and #3.

Current state of drd and helgrind support for std::thread http://stackoverflow.com/q/8393777/2132223

t-mat avatar Mar 26 '14 00:03 t-mat

Note to self (@t-mat): investigate std::condition_variable, which may be causing the "stall" problem.

progschj / ThreadPool - Deadlock spotted! #11

using "condition_variable_any" seems to fix the problem, so I think the real problem is inside the "condition_variable" implementation.

A lot of bugs are still unresolved for condition_variable. This one in particular seems to be the same (apparently they forgot to fix it): https://svn.boost.org/trac/boost/ticket/4978
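
Separately from any implementation bug, a silent stall at 0% CPU is also what a lost wakeup looks like at the usage level, so it may be worth ruling that out first. The sketch below is not lz4mt code; Signal and notifyThenWait are hypothetical names. It shows the predicate-guarded wait, which still returns when the notification arrives before the waiter starts waiting.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot signal.
class Signal {
public:
    void notify() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_ = true;  // state change happens under the mutex
        }
        cv_.notify_one();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate is checked before sleeping, so a notify() that ran
        // earlier is not lost, and spurious wakeups are ignored.
        cv_.wait(lock, [this] { return ready_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool ready_ = false;
};

// Notifies before the waiter starts waiting; returns once wait() has returned.
bool notifyThenWait() {
    Signal s;
    std::thread t([&] { s.notify(); });
    t.join();   // the notification has already happened at this point
    s.wait();   // returns immediately because ready_ is already true
    return true;
}
```

A bare cv.wait(lock) without the flag would block forever in notifyThenWait(), because nothing notifies again after the waiter arrives.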

t-mat avatar Mar 30 '14 10:03 t-mat