lz4mt
Possible memorypool implementation problem
I am having problems running lz4mt; specifically, it stalls at some point when decompressing.
Running it under valgrind with 3 of its tools (memcheck, helgrind and DRD), all of them stop during decompression with errors:
- memcheck: complains about invalid reads and writes and ends with a "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 that was free'd.
- helgrind: complains about possible data races when a given thread reads or writes.
- DRD: same as helgrind but with a different description; complains about a conflicting load/store by a given thread.
These stalls definitely happen with `lz4mtDecompress` in multithread mode. With single-thread mode I am not sure yet.
Common line in the output of all 3 tools: `==procID== by 0xADDRESS lz4mtDecompress::{lambda(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)#1}::operator()(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)`
I've tested with both gcc 4.7.2 and gcc 4.8.1, locally on my laptop running archlinux with recent packages (linux 3.11.1), and remotely on a cluster running centos with older linux 2.6.18.
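For readers unfamiliar with that memcheck message, here is a hypothetical sketch (my own illustration, not lz4mt's actual MemPool code) of the kind of use-after-free race that makes memcheck report an address "inside a block of size N free'd": one thread frees a pooled buffer while a worker thread may still be reading it.

```cpp
#include <thread>
#include <vector>

struct Buffer {
    std::vector<char> data;
};

int main() {
    Buffer* buf = new Buffer();
    buf->data.resize(240);

    std::thread worker([buf] {
        // If this read runs after the delete below, memcheck reports an
        // invalid read inside a freed block of size 240.
        volatile char c = buf->data[0];
        (void)c;
    });

    delete buf;        // buffer released while the worker may still use it
    worker.join();
    return 0;
}
```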
Hi samuel, thanks for the report! I've checked your problem.
Summary
- I've partly reproduced your problem.
Questions
- Could you describe the "stall" in a bit more detail?
  - e.g. segfault, silence with 0% CPU usage, eating all CPU cycles, etc.
- What kind of data did you use?
- Could you reproduce your problem with enwik8?
- Could you show me your full valgrind log?
Result
- I've reproduced:
  - Possible data race error with `valgrind --tool=helgrind`
  - Conflict (possible data race error) with `valgrind --tool=drd`
- I could not reproduce:
  - Stall / stop when decompressing
  - Invalid reads/writes error with `valgrind --tool=memcheck`
Here is a full result.
Todo
- Investigate valgrind's errors.
- Reproduce samuel's problem.
Thanks for your fast reply, Mr. Takayuki :)
Sorry for my incomplete report; I will try to get you all the information you asked for. For now I only have time for some answers:
- Stall: silence with 0% CPU usage
- I am using binary and text data
- I can, with a certain condition I forgot to mention: output is to `null`

I think this will be enough to reproduce it: `for i in {1..100}; do ./lz4mt481_12Sep_omp -dy --lz4mt-thread=0 enwik8.lz4 null; done`. Wait a little bit and it will eventually happen.
> output is to `null`

It seems that `null` is key to this problem.
I could not reproduce the "stall", but I always got `std::future_error` with the following command:
$ ./lz4mt -d -y enwik8.linux.lz4.c0 null
terminate called after throwing an instance of 'std::future_error'
what(): No associated state
Aborted (core dumped)
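For context, a minimal sketch (unrelated to lz4mt's actual code) of one way a `std::future_error` with the no_state condition can arise: operating on a `std::promise` whose shared state has been moved away. Left uncaught, it terminates with a "No associated state" message like the one above.

```cpp
#include <future>
#include <iostream>
#include <utility>

int main() {
    std::promise<int> p;
    std::promise<int> q = std::move(p);   // p no longer owns a shared state

    try {
        p.set_value(42);                  // throws std::future_error (no_state)
    } catch (const std::future_error& e) {
        std::cout << "caught: " << e.what() << std::endl;
    }
    return 0;
}
```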
> Sorry for my incomplete report,

No problem. It's a good report. This issue list is not for a QA/debug team, so a smaller report is a good starting point :smile:
Todo
- Investigate the `null` output problem.
- Investigate valgrind's errors.
- Reproduce @samalm321's "stall" problem.
Here are 4 valgrind logs, for enwik8 and a binary dataset of mine named msg_bt.bin, for both the helgrind and drd tools:
- http://pastebin.com/zQs5QZ0g helgrind_enwik8.log
- http://pastebin.com/nhLpKPUB helgrind_msgbt.log
- http://pastebin.com/4m0R8FQf drd_enwik8.log
- http://pastebin.com/zYSyNApJ drd_msgbt.log
Build environment:
$ uname -r
2.6.18-128.1.14.el5
$ gcc -v
Using built-in specs.
COLLECT_GCC=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2/gcc
COLLECT_LTO_WRAPPER=/home/cpd18777/gentoo_prefix/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /home/cpd18777/gentoo_prefix/var/tmp/portage/sys-devel/gcc-4.7.2-r1/work/gcc-4.7.2/configure --prefix=/home/cpd18777/gentoo_prefix/usr --bindir=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2 --includedir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include --datadir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2 --mandir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/man --infodir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/info --with-gxx-include-dir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --disable-multilib --with-multilib-list=m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/python --enable-checking=release --disable-libgcj --enable-libstdcxx-time --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --with-local-prefix=/home/cpd18777/gentoo_prefix/usr --enable-__cxa_atexit --enable-clocale=gnu --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.7.2-r1 p1.5, pie-0.5.5'
Thread model: posix
gcc version 4.7.2 (Gentoo 4.7.2-r1 p1.5, pie-0.5.5)
Run environment:
$ uname -r
2.6.32-279.22.1.el6.x86_64
$ valgrind --version
valgrind-3.8.1
Using the most recent version of lz4mt, compiled with `-O0 -g` so valgrind can output more info.
The stall problem has me puzzled; I can't always reproduce it. I think it only happens when lz4mt is executed many times in quick succession (like in that for-loop example) and each execution takes a very small amount of time. enwik8 almost always produces that `std::future_error`, but with the other dataset, msg_bt.bin, the error almost never happens and the execution stalls with 0% CPU usage after some iterations.
Thanks for the logs. I'm checking your report.
2a8ed67: I've resolved the `std::future_error` caused by `null` output.
fb61bf3: I've resolved `valgrind --tool=memcheck`'s "possibly lost" warning.
- This is a false positive: an instance of `Opt` still exists when `exit()` is called.
- To prevent this warning, I've split the real function out of `main()` and its `exit()`s.
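For illustration, a hypothetical sketch of the pattern described above (the names are stand-ins, not lz4mt's real code): `std::exit()` does not unwind the stack, so an object still alive at that point never runs its destructor and its heap memory is still allocated at termination, which memcheck may then flag. Returning from a split-out function instead lets the destructor run.

```cpp
#include <cstdlib>
#include <vector>

// Stand-in for an options object that owns heap memory.
struct Opt {
    std::vector<char> buffer = std::vector<char>(1024);
};

// The real work lives in a function that returns normally, so ~Opt() runs.
static int realMain() {
    Opt opt;
    // ... parse options, run compression/decompression ...
    return 0;                  // destructor of `opt` runs here
}

int main() {
    // Calling std::exit(1) here while an Opt instance was alive would skip
    // its destructor and leave the allocation for memcheck to report.
    return realMain();
}
```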
Todo
- [x] Investigate the `null` output problem.
- [ ] Investigate `valgrind --tool=helgrind`.
- [ ] Investigate `valgrind --tool=drd`.
- [ ] Reproduce `valgrind --tool=memcheck`'s warning:
  - complains about invalid reads and writes and ends with a "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 that was free'd.
- [ ] Reproduce the "stall" problem.
MEMO TO ME
The GNU C++ Library 3. Using - Debugging Support - Data Race Hunting http://gcc.gnu.org/onlinedocs/libstdc++/manual/debug.html#debug.races
c++ - std::thread problems - Stack Overflow http://stackoverflow.com/questions/10618142/stdthread-problems
MEMO
Bug 327881 - False Positive Warning on std::atomic_bool ( helgrind @ valgrind 3.9.0 ) https://bugs.kde.org/show_bug.cgi?id=327881
valgrind-variant https://code.google.com/p/valgrind-variant/source/browse/trunk/valgrind/drd/tests/std_thread.cpp?spec=svn129&r=129
// Test whether no race conditions are reported on std::thread. Note: since
// the implementation of std::thread uses the shared pointer implementation,
// that implementation has to be annotated in order to avoid false positives.
I still have not investigated this issue, but I believe there are both real problems and false positives.
It seems that Valgrind (3.8.1 and 3.9.0) has some problems with `std::atomic_*` and `std::shared_ptr`.
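As a concrete example of that false-positive class (my own illustration, mirroring the KDE bug report above): the following is race-free C++11, because the sequentially consistent `std::atomic` store/load pair orders the plain write to `payload` before the plain read, yet helgrind/DRD of that era may still report a "possible data race" on `payload` since they do not model the happens-before edge created by the atomics.

```cpp
#include <atomic>
#include <thread>
#include <iostream>

int main() {
    std::atomic<bool> ready(false);
    int payload = 0;

    std::thread producer([&] {
        payload = 42;          // plain write, published by the store below
        ready.store(true);     // seq_cst store
    });

    std::thread consumer([&] {
        while (!ready.load())  // seq_cst load synchronizes-with the store
            std::this_thread::yield();
        std::cout << payload << std::endl;   // guaranteed to print 42
    });

    producer.join();
    consumer.join();
    return 0;
}
```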
MEMO
GCC Bugzilla - Bug 51504 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51504
See comments #2 and #3.
Current state of drd and helgrind support for std::thread http://stackoverflow.com/q/8393777/2132223
For me (@t-mat): investigate `std::condition_variable`, which could possibly cause the "stall" problem.
progschj / ThreadPool - Deadlock spotted! #11
using "condition_variable_any" seems to fix the problem, so I think the real problem is inside the "condition_variable" implementation.
lot of bugs still unresolved for condition_variable.: this one in particular seems the same (apparently they forgot to fix it): https://svn.boost.org/trac/boost/ticket/4978
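To make the suspicion concrete, here is a generic sketch (not lz4mt's actual code) of the bug class a `std::condition_variable` stall usually comes from: a notification that fires before the waiter blocks, or a spurious wakeup, can leave a predicate-less `wait()` hanging forever with 0% CPU. Updating the flag under the mutex and using the predicate overload of `wait()` avoids that.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <iostream>

int main() {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread worker([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;           // shared state changed under the mutex
        }
        cv.notify_one();           // may fire before the main thread waits
    });

    {
        std::unique_lock<std::mutex> lock(m);
        // The predicate re-checks `done`, so an early notification or a
        // spurious wakeup cannot leave this thread blocked forever.
        cv.wait(lock, [&] { return done; });
    }

    worker.join();
    std::cout << "no stall" << std::endl;
    return 0;
}
```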