volk
volk copied to clipboard
volk_get_index broken, stuck in infinite loop
While debugging gnuradio/gnuradio#5013, I stumbled across volk_get_index
calling itself indefinitely when using volk_8u_x2_encodeframepolar_8u
(at least on Deb11). No matter the reason for finding the right implementation of that kernel failing, this function mustn't go into infinite recursion.
Maybe, we should just actually return -1;
and deal with that (by abort
ing) in the calling functions volk_rank_archs
and volk.tmpl.c
: ${kern.name}_manual
. That would at least make debugging easier.
(gdb) bt
#0 volk_get_index (impl_name=<optimized out>, n_impls=<optimized out>, impl_names=<optimized out>) at ./lib/volk_rank_archs.c:44
#1 volk_rank_archs (kern_name=kern_name@entry=0x7fd65caaf788 "volk_8u_x2_encodeframepolar_8u", impl_names=impl_names@entry=0x7fd65cb8da90 <volk_machine_avx2_64_mmx_orc+67632>, impl_deps=impl_deps@entry=0x7fd65cb8db50 <volk_machine_avx2_64_mmx_orc+67824>,
alignment=alignment@entry=0x7fd65cb8dbb0 <volk_machine_avx2_64_mmx_orc+67920>, n_impls=n_impls@entry=4, align=align@entry=true) at ./lib/volk_rank_archs.c:68
#2 0x00007fd65c986088 in __init_volk_8u_x2_encodeframepolar_8u () at ./obj-x86_64-linux-gnu/lib/volk.c:10997
#3 __volk_8u_x2_encodeframepolar_8u (frame=0x10d7a80 "", temp=0x12c1da0 "", frame_size=16) at ./obj-x86_64-linux-gnu/lib/volk.c:11022
#4 0x00007fd65b998230 in gr::fec::code::polar_decoder_sc_systematic::generic_work(void*, void*) () from /gnuradio/build/gr-fec/lib/libgnuradio-fec.so.3.10.0git
#5 0x00007fd65b95a37f in gr::fec::decoder_impl::general_work(int, std::vector<int, std::allocator<int> >&, std::vector<void const*, std::allocator<void const*> >&, std::vector<void*, std::allocator<void*> >&) () from /gnuradio/build/gr-fec/lib/libgnuradio-fec.so.3.10.0git
#6 0x00007fd65cf2de7a in gr::block_executor::run_one_iteration() () from /gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime.so.3.10.0git
#7 0x00007fd65cfa159a in gr::tpb_thread_body::tpb_thread_body(std::shared_ptr<gr::block>, std::shared_ptr<boost::barrier>, int) () from /gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime.so.3.10.0git
#8 0x00007fd65cf92b44 in gr::thread::thread_body_wrapper<gr::tpb_container>::operator()() () from /gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime.so.3.10.0git
#9 0x00007fd65d1bf787 in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:179
#10 0x00007fd65f0c3ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007fd65ee56def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) n
Thread 3 "fec_decoder4" hit Breakpoint 2, volk_get_index (impl_name=<optimized out>, n_impls=<optimized out>, impl_names=<optimized out>) at ./lib/volk_rank_archs.c:36
36 in ./lib/volk_rank_archs.c
(gdb) bt
#0 volk_get_index (impl_name=<optimized out>, n_impls=<optimized out>, impl_names=<optimized out>) at ./lib/volk_rank_archs.c:36
#1 volk_rank_archs (kern_name=kern_name@entry=0x7fd65caaf788 "volk_8u_x2_encodeframepolar_8u", impl_names=impl_names@entry=0x7fd65cb8da90 <volk_machine_avx2_64_mmx_orc+67632>, impl_deps=impl_deps@entry=0x7fd65cb8db50 <volk_machine_avx2_64_mmx_orc+67824>,
alignment=alignment@entry=0x7fd65cb8dbb0 <volk_machine_avx2_64_mmx_orc+67920>, n_impls=n_impls@entry=4, align=align@entry=true) at ./lib/volk_rank_archs.c:68
#2 0x00007fd65c986088 in __init_volk_8u_x2_encodeframepolar_8u () at ./obj-x86_64-linux-gnu/lib/volk.c:10997
#3 __volk_8u_x2_encodeframepolar_8u (frame=0x10d7a80 "", temp=0x12c1da0 "", frame_size=16) at ./obj-x86_64-linux-gnu/lib/volk.c:11022
#4 0x00007fd65b998230 in gr::fec::code::polar_decoder_sc_systematic::generic_work(void*, void*) () from /gnuradio/build/gr-fec/lib/libgnuradio-fec.so.3.10.0git
#5 0x00007fd65b95a37f in gr::fec::decoder_impl::general_work(int, std::vector<int, std::allocator<int> >&, std::vector<void const*, std::allocator<void const*> >&, std::vector<void*, std::allocator<void*> >&) () from /gnuradio/build/gr-fec/lib/libgnuradio-fec.so.3.10.0git
#6 0x00007fd65cf2de7a in gr::block_executor::run_one_iteration() () from /gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime.so.3.10.0git
#7 0x00007fd65cfa159a in gr::tpb_thread_body::tpb_thread_body(std::shared_ptr<gr::block>, std::shared_ptr<boost::barrier>, int) () from /gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime.so.3.10.0git
#8 0x00007fd65cf92b44 in gr::thread::thread_body_wrapper<gr::tpb_container>::operator()() () from /gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime.so.3.10.0git
#9 0x00007fd65d1bf787 in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:179
#10 0x00007fd65f0c3ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007fd65ee56def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
This is a tail recursion in volk_get_index
. Sadly, it will never terminate (until it causes memory exhaustion in ctest, I guess).
volk_get_index
is a bag of mixed emotions for me:
int volk_get_index(const char* impl_names[], // list of implementations by name
const size_t n_impls, // number of implementations available
const char* impl_name // the implementation name to find
)
{
unsigned int i;
for (i = 0; i < n_impls; i++) {
if (!strncmp(impl_names[i], impl_name, 20)) {
return i;
}
}
// TODO return -1;
// something terrible should happen here
fprintf(stderr, "Volk warning: no arch found, returning generic impl\n");
return volk_get_index(impl_names, n_impls, "generic"); // but we'll fake it for now
}
-
strncmp(a,b,20)
: We'll happily compare equal impls that are only different after the first 20 characters. - It never checks whether the impl that can't be found is the
_generic
one. If that is the case, it just recurses to looking up the generic one, and we end up where we are.
What is confusing is why neither volk_get_info
nor the volk_profile
tool know of volk_8u_x2_encodeframepolar_8u_generic
. In fact, the latter doesn't seem to want to know any implementations of that at all.
Originally posted by @marcusmueller in https://github.com/gnuradio/gnuradio/issues/5013#issuecomment-923369047
Thanks for the report and investigation!
If we could come up with a system that is smarter than "iterate over a list of strings and compare", that'd be great.
A couple of issues:
- We search for a substring and just hope the kernel name does not exceed 20 characters. What would be best practice in these cases? Rely on a string terminator?
- We never check if "generic" is findable. We should do so in case the first loop does not return.
- We need to decide what to do in case, this whole thing fails.
return -1
seems to be correct. - Our code to find, load and assign function pointers for our kernels is rather obscure. I assume we want to think about options to improve it. #440 is another obstacle and I'm not yet convinced #453 would be the correct approach to fix it.
We search for a substring and just hope the kernel name does not exceed 20 characters. What would be best practice in these cases? Rely on a string terminator?
Well, ideally, both the list of implementation and the substring would already be sanitized, so strcmp
would work.
We never check if "generic" is findable. We should do so in case the first loop does not return.
agreed.
We need to decide what to do in case, this whole thing fails. return -1 seems to be correct.
But it needs to come with some error handling.
Our code to find, load and assign function pointers for our kernels is rather obscure. I assume we want to think about options to improve it.
Agreed. I especially think the way the impls of a kernel are stored is inelegant and doesn't reflect our needs: there's separate lists for _u and _a kernels, instead of one sorted-by-speed list, where the byte-aligment is just an integer property of the implementation description, that upon constructing things can be checked. In essence, a sorted list of the following struct would make more sense:
struct impl_descriptor {
unsigned int rank; // from benchmarking, or 0 if not benchmarked yet
machine_emum machine; //e.g. GENERIC, or MMX. Don't lug around strings for comparison.
short alignment;
function_ptr impl; // the actual function pointer we'll call
char* full_machine_name; // or string or whatever, something like generic_presorting or sse4_alternative, or zeroptr if "pure" machine
// leaves us room for future extension, e.g. a field for in-placeness.
};
We could then have a kernel_descriptor
and use it to select the best kernel:
struct kernel_descriptor {
char* name; // 0-terminated
//sorted list of impls, sorted by volk_profile rank, followed by machine, followed by alignment
struct impl_descriptor **implementations; // 0-terminated list of pointers to impl descriptors
};
struct impl_descriptor* find_best_match(struct kernel_descriptor* kernel,
enum machine mach,
short aligment) {
struct impl_descriptor *current_impl = kernel->implementations[0];
while(current_impl != 0) {
if(current_impl->machine == mach && current_impl->alignment <= alignment)
return current_impl;
current_impl++;
}
}
honestly, the whole magic would be in sorting **implementations
, and that would be 4 lines of code in C++11, but is an error-prone ride in C; not quite sure why we need our initialization to be C, not C++?
Regarding C vs. C++:
Everything that goes into libvolk...
is written in C. At least it is supposed to.
All the apps and tools are written in C++. e.g. tests, volk_profile, ...
Further, we provide a C++ wrapper around libvolk. That one needs some fixes that break our API.
If we want to change this, we need to discuss it and make a conscious decision.
An immediate fix here might be:
// add this if statement
if (strncmp(impl_name, "generic", 20)) {
return -1;
}
// proceed
return volk_get_index(impl_names, n_impls, "generic");
Though, it may cause issues wherever the return goes.
We might just up the character limit
if (!strncmp(impl_names[i], impl_name, 42)) // instead of 20
Or we could figure out how many characters are available in impl_names[i]
and use that value. I assume all impl names are \0
terminated.
Besides, I'd like to sketch an idea
Most users just use the function pointer, or _u
and _a
versions thereof. In these cases, we need to load the correct pointers on the first call and then just leave them there. Our current mechanism works but has issues like the one we discuss.
e.g. volk_profile
needs the _manual
function that enables us to select a specific implementation. However, it calls this function as well.
I'd like to split this into 2 parts. One function that returns the desired function pointer and one that does manual
. So users may use the function pointers manually. We had request about that already.
Also, we can better separate the "Find the correct impl" and the "profile an impl" part.
@marcusmueller Thanks for:
podman run --cap-add=SYS_PTRACE -it --rm gnuradio/ci:debian-11-3.10
I'll try it as soon as I can.
I'd like to add a few more specifics. I work in a Docker Container:
FROM debian:bullseye
RUN apt-get update && apt-get install -y libvolk2-dev libvolk2-bin libvolk2-doc
I build and run the container with:
docker build -t debian11-bin .
docker run -it --rm debian11-bin
In that container, I run volk_profile
with various args. Most run through without an issue. The one that causes an issue:
# volk_profile -n -R polar
RUN_VOLK_TESTS: volk_8u_x3_encodepolarpuppet_8u(131071,1987)
generic completed in 1374.23 ms
u_ssse3 completed in 732.772 ms
u_avx2 completed in 718.598 ms
a_ssse3 completed in 741.293 ms
a_avx2 completed in 620.256 ms
Best aligned arch: a_avx2
Best unaligned arch: u_avx2
RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,1987)
^C
I have to kill the last process because nothing ever happens. Except a threads causes 100% CPU load and the fans on my machine yell at me. On the same machine, source build, but outside that container:
volk_profile -n -R polarbutter
RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,1987)
generic completed in 6013.26 ms
u_avx completed in 894.084 ms
u_avx2 completed in 861.248 ms
Best aligned arch: u_avx2
Best unaligned arch: u_avx2
So it works on a Ubuntu 20.04 host. @maitbot reports that volk_profile
runs flawlessly on Debian 11 machines. This might be a container issue. Anywaus, let's try a Debian 11 build in a container.
FROM debian:bullseye
RUN apt-get update && apt-get install -y build-essential git cmake python3-mako python3-distutils liborc-dev
Run the container and build VOLK inside:
docker build -t volk-debian11 .
docker run -it -v "$(pwd)":/opt --rm volk-debian11
# cd /opt
# mkdir docker-build
# cd docker-build
# cmake ..
# make -j8
# ./apps/volk_profile -n -R polarbutter
Warning: this IS a dry-run. Config will not be written!
RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,1987)
generic completed in 5502.05 ms
u_avx completed in 872.164 ms
u_avx2 completed in 877.575 ms
Best aligned arch: u_avx
Best unaligned arch: u_avx
Warning: this was a dry-run. Config not generated
I toggle 2 build options in my VOLK build
--- i/CMakeLists.txt
+++ w/CMakeLists.txt
@@ -49,6 +49,7 @@ if(HAVE_C_LIMITED_RANGE)
endif(HAVE_C_LIMITED_RANGE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall")
+#set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wall -fsanitize=undefined")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wall")
add_definitions(-D_GLIBCXX_USE_CXX11_ABI=1)
Specifically, I add the build flag -fsanitize=undefined
. If it is commented out, volk_profile -n -R polarbutter
goes through as expected. With -fsanitize=undefined
, RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,1987)
just hangs. On the plus side, the issue should be easier to reproduce. However, we still don't know what exactly breaks.
Note that this is a bug, but might be a different one.
__init_volk_8u_x2_encodeframepolar_8u
was the infinite-loop-inducing call.
Could you run into the infinite loop, attach GDB and do a backtrace? (you'll need to give
your container SYS_PTRACE
capabilities for that)
Summary: I couldn't reproduce any of those issues. See below.
Unfortunately, I was to quick. Toggling -fsanitize=undefined
has a serious performance impact, but is probably not the cause of this issue.
With -fsanitize=undefined
RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,100)
generic completed in 6923.07 ms
u_avx completed in 921.302 ms
u_avx2 completed in 921.018 ms
Best aligned arch: u_avx2
Best unaligned arch: u_avx2
Warning: this was a dry-run. Config not generated
without -fsanitize=undefined
:
Warning: this IS a dry-run. Config will not be written!
RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,100)
generic completed in 310.311 ms
u_avx completed in 45.4236 ms
u_avx2 completed in 58.3035 ms
Best aligned arch: u_avx
Best unaligned arch: u_avx
Warning: this was a dry-run. Config not generated
The generic
version takes >22x the execution time. The u_avx
implementation >20x time. With the default 1987 iterations, it should not be a surprise that "the kernel just hangs" is a result. The volk_profile
test with the Debian 11 binary shows the same poor performance but finishes.
The only clear indicator for the infinite loop would be if your terminal is flooded with
Volk warning: no arch found, returning generic impl
Even with the GNU Radio CI container: gnuradio/ci:debian-11-3.10
I can't reproduce this infinite recursion. I tried volk_profile -R encodepolar -i 100
as well. The same result, it is slow but it finishes.
I can't reproduce the issue here or gnuradio/gnuradio#5013 .
it should not be a surprise that "the kernel just hangs" is a result.
Hm, could you attach a debugger inside the container or at least perf top on the outside? We shouldn't be guessing here, when we have the tools to get backtraces ;)
I did a bit of digging, and this seems to be due to a problematic debian patch: https://github.com/gnuradio/gnuradio/issues/5013#issuecomment-1279645954
Would it be possible to have the build fail if any kernel is missing a generic implementation?
Actually, this would be a good requirement. It may require a bit of digging where to implement this check.
Besides, thanks for your investigation. This sheds quite a bit of light on the issue.
This seems to be the culprit: Debian specific patch to VOLK
I suggest to close this issue because we could trace the source to a patch outside this repo. I hope this bug won't reproduce with later releases.
I agree that this issue can be closed off, since it was not a bug in VOLK itself.
It might still be worth improving the build process so that it fails if any kernel is missing a generic implementation, but that could be tracked in a separate issue.
We found the root cause of this bug outside VOLK. I'm closing this issue now.
Unfortunately this is not entirely fixed. There are some kernels which do not have an implementation named generic
and invoking any of them while VOLK_GENERIC
is set still results in an infinite loop. The kernels are:
-
volk_32f_asin_32f
: only hasu_generic
-
volk_32f_exp_32f
: only hasu_generic
-
volk_32f_s32f_add_32f
:generic
is inaccessible due to a bug; fixed in #641 -
volk_32fc_s32f_magnitude_16i
:generic
is inaccessible due to a bug; fixed in #641 -
volk_32u_reverse_32u
: nogeneric
at all
Initially I thought that asserting the presence of a generic
implementation in every kernel (in volk_kernel_defs.py
) would be a good idea, but the presence of volk_32u_reverse_32u
(with variously named implementations and no generic
) suggests to me that maybe kernels aren't required to have a generic
implementation and therefore code that assumes the presence of generic
is incorrect:
https://github.com/gnuradio/volk/blob/a26a1b808ddd75baeebe660e324f23c36dec6416/lib/volk_rank_archs.c#L29-L32
https://github.com/gnuradio/volk/blob/a26a1b808ddd75baeebe660e324f23c36dec6416/lib/volk_rank_archs.c#L52-L57
run_volk_tests
also searches for the generic
implementation so that it can use it as the reference implementation, but it at least falls back to index 0 when there isn't one:
https://github.com/gnuradio/volk/blob/a26a1b808ddd75baeebe660e324f23c36dec6416/lib/qa_utils.cc#L716-L723