OSVR-Core

Infinite loop in vrpn_Endpoint_IP::handle_udp_messages

Open tomm opened this issue 8 years ago • 9 comments

I'm not sure whether to report this bug against vrpn, OSVR-Core or OSVR-RenderManager.

Sometimes the RenderManagerOpenGL*Examples hang when the first call to osvrClientUpdate() is made. I can reproduce this on Debian 9 if I build the examples like this:

g++ -g -Wall -lGLEW -lGL -losvrRenderManager -losvrClient -losvrClientKit -losvrCommon RenderManagerOpenGLCAPIExample.cpp -o bad

But they work normally if I build them like this:

g++ -g -Wall -lGLEW -lGL -losvrClient -losvrRenderManager -losvrClientKit -losvrCommon RenderManagerOpenGLCAPIExample.cpp -o good

The only difference is the link order. The 'bad' link order is the one used in the cmake build of OSVR-RenderManager, so all these RenderManagerOpenGL*Examples hang on Debian 9.

Debugging this a little, osvrClientUpdate() ultimately calls vrpn_Endpoint_IP::handle_udp_messages without a timeout. Handling each message appears to take long enough that more packets arrive before the previous one has been handled, so the loop never terminates.

tomm avatar Aug 20 '17 11:08 tomm

The VRPN library used for transport has a function Jane_stop_this_crazy_thing() defined in vrpn_Connection.h that is meant to stop this behavior, which happens most often with video data where there is always more data coming. I was assuming that perhaps the client library or RenderManager was calling this, so that the linking order changed the value of that setting somehow, but neither of them seems to call this. I can't explain how the linking order changes things, but using this call to place a limit on the number of packets handled during each loop iteration should stop the infinite loop.

If the client application is taking longer than one inter-arrival time to handle the packets, this will convert the infinite loop into a bunch of dropped packets and latency (UDP drops more-recent packets). Changing the client code to not do heavyweight processing in the callback should remove the problem in a more robust way.

russell-taylor avatar Aug 21 '17 00:08 russell-taylor

Given that this is one of the example programs and indeed the example program is using the state interface rather than the callback interface, this means that OSVR Core is taking too long to handle the update, which is surprising to me. I still can't make sense of how changing the linking order changes things, because VRPN is statically linked into all of the libraries that use it.

russell-taylor avatar Aug 21 '17 00:08 russell-taylor

It is interesting that, although it needs the include directories to depend on osvrClient, neither the library code nor the example code requires linking against osvrClient (only osvrClientKit) on Windows.

russell-taylor avatar Aug 21 '17 00:08 russell-taylor

I modified the build in the fix-infinite-loop branch of OSVR-RenderManager to not link against osvrClient. That works on Windows, but when I try to compile RenderManager on Linux I get a link failure on boost/config.hpp (included by boost/units/quantity.hpp, which is included by osvr/Util/Angles.h), so I can't test it there. (Not sure how OSVR-Core is compiling; it must be defining -fext-numeric-literals along the way...) Okay, after defining that compiler flag I can now compile, and I will test as soon as I reboot to update my X server. Can you pull that branch and see whether it fixes the problem on Debian? If so, I'll go ahead and issue a pull request.

russell-taylor avatar Aug 21 '17 01:08 russell-taylor

The examples still hang on Debian 9 when built from the fix-infinite-loop branch.

I'm thinking there might be an issue with libvrpn.a being linked in RenderManager, and also in Core.

A few more data points that might be useful. Here are the top 4 functions in the callgrind output of running the 'good' demo:

85,783,736 /home/tom/work/deps/OSVR-Core/vendor/vrpn/vrpn_Connection.C:vrpn_noint_select(int, fd_set*, fd_set*, fd_set*, timeval*) [/usr/local/lib/libosvrCommon.so.0.6]
55,006,938 /home/tom/work/deps/OSVR-Core/vendor/vrpn/vrpn_Shared.C:vrpn_htond(double) [/usr/local/lib/libosvrCommon.so.0.6]
51,649,424 /home/tom/work/deps/OSVR-Core/vendor/eigen/Eigen/src/Jacobi/Jacobi.h:Eigen::JacobiSVD<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 2>::compute(Eigen::Matrix<double, 3, 3, 0, 3, 3> const&, unsigned int)
46,212,986 /home/tom/work/deps/OSVR-Core/vendor/vrpn/vrpn_Connection.C:vrpn_Endpoint_IP::handle_udp_messages(timeval const*) [/usr/local/lib/libosvrCommon.so.0.6]

And here are the top 4 from 'bad':

24,278,260 /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:Eigen::internal::variable_if_dynamic<long, 0>::variable_if_dynamic(long) [/usr/local/lib/libosvrRenderManager.so]
20,941,376 /usr/include/eigen3/Eigen/src/Core/CoreEvaluators.h:Eigen::internal::evaluator<Eigen::PlainObjectBase<Eigen::Matrix<float, 2, 1, 0, 2, 1> > >::evaluator(Eigen::PlainObjectBase<Eigen::Matrix<float, 2, 1, 0, 2, 1> > const&) [/usr/local/lib/libosvrRenderManager.so]
19,140,285 /usr/include/eigen3/Eigen/src/Jacobi/Jacobi.h:void Eigen::internal::apply_rotation_in_the_plane<Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true>, Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true>, double>(Eigen::DenseBase<Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true> >&, Eigen::DenseBase<Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true> >&, Eigen::JacobiRotation const&) [/usr/local/lib/libosvrRenderManager.so]
18,650,966 /build/glibc-p3Km7c/glibc-2.24/elf/dl-lookup.c:do_lookup_x [/lib/x86_64-linux-gnu/ld-2.24.so]

tomm avatar Aug 21 '17 10:08 tomm

These traces are consistent with OSVR-Core doing a lot more work in response to incoming messages in one case than it is in the other, thus not being able to complete processing for one report before the next arrives. I'm assuming that you're running the same server in both cases and thus talking to the same hardware devices, so that the only difference is the linking order.

RenderManager and Core both use and link against VRPN, but they also both use Eigen. If Eigen is being called in one case but not the other, or taking more time in one case than the other, this would cause the behavior you're seeing. When I build RenderManager, I point it at the Eigen header files in Core -- are you pointing at them or at another set? (Considering whether this is a header/library mismatch issue.)

russell-taylor avatar Aug 23 '17 01:08 russell-taylor

I still have the issue when I build OSVR-RenderManager with it pointed at the Eigen headers in Core.

tomm avatar Sep 12 '17 08:09 tomm

That looks like -O0: if you compile Eigen-using code with optimization turned off completely, a pile of accessors that should be inlined, and so should disappear, remain as real function calls. -Og is probably a better alternative.


rpavlik avatar Sep 19 '17 22:09 rpavlik

We just merged a change into OSVR-RenderManager that may address this issue. It looks like the cause may have been overlinking. Have a look at the latest master, or at least fa2e92d5daa2f8c0ff79011cfa8d6a96a427c55b, and see if that fixes the issue.

russell-taylor avatar Oct 12 '17 16:10 russell-taylor