EGL Context creation hangs after VideoCore crash

Open DjPale opened this issue 10 years ago • 20 comments

I have seen similar issues like this - but they're quite old and maybe not related to the same bug. I've compiled SDL2 according to https://solarianprogrammer.com/2015/01/22/raspberry-pi-raspbian-getting-started-sdl-2/.

At some point the display freezes up and the app has to be killed with SIGKILL. After this, any application that tries to create an EGL context won't start at all. The only way to recover that I have found is to reboot the system.

This has happened under both an RPi2 with newest Raspbian distro Linux rpi2 4.1.6-v7+ #810 SMP PREEMPT Tue Aug 18 15:32:12 BST 2015 armv7l GNU/Linux and also version 1 B with the RetroPie distro Linux retropie 3.18.11+ #781 PREEMPT Tue Apr 21 18:02:18 BST 2015 armv6l GNU/Linux.

I tested a very simple program found here: http://pastebin.com/Vnje5sEe which is using the PI GL API directly (not SDL2), and from what I can see the function call eglCreateContext never returns.

I do not have any exact steps to re-create this error yet, but in my opinion a call that never returns should not happen in any case.

DjPale avatar Sep 03 '15 17:09 DjPale

The problem isn't that eglCreateContext doesn't return - it sounds like the gpu has crashed. I suspect that video playback (e.g. hello_video) and quite possibly vcgencmd will also be failing at this point.

It might be worth setting start_debug=1 in config.txt and after the crash running:

sudo vcdbg log msg
sudo vcdbg log assert
sudo vcdbg malloc
sudo vcdbg reloc

Ideally run vcgencmd cache_flush before the malloc/reloc commands, although that command may fail depending on how crashed the gpu is.
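
The collection steps above could be wrapped in a small helper so that each log ends up in a file that can be attached to the issue. This is only a sketch: the function name collect_gpu_logs, the output file names and the SUDO override variable are assumptions made for illustration; only the vcgencmd and vcdbg invocations themselves come from this comment.

```shell
#!/bin/sh
# Sketch: save the VideoCore debug output listed above into one directory.
# collect_gpu_logs, the file names and the SUDO override are hypothetical;
# the vcgencmd/vcdbg commands are the ones quoted in this thread.
collect_gpu_logs() {
    out_dir=${1:-./gpu-logs}
    mkdir -p "$out_dir" || return 1

    for args in "log msg" "log assert" "malloc" "reloc"; do
        if [ "$args" = "malloc" ]; then
            # Flush the cache before the malloc/reloc dumps.
            # This may fail depending on how crashed the gpu is.
            ${SUDO-sudo} vcgencmd cache_flush >/dev/null 2>&1 ||
                echo "vcgencmd cache_flush failed, continuing" >&2
        fi

        file="$out_dir/vcdbg_$(echo "$args" | tr ' ' '_').txt"
        # $args is intentionally unquoted so "log msg" splits into two words.
        ${SUDO-sudo} vcdbg $args >"$file" 2>&1 ||
            echo "vcdbg $args failed, see $file" >&2
    done
}
```

After a crash, calling collect_gpu_logs ./gpu-logs would leave four files (vcdbg_log_msg.txt, vcdbg_log_assert.txt, vcdbg_malloc.txt, vcdbg_reloc.txt) in that directory.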

Really you need to provide a test app that I can run that provokes the gpu crash. That way I can get the gpu debugger connected and see what the problem is.

Just stating the obvious, but if you are having any stability issues, then disable overclocking before running any tests.

popcornmix avatar Sep 04 '15 12:09 popcornmix

I have seen similar issues: after the program freezes and I kill it, it cannot be run again, and only a reboot solves the problem. I have a small program with source that can reproduce this problem. I posted it at this link: https://www.raspberrypi.org/forums/viewtopic.php?f=67&t=121267

But no one has commented.

bluefishisme avatar Oct 14 '15 07:10 bluefishisme

@bluefishisme did you ever figure out how to fix your issue? The reason I ask is that I'm experiencing the exact same symptoms you are, in that even vcgencmd is freezing after openvg calls occur:

ioctl(3, 0xc01cc402

Appears to hang there and all subsequent openvg calls fail.

Also, $ sudo vcdbg log msg shows: 412414.170: vcos_abort: Halting
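
A quick way to check for the same abort is to search the message log for that string. The sketch below assumes the log text is piped in on stdin; the function name gpu_log_has_abort is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a VideoCore message log contains a vcos_abort line.
# gpu_log_has_abort is a hypothetical name; it reads the log text from stdin.
gpu_log_has_abort() {
    # Prints the matching lines and succeeds only if at least one is found.
    grep 'vcos_abort'
}
```

For example: sudo vcdbg log msg 2>&1 | gpu_log_has_abort && echo "gpu has aborted"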

ykram avatar May 23 '16 03:05 ykram

@ykram Do you have an application I can run on raspbian that provokes the vcos_abort? I could at least then determine the backtrace that resulted in that.

popcornmix avatar May 23 '16 15:05 popcornmix

I can upload the source that triggers the issue, although it depends on the OpenVG wrapper (ajstarks/openvg repo). It also requires input, as it uses IPC to dictate how things get drawn, but I can provide a dummy app that sends data so it would work. What's the best way to get those to you?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/raspberrypi/userland/issues/254#issuecomment-221008628

ykram avatar May 23 '16 15:05 ykram

Just zip/tar up the files I need to run and give me a link (e.g. to dropbox/google drive). I don't need the source, just something that provokes a vcos_abort when run.

popcornmix avatar May 23 '16 15:05 popcornmix

I'll try to get this archived and sent to you today. I have to recompile some network specific parts to make it so that you'll be able to reproduce sending/receiving data that the OpenVG calls depend on.

ykram avatar May 23 '16 18:05 ykram

So I've been trying to reproduce this using an application that reads network data and replays it back to the server, so that the OpenVG client can read, interpret and display the data, but I can't get it to crash this way. If I use the application as intended, however, it crashes randomly (vcos_abort()). Is there any way I can generate a stacktrace/coredump and get you those files to debug?

ykram avatar May 24 '16 00:05 ykram

It's the gpu that is calling vcos_abort, so arm stacktrace/coredump won't help. It's not possible to capture a gpu stacktrace/coredump.

popcornmix avatar May 25 '16 15:05 popcornmix

Ah, bummer. I'll work on creating a POC that reproduces the issue and will reply back here as soon as I get something created that I can use to reliably reproduce the bug.

ykram avatar May 25 '16 17:05 ykram

I'm still trying to get a way to reproduce this reliably but in the mean time I did find where the loop/wait seems to occur, if this is helpful:

#0 0x76d8ba40 in do_futex_wait (isem=isem@entry=0x76c29a40 <khrn_queue+76>) at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:48
#1 0x76d8baf4 in __new_sem_wait (sem=0x76c29a40 <khrn_queue+76>) at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:69
#2 0x76b51aa4 in vchiu_queue_pop () from /opt/vc/lib/libvchiq_arm.so
#3 0x76c02be8 in rpc_recv () from /opt/vc/lib/libEGL.so
#4 0x76c132dc in vguLine () from /opt/vc/lib/libEGL.so
#5 0x76da9920 in Line () from /usr/lib/libshapes.so
#6 0x43b66666 in ?? ()

As I said, still working on getting something that you can run that'll reproduce this for you.
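
A gdb-style backtrace of this kind can be taken from a hung ARM-side process by attaching gdb in batch mode. As noted earlier in the thread, it only shows where the ARM side is waiting, not why the gpu stopped responding. The sketch below assumes gdb is installed; the function name dump_arm_backtrace and the SUDO override variable are hypothetical.

```shell
#!/bin/sh
# Sketch: print backtraces for all threads of an already running process.
# dump_arm_backtrace and the SUDO override are hypothetical names.
dump_arm_backtrace() {
    pid=$1
    if [ -z "$pid" ]; then
        echo "usage: dump_arm_backtrace PID" >&2
        return 2
    fi
    # -batch makes gdb run the given command, detach and exit on its own.
    ${SUDO-sudo} gdb -p "$pid" -batch -ex "thread apply all bt"
}
```

For example, dump_arm_backtrace 1234 attaches to process 1234, prints its thread backtraces and exits.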

ykram avatar Jun 03 '16 18:06 ykram

@ykram any progress on the POC?

Ruffio avatar Sep 04 '16 21:09 Ruffio

@ykram any progress on the POC? (This is second ping...)

Ruffio avatar Dec 30 '16 21:12 Ruffio

We saw a very similar (possibly the same) issue in our firmware. We could reproduce it using hello_triangle.bin. Start it, then call tvservice -p and restart hello_triangle.bin. After a few cycles hello_triangle.bin would not start again and the stack shows that it is hanging in eglCreateContext.

After a lot of digging around we realized that we had enable_hdmi_status=1 set in config.txt. After removing it, the issue did not appear again. Do you possibly have that option set as well @DjPale?
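
Checking for that option can be scripted. This is a minimal sketch that assumes the file lives at /boot/config.txt (the path is an assumption and can be overridden); the function name has_hdmi_status_enabled is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if enable_hdmi_status=1 is set on an uncommented line.
# has_hdmi_status_enabled is a hypothetical name; the default path
# /boot/config.txt is an assumption and can be passed as the first argument.
has_hdmi_status_enabled() {
    config=${1:-/boot/config.txt}
    # Allow leading whitespace; lines starting with '#' do not match.
    grep -Eq '^[[:space:]]*enable_hdmi_status=1([[:space:]]|$)' "$config"
}
```

For example: has_hdmi_status_enabled && echo "enable_hdmi_status=1 is set"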

@popcornmix Any thoughts about this?

julianscheel avatar Jan 20 '17 15:01 julianscheel

@julianscheel I've just tried:

while : ; do (./hello_triangle.bin &); sleep 2; tvservice -p; sleep 2; killall hello_triangle.bin; done

with and without enable_hdmi_status=1 and it seems to be running okay. Is that what you meant?

popcornmix avatar Jan 20 '17 18:01 popcornmix

@popcornmix: Can you try again with this script?

#!/bin/sh
while : ; do
        tvservice -p
        ./hello_triangle.bin &
        PID=$!
        tvservice -p
        sleep 5
        kill $PID

        ./hello_triangle.bin &
        PID=$!
        sleep 5
        kill $PID
done

Starting tvservice immediately before hello_triangle seems to be necessary. With this script, I can reliably trigger the bug in a fully updated raspbian and with enable_hdmi_status=1. It usually takes about 10 iterations of the loop to actually happen.

The second invocation of hello_triangle exists just so that it is easier to check whether or not the bug triggered.

edit: It can also take many more iterations than just 10, but so far, the bug always triggers here eventually.
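
Since the number of iterations needed varies, it may help to log each one and to bound the loop. The following is a trimmed sketch of the first half of the script above, not a replacement for it: it does not detect the hang by itself. The function name repro_loop and the APP, TVSERVICE and RUN_SECS variables are hypothetical knobs added for illustration.

```shell
#!/bin/sh
# Sketch: bounded variant of the tvservice/hello_triangle cycle that prints
# the iteration number. repro_loop, APP, TVSERVICE and RUN_SECS are
# hypothetical names; the defaults mirror the script above.
repro_loop() {
    max_iter=${1:-50}
    app=${APP:-./hello_triangle.bin}
    tvservice=${TVSERVICE:-tvservice}
    run_secs=${RUN_SECS:-5}

    i=1
    while [ "$i" -le "$max_iter" ]; do
        echo "iteration $i"
        $tvservice -p
        $app &
        pid=$!
        $tvservice -p
        sleep "$run_secs"
        # Stop the app and reap it before starting the next cycle.
        kill "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true
        i=$((i + 1))
    done
}
```

Calling repro_loop 20 runs twenty cycles; the last iteration number printed before the display stops updating gives a rough idea of how long the bug took to trigger.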

dennis-hamester avatar Jan 23 '17 16:01 dennis-hamester

Any status on resolving this bug? I'm currently being affected by it, even in 2018 with Raspbian Stretch.

camthesaxman avatar Jun 11 '18 19:06 camthesaxman

I doubt anyone is looking at it. Unfortunately it's very low priority, and we have oodles of higher priority stuff to fix/develop.

JamesH65 avatar Jun 12 '18 09:06 JamesH65

@camthesaxman If you have a simple test case you can share that triggers the lockup, then we can investigate the issue.

6by9 avatar Jun 12 '18 09:06 6by9

This issue will be closed within 30 days unless further interactions are posted. If you wish this issue to remain open, please add a comment. A closed issue may be reopened if requested.

JamesH65 avatar Jan 08 '19 13:01 JamesH65