Ant-Media-Server
Server Crash - A fatal error has been detected by the Java Runtime Environment
Short description
Our Ant Media Server crashed again earlier this week. We've been discussing occasional server crashes for the past few months, and while a Java process always seems to be the cause, the reasons aren't always the same. We are running the latest version of Ant Media Server Enterprise Edition, 2.13. You can view the log file here: https://ant4.imctransfer.com/hs_err_pid793147.log
Also something to note: apport is extremely slow in collecting the necessary data after a service crash, and this causes the antmedia server to restart with extreme delay (after many minutes). For this reason we have disabled apport for the time being, so that in case of another crash the service will restart quickly and streams can resume faster. I would suggest you also look into this and have apport run in the background, after the service has already been restarted. I believe the current flow is: the service crashes, apport collects information, and only once apport has finished is the service restarted. What we want is for the service to restart immediately after a crash, with error collection running afterwards.
Hi @danavramescu, could you please provide more details using the bug report format:
Short description
Brief description of what happened
Environment
- Operating system and version:
- Java version:
- Ant Media Server version:
- Browser name and version:
Steps to reproduce
Expected behavior
Put as much detail here as possible
Actual behavior
Put as much detail here as possible
Logs
Place logs on pastebin or elsewhere and put links here
Ask your questions on Ant Media Github Discussions
Short description: Service crashed, as it happened before, due to a java process/thread
Environment
- Operating system and version: JRE version: OpenJDK Runtime Environment (17.0.14+7) (build 17.0.14+7-Ubuntu-122.04.1)
- Java version: Java VM: OpenJDK 64-Bit Server VM (17.0.14+7-Ubuntu-122.04.1, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
- Ant Media Server version: 2.13.0
- Browser name and version: not important, as it is not a browser-related issue but a server crash
Steps to reproduce
- unknown
Expected behavior
- stable service that doesn't periodically crash
Actual behavior
- service crashes and restarts
- note here: because the apport service kicks in on a crash and collects a large amount of data, the service doesn't restart quickly in such a situation. For this reason, after this latest crash, we disabled the apport service so that the antmedia service can restart immediately. I believe that at the moment apport runs first after a service crash, collects all necessary data, and only when it has finished is the antmedia service restarted. I strongly recommend and request that you change this so that the service restarts FIRST (immediately) and apport runs in the background or after the restart, so that members see less impact in such scenarios. Right now it can take 30-60 minutes for the service to restart automatically if apport is enabled, and we need this to be instant.
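One way to check whether a crash handler sits between the crash and the restart is to look at the kernel's core dump setting. On Linux, when `/proc/sys/kernel/core_pattern` begins with `|`, core dumps are piped to the named program, and apport normally registers itself this way on Ubuntu. The sketch below only reports that setting; whether it accounts for the 30-60 minute delay described above is an assumption, and the helper name `piped_core_handler` is made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Kernel setting that decides where core dumps go (see core(5)).
CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def piped_core_handler(pattern: str) -> Optional[str]:
    """Return the handler program if core dumps are piped to one, else None.

    Hypothetical helper: a pattern starting with '|' means the kernel pipes
    the core dump to that program instead of writing a core file.
    """
    pattern = pattern.strip()
    if not pattern.startswith("|"):
        return None
    parts = pattern[1:].split()
    return parts[0] if parts else None


if __name__ == "__main__":
    if CORE_PATTERN.exists():
        handler = piped_core_handler(CORE_PATTERN.read_text())
        if handler:
            print(f"core dumps are piped to: {handler}")
        else:
            print("core dumps are not piped to a handler")
```

If the reported handler is apport, a very large JVM process could plausibly keep it busy for a long time, which would be consistent with the delay going away once apport was disabled.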
Logs (I cannot use pastebin because the files are too large)
- https://ant4.imctransfer.com/hs_err_pid793147.log - latest crash log
- https://ant4.imctransfer.com/hs_err_pid3489791.log - previous crash log (February 11). When this happened we were still running Ant Media Server 2.9.0, and we were asked to update to the latest version, which we did. But if you look at the 'Problematic frame', it is the same as in the latest crash log, so the issue was not fully fixed in version 2.13.
A new crash just happened, for the same reason as before. Log file here: https://ant4.imctransfer.com/hs_err_pid1357862.log
NOTE: With the apport service disabled, the antmedia server restarted quickly this time. It is unclear what apport was collecting that caused 30-minute delays between crash and restart. Having it disabled does solve part of the problem, in that the server at least restarts immediately after the crash, but we need you to find the cause of the crash and hopefully prevent it going forward. This is now the third time the service has crashed, and the reason is the same one: Problematic frame:
C [libjingle_peerconnection_so.so+0x45289d] webrtc::PeerConnectionProxyWithInternal<webrtc::PeerConnectionInterface>::local_description() const+0x9d
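To compare crashes without reading each log by hand, the "Problematic frame" entry can be pulled out of every hs_err file. This is a minimal sketch, assuming the usual hs_err layout in which the frame sits on the line right after `# Problematic frame:`; the function names are hypothetical and the files are taken from the command line.

```python
import re
import sys
from pathlib import Path
from typing import Optional

# Assumed layout: "# Problematic frame:" followed by "# <frame>" on the next line.
FRAME_RE = re.compile(r"^#\s*Problematic frame:\s*\n#\s*(.+?)\s*$", re.MULTILINE)
# Library name inside "[name+0xoffset]".
LIB_RE = re.compile(r"\[([^\]+]+)\+0x[0-9a-fA-F]+\]")


def problematic_frame(hs_err_text: str) -> Optional[str]:
    """Return the problematic frame line with whitespace collapsed, or None."""
    match = FRAME_RE.search(hs_err_text)
    return " ".join(match.group(1).split()) if match else None


def frame_library(frame: str) -> Optional[str]:
    """Return the native library named in a frame, or None if there is none."""
    match = LIB_RE.search(frame)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Usage: python frames.py hs_err_pid793147.log hs_err_pid3489791.log ...
    for name in sys.argv[1:]:
        frame = problematic_frame(Path(name).read_text(errors="replace"))
        library = frame_library(frame) if frame else None
        print(f"{name}: library={library} frame={frame}")
```

If every log reports the same library and symbol, that supports the claim that the crashes share one cause, although matching frames alone do not prove it.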
Below a snippet from the other logs, in case it might be extra helpful debugging the crash reason:
Mar 14 08:45:51 ant4 kernel: [30375873.270448] clocksource: timekeeping watchdog on CPU96: hpet wd-wd read-back delay of 134863ns
Mar 14 08:45:51 ant4 kernel: [30375873.270459] clocksource: wd-tsc-wd read-back delay of 131022ns, clock-skew test skipped!
Mar 14 08:45:58 ant4 systemd[1]: antmedia.service: Main process exited, code=killed, status=6/ABRT
Mar 14 08:45:58 ant4 systemd[1]: antmedia.service: Failed with result 'signal'.
Mar 14 08:45:58 ant4 systemd[1]: antmedia.service: Consumed 1month 4w 1d 5h 43min 6.213s CPU time.
Mar 14 08:46:03 ant4 systemd[1]: antmedia.service: Scheduled restart job, restart counter is at 2.
Mar 14 08:46:03 ant4 systemd[1]: Stopped Ant Media Server.
Mar 14 08:46:03 ant4 systemd[1]: antmedia.service: Consumed 1month 4w 1d 5h 43min 6.213s CPU time.
Mar 14 08:46:03 ant4 systemd[1]: Started Ant Media Server.
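The gap between crash and restart can be read off these entries directly. A minimal sketch, assuming syslog-style lines like the ones above: it measures the time from "Main process exited" to "Started Ant Media Server." The function name `restart_delay` is hypothetical, and because the lines carry no year, the year is a parameter with an arbitrary default that only matters if a crash spans a year boundary.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Matches "<Mon> <day> <HH:MM:SS> <host> systemd[1]: <message>"; other lines are skipped.
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: (.*)$")


def restart_delay(
    lines: Iterable[str],
    year: int = 2000,  # arbitrary: the log lines do not include a year
    unit: str = "antmedia.service",
    started: str = "Started Ant Media Server.",
) -> Optional[timedelta]:
    """Return the time between the unit's main process exiting and its next start."""
    exited_at = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        if message.startswith(f"{unit}: Main process exited"):
            exited_at = stamp
        elif message == started and exited_at is not None:
            return stamp - exited_at
    return None
```

Applied to the entries above, which were recorded with apport disabled, the exit is logged at 08:45:58 and the start at 08:46:03, a delay of about five seconds.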
Hi @danavramescu, I couldn't reproduce this in my environment. Also, we haven't received the same logs from any other users. I think this may be related to your environment. Can we schedule a meeting to test on your system?
Hi Burak. Sure, we can schedule one, but let's discuss and schedule over email please.
Hi @danavramescu, we are using jdk17 and having a similar issue with the AMS server crashing and producing the same type of hs_err_pidxxxx.log error report. Did you find a solution to the issue? Thank you for your reply.
@iowagrade can you please provide the hs_err_pidxxxx.log file?
Hi @USAMAWIZARD,
I am uploading 2 zip files here. One contains the hs_err files from AMS server version 2.15.0, and the other contains a similar file that was produced when using webRTC-load-test on another machine that also has jdk17 installed. Thank you for any information you can provide.