
mutex owner died -> POPO__CHUNK_LOCKING_ERROR

Open niclar opened this issue 1 year ago • 57 comments

Hi, we just encountered a fatal crash in RouDi (the first time on b18411364d1cac72422c53a4b254fad0c9061204 since we deployed that commit); RouDi logs below.

An iceoryx "client" process may have died beforehand, but neither the iceoryx logs of those processes nor the syslog indicate this. In any case, this should not crash RouDi. Do you have any pointers as to why we encountered this and how it should be remedied?

(We're running 21 iceoryx client processes on the box, pinned and isolated, with ~1500 publishers, and RouDi running at the highest priority.)

Thanks

Required information

Operating system: Ubuntu 22.04.2 LTS

Compiler version: Clang 17.0.5

Eclipse iceoryx version: b18411364d1cac72422c53a4b254fad0c9061204 (v2.0.x, 20231121)

Observed result or behaviour:

2024-02-14 07:06:02.846 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-f53d20e6d8.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died
2024-02-14 07:06:02.846 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-02-14 07:06:02.846 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-02-14 07:06:02.846 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
libc++abi: terminating

Edit by elBoberido:

Related issues

  • #325
  • #2260

niclar avatar Feb 14 '24 10:02 niclar

@niclar this is related to #325. For the history feature there is a mutex involved. RouDi locks the mutex and can insert the samples into the queue without the interference of the publisher. This mutex is only contested when subscribers are added or removed but nevertheless, the publisher holds the lock for a short amount of time when publishing. If the application is terminated during this time and RouDi tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem but unfortunately, the data guarded by the mutex might also be corrupted and more refactoring would have been necessary to fully fix the problem. Other things came up and the issue never really happened in our setup.
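
For readers not familiar with the mechanism behind the "Owner died" log line: this is the POSIX robust-mutex behaviour, where a lock attempt after the owner died returns EOWNERDEAD and the new owner has to repair the guarded data and call pthread_mutex_consistent() before the mutex is usable again. A minimal sketch in plain POSIX (not iceoryx code, and the recovery step is hypothetical):

```cpp
// Minimal sketch of robust-mutex handling in plain POSIX (not iceoryx code).
// The EOWNERDEAD branch corresponds to the "Owner died" log line above.
#include <pthread.h>
#include <cerrno>

void init_robust_mutex(pthread_mutex_t* mtx)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); // shared between processes
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);    // report dead owners
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
}

void lock_robust(pthread_mutex_t* mtx)
{
    int rc = pthread_mutex_lock(mtx);
    if (rc == EOWNERDEAD)
    {
        // We own the mutex now, but the previous owner died while holding it,
        // so the guarded data may be half-updated and has to be repaired first.
        // repair_guarded_data();        // hypothetical recovery step
        pthread_mutex_consistent(mtx);   // mark the mutex usable again
    }
    // rc == 0: lock acquired normally
}
```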

This should actually only happen if either an application died or if RouDi assumed that an application died. But then you should see something like Application foo not responding (last response 1509 milliseconds ago) --> removing it. If the monitoring is turned off, then RouDi does not detect the dead application immediately but only when the application tries to re-register. But then there is also a log message. If you don't see any log messages for removing or re-registering an application, then I'm a bit at a loss and it could be anything, including cosmic rays.

elBoberido avatar Feb 14 '24 16:02 elBoberido

Looking at the logs again, it looks like RouDi incorrectly assumed four processes had died beforehand. I don't know why it would assume this, though. Is there a starvation issue in the detection?

niclar avatar Feb 23 '24 10:02 niclar

@niclar this is related to #1361. I already improved the situation on master.

elBoberido avatar Feb 23 '24 17:02 elBoberido

Looks like that fix (2023-11-01) was already in the code we were running (2023-11-23).

niclar avatar Feb 26 '24 09:02 niclar

Oh, I just saw v2.0.x and thought it came from the release branch.

Is the system time changed abruptly, so that time jumps can occur?

Another scenario that leads to this error is when the threads which update the heartbeat value are not scheduled in time.

A workaround might be to massively increase PROCESS_KEEP_ALIVE_TIMEOUT. Unfortunately it currently cannot be changed via a CMake parameter, but that would not be difficult to add.

elBoberido avatar Feb 26 '24 15:02 elBoberido

These thread(s) are part of the publisher(s) and not RouDi, right? I reckon they might have been starved in this instance. Being able to plug in deadline-monotonic scheduling or something similar would be nice here.

niclar avatar Feb 27 '24 09:02 niclar

It is a thread of the runtime. I think it should not be too difficult to add an updateHeartbeat method to the runtime; one could then call it manually, roughly as sketched below.
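
Purely as an illustration of that idea, an application could then refresh the heartbeat from a thread it controls. Note that updateHeartbeat() is hypothetical and does not exist in the runtime today:

```cpp
// Hypothetical sketch only: updateHeartbeat() is NOT part of the iceoryx runtime
// API today. It illustrates letting the application refresh the keep-alive signal
// from its own loop instead of relying on the runtime's background thread.
#include "iceoryx_posh/runtime/posh_runtime.hpp"

void applicationLoop()
{
    auto& runtime = iox::runtime::PoshRuntime::getInstance(); // runtime must already be initialized
    while (true)
    {
        // ... application work ...
        runtime.updateHeartbeat(); // hypothetical manual keep-alive call
    }
}
```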

elBoberido avatar Feb 27 '24 12:02 elBoberido

This morning we encountered this fatal error again, during (or shortly after) the morning restart routine of some iceoryx client processes.

niclar avatar Mar 01 '24 09:03 niclar

Do you have more information, e.g. was the load during that time quite high?

elBoberido avatar Mar 01 '24 14:03 elBoberido

CPU core utilization is low overall, with RouDi running on a dedicated, interrupt-isolated core. We hit this again the other day, seemingly "spontaneously", a few hours after a midday restart of ~20 RouDi client processes.

niclar avatar Mar 07 '24 14:03 niclar

Do you have some logs of spikes in the CPU utilization when the issue occurred?

elBoberido avatar Mar 09 '24 13:03 elBoberido

...we've added core utilization logging now for next time it happens...

niclar avatar Mar 11 '24 13:03 niclar

@elBoberido I looked at the issue and was wondering if we could remove the mutex completely if we sacrifice the history. When no history has to be delivered on connecting a subscriber to a publisher, there should also be no need for a mutex.

@niclar Do you require the history when a subscriber connects?

Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible), but then a user might receive the same sample twice. This would only happen when connecting to a new publisher, and it could perhaps be filtered out with sequence numbers; a rough sketch of that idea follows.
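
As a generic, non-iceoryx illustration of the sequence-number idea, a subscriber could remember the last sequence number seen per publisher and drop repeats:

```cpp
// Generic sketch (not iceoryx code): drop samples that were already delivered,
// identified by a monotonically increasing sequence number per publisher.
#include <cstdint>
#include <unordered_map>

class DuplicateFilter
{
  public:
    // returns true if the sample is new, false if it was already seen
    bool accept(std::uint64_t publisherId, std::uint64_t sequenceNumber)
    {
        auto it = m_lastSeen.find(publisherId);
        if (it != m_lastSeen.end() && sequenceNumber <= it->second)
        {
            return false; // duplicate from re-delivery while connecting
        }
        m_lastSeen[publisherId] = sequenceNumber;
        return true;
    }

  private:
    std::unordered_map<std::uint64_t, std::uint64_t> m_lastSeen;
};
```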

elfenpiff avatar Mar 11 '24 14:03 elfenpiff

@elfenpiff, good news, we do not. All pub/sub are instantiated with historyCapacity=0 and historyRequest=0 respectively.
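
For reference, a minimal sketch of how such options are set with the iceoryx v2 typed API; the payload type and service description are placeholders, and the runtime is assumed to be initialized:

```cpp
// Minimal sketch with the iceoryx v2 typed API; MyData and the service
// description are placeholders, and PoshRuntime::initRuntime() is assumed
// to have been called beforehand.
#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"

struct MyData
{
    double value{0.0};
};

void createEndpoints()
{
    iox::popo::PublisherOptions pubOptions;
    pubOptions.historyCapacity = 0U; // keep no samples for late-joining subscribers

    iox::popo::SubscriberOptions subOptions;
    subOptions.historyRequest = 0U;  // request no history when connecting

    iox::popo::Publisher<MyData> publisher({"Service", "Instance", "Event"}, pubOptions);
    iox::popo::Subscriber<MyData> subscriber({"Service", "Instance", "Event"}, subOptions);
}
```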

niclar avatar Mar 11 '24 14:03 niclar

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

elBoberido avatar Mar 11 '24 16:03 elBoberido

@elBoberido

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

We can mitigate this by turning monitoring off. A better alternative would be a liveliness QoS that enforces contracts such as: a subscriber has to collect a sample within X seconds at the latest, and a publisher must publish a sample every Y seconds. But this would require some time-consuming refactoring.

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Yes, you are right. But this could be solved by bringing loffli into play and using a lock-free optional where an atomic signals whether the value is set; a rough sketch follows below. This is also major refactoring, but I think it is solvable.
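
A rough sketch of what such a lock-free optional could look like, assuming a single writer and multiple readers (not iceoryx code):

```cpp
// Rough sketch of the lock-free optional idea (not iceoryx code): an atomic flag
// signals whether the slot is set, assuming a single writer and multiple readers.
#include <atomic>
#include <optional>

template <typename T>
class LockFreeSlot
{
  public:
    // single writer: store the payload, then set the flag with release semantics
    bool set(const T& value)
    {
        if (m_isSet.load(std::memory_order_relaxed))
        {
            return false; // slot already occupied
        }
        m_value = value;
        m_isSet.store(true, std::memory_order_release);
        return true;
    }

    // readers: acquire-load the flag before reading the payload
    std::optional<T> get() const
    {
        if (!m_isSet.load(std::memory_order_acquire))
        {
            return std::nullopt;
        }
        return m_value;
    }

  private:
    T m_value{};
    std::atomic<bool> m_isSet{false};
};
```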

elfenpiff avatar Mar 11 '24 17:03 elfenpiff

Well, the monitoring was explicitly turned on as far as I know :sweat_smile:

elBoberido avatar Mar 11 '24 17:03 elBoberido

@niclar did you encounter this problem again?

In the meantime, I'm a bit hesitant to expose the heartbeat triggering as a public API. An alternative would be to increase the timeout for the detection of died processes. Currently it is about 1.5 seconds and the heartbeat should be sent every 300 ms. If this happens again, can you check the RouDi log output for messages like Application foo not responding (last response #### milliseconds ago) --> removing it and report the number of milliseconds?

elBoberido avatar Apr 14 '24 18:04 elBoberido

As a matter of fact we did, on Friday evening:

2024-04-12 18:35:37.238 [Warn ]: Application marketview.prd not responding (last response 1511 milliseconds ago) --> removing it
2024-04-12 18:35:37.306 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-bf50be8845.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died
2024-04-12 18:35:37.306 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-04-12 18:35:37.306 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-04-12 18:35:37.306 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
libc++abi: terminating

niclar avatar Apr 15 '24 21:04 niclar

Okay, the deadline was missed only by a little bit. For some reason the runtime thread which sends the heartbeat was not triggered for a long time. Is this on Windows or on Linux?

As a quick workaround I would suggest changing the 5 in this line in iceoryx_posh_types.hpp to 10:

constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL; // > PROCESS_KEEP_ALIVE_INTERVAL

It does not fix the problem but should make it much less likely. You could increase the time even more; with e.g. 200 it would take one minute until RouDi assumes a process is dead. The 1.5 seconds date from a time when RouDi did not clean up the resources when a crashed application tried to re-register, so we kept the timeout low in order not to have to wait too long. Now it could indeed be set to one minute, which would not be too bad for cleaning up the remainders of a crashed process but should virtually prevent killing running applications.
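
For illustration, the edited line might then look like this; the values are examples based on the 300 ms keep-alive interval mentioned above:

```cpp
// iceoryx_posh_types.hpp -- illustrative values based on the discussion above;
// PROCESS_KEEP_ALIVE_INTERVAL is 300 ms, so the factor directly scales the timeout.
constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 10 * PROCESS_KEEP_ALIVE_INTERVAL;  // ~3 s instead of ~1.5 s
// or, much more conservative, roughly one minute:
// constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 200 * PROCESS_KEEP_ALIVE_INTERVAL;
```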

The more I think about it, the more I come to the conclusion to just increase this timeout and eliminate this source of corrupt mutexes being left behind.

elBoberido avatar Apr 15 '24 21:04 elBoberido

This is on Linux, with dedicated cores. I applied the workaround and will get back if we encounter it again.

Thanks!

niclar avatar Apr 16 '24 12:04 niclar

Still encountering the issue

2024-04-22 11:32:58.571 [Warn ]: Application X not responding (last response 3046 milliseconds ago) --> removing it
2024-04-22 11:32:58.579 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-2ee4ec5b88.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died
2024-04-22 11:32:58.579 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-04-22 11:32:58.579 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-04-22 11:32:58.579 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR

niclar avatar Apr 22 '24 11:04 niclar

This is weird, especially in combination with the high socket usage from #2268.

Just to exclude other possibilities: you are using the plain upstream version from the commit mentioned in the description, and not a custom branch with that commit cherry-picked?

elBoberido avatar Apr 22 '24 12:04 elBoberido

It might very well be related to the IPC queues, yes.

Yes, the only differences are the attached "patches": increased_process_keep_alive_intervall.patch and service_description.patch

Build options:

-DINTROSPECTION=ON
-DIOX_MAX_PUBLISHERS=4096
-DIOX_MAX_SUBSCRIBERS=4096
-DIOX_MAX_SUBSCRIBERS_PER_PUBLISHER=64
-DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=16
-DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=256
-DIOX_MAX_CLIENTS_PER_SERVER=512
-DIOX_MAX_NUMBER_OF_NOTIFIERS=512

RouDi config:

iox::config::CmdLineArgs_t cmdLineArgs;
cmdLineArgs.run = conf.net.enabled;
cmdLineArgs.logLevel = iox::log::LogLevel::INFO;
cmdLineArgs.monitoringMode = iox::roudi::MonitoringMode::ON;
cmdLineArgs.compatibilityCheckLevel = iox::version::CompatibilityCheckLevel::PATCH;
//cmdLineArgs.uniqueRouDiId
cmdLineArgs.processTerminationDelay = iox::roudi::PROCESS_DEFAULT_TERMINATION_DELAY;
cmdLineArgs.processKillDelay = iox::roudi::PROCESS_DEFAULT_KILL_DELAY;
//cmdLineArgs.configFilePath;

niclar avatar Apr 22 '24 13:04 niclar

We'll update to HEAD in a day or two and see how that plays out...

niclar avatar Apr 22 '24 14:04 niclar

Those patches should not have any negative effect on the heartbeat. Do you run the iceoryx tests after the build? Since you changed some of the compile-time constants, it is strongly recommended to run the tests. I think we need to put some emphasis on this in the documentation.

Oh, and could you increase the timeout to 60_s? If it still fails with such a long timeout, then something is extremely fishy.

elBoberido avatar Apr 22 '24 14:04 elBoberido

It seems like HEAD might have solved the IPC queue issue and this cleanup issue. Let us run it a while longer to be sure.

niclar avatar Apr 22 '24 16:04 niclar

Yeah, I somehow have the feeling that you still had a version which sent the heartbeat via the unix domain socket.

elBoberido avatar Apr 22 '24 17:04 elBoberido

I managed to trigger this again today on f00cfe05c0048301bb72ade454771891696ee24c (same settings as above, but now on Clang 18.1.4):

2024-04-23 11:44:43.001 [Warn ]: Application X not responding (last response 1550 milliseconds ago) --> removing it
2024-04-23 11:44:43.006 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_hoofs/posix/sync/source/mutex.cpp:325 { expected<void, MutexLockError> iox::mutex::lock() -> iox_pthread_mutex_lock } ::: [ 130 ] Owner died
2024-04-23 11:44:43.007 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-04-23 11:44:43.007 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-04-23 11:44:43.007 [Fatal]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [Fatal Error] [POPO__CHUNK_LOCKING_ERROR (code = 61)] in module [iceoryx_posh (id = 2)]
2024-04-23 11:44:43.007 [Fatal]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [PANIC]

niclar avatar Apr 23 '24 11:04 niclar

I'll try setting the timeout to a minute.

niclar avatar Apr 23 '24 12:04 niclar