iceoryx

improve process is alive detection

Open elfenpiff opened this issue 2 years ago • 4 comments

Brief feature description

Under high CPU load, the heartbeat thread may fail to send its heartbeats within the required time frame. RouDi then cleans up all resources of the application that missed the heartbeat, which can lead to the still-running application using resources that have already been deleted.

The solution should be as efficient as possible and should avoid context switches and sending messages where possible. One approach could be to use getpgid, which returns the process group ID of a given PID and fails if the PID does not exist. Since PIDs can be reused, the PID alone is not enough; if we couple it with the process runtime or creation time, we can identify a process and check whether it is still alive.
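A minimal sketch of this idea, assuming Linux, where the process start time can be read from field 22 of /proc/<pid>/stat. The names ProcessIdentity, captureIdentity and isSameProcessAlive are hypothetical and not part of iceoryx; this only illustrates the getpgid plus creation-time check, it is not a proposed implementation.

```cpp
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical record stored when the application registers.
struct ProcessIdentity
{
    pid_t pid;
    unsigned long long startTime; // clock ticks since boot (Linux specific)
};

// Reads field 22 (starttime) from /proc/<pid>/stat. Assumption: Linux procfs.
std::optional<unsigned long long> readStartTime(pid_t pid)
{
    std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!stat || !std::getline(stat, line))
    {
        return std::nullopt;
    }
    // Field 2 (comm) is wrapped in parentheses and may contain spaces,
    // so continue parsing after the last ')'.
    const auto pos = line.rfind(')');
    if (pos == std::string::npos)
    {
        return std::nullopt;
    }
    std::istringstream rest(line.substr(pos + 1));
    std::string skipped;
    // Skip fields 3..21 (19 tokens); the next token is field 22.
    for (int i = 0; i < 19; ++i)
    {
        if (!(rest >> skipped))
        {
            return std::nullopt;
        }
    }
    unsigned long long startTime = 0;
    if (!(rest >> startTime))
    {
        return std::nullopt;
    }
    return startTime;
}

std::optional<ProcessIdentity> captureIdentity(pid_t pid)
{
    const auto startTime = readStartTime(pid);
    if (!startTime)
    {
        return std::nullopt;
    }
    return ProcessIdentity{pid, *startTime};
}

// True only if the PID exists and still belongs to the process we recorded.
bool isSameProcessAlive(const ProcessIdentity& id)
{
    if (getpgid(id.pid) == -1 && errno == ESRCH)
    {
        return false; // no process with this PID
    }
    const auto startTime = readStartTime(id.pid);
    // A different start time means the PID was reused by another process.
    return startTime.has_value() && *startTime == id.startTime;
}
```

The monitor would call captureIdentity once at registration and isSameProcessAlive only when a heartbeat is overdue, so no message is exchanged with the application. Reading procfs is still a few system calls, so whether this is cheap enough would need measuring.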

Relates

#1380

elfenpiff avatar May 18 '22 18:05 elfenpiff

@elfenpiff This is related to both #611 and #620. We should follow RAII for the resources of the app. I suppose a hierarchical structure as sketched in the .puml would allow easier handling of the resources in shared memory.

mossmaurice avatar May 19 '22 07:05 mossmaurice

@mossmaurice Loosely related, but the problem here is not the handling of shared memory resources.

RouDi falsely assumes that an application has died because the high CPU load prevented that application from sending the heartbeat within the required time frame.

elfenpiff avatar May 19 '22 08:05 elfenpiff

@elfenpiff I think there is the possibility to use a pipe or stream socket. AFAIK, when the writing end of a pipe/stream socket gets closed, the process holding the receiving end gets a POLLHUP via poll.
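A small sketch of the pipe variant, assuming the monitoring side holds the read end and the application holds the only write end. The function name peerHasHungUp is hypothetical. The kernel closes the write end when the application exits, including on a crash, so no heartbeat thread has to be scheduled for the hang-up to become visible.

```cpp
#include <poll.h>
#include <unistd.h>

// Non-blocking check whether every write end of the pipe has been closed.
bool peerHasHungUp(int readFd)
{
    pollfd pfd{};
    pfd.fd = readFd;
    pfd.events = 0; // POLLHUP is reported in revents even if not requested
    const int ready = poll(&pfd, 1, 0); // timeout 0: return immediately
    return ready > 0 && (pfd.revents & POLLHUP) != 0;
}
```

How the write end reaches the application (inherited over fork, or a stream socket the application connects to) is left open here.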

elBoberido avatar May 19 '22 09:05 elBoberido

Some info from my side about monitor mode: when CPU load is high, there is a high possibility that the "keepalivemsg" can't be sent to RouDi within PROCESS_KEEP_ALIVE_TIMEOUT. We use "posix::FileLock::create(runtimeName);" to check whether the process has really died or not.
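To illustrate the file-lock idea without depending on iceoryx internals, here is a sketch using plain flock(2). It is not the posix::FileLock implementation; the function names and the lock file path are assumptions. The application keeps an exclusive lock for its whole lifetime, and the kernel drops it when the process exits, so the monitor can probe the lock when a heartbeat is overdue.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <cerrno>
#include <string>

// Application side: take the lock at startup and keep the descriptor open.
// Returns the descriptor holding the lock, or -1 on failure.
int acquireRuntimeLock(const std::string& path)
{
    const int fd = open(path.c_str(), O_CREAT | O_RDWR, 0600);
    if (fd == -1)
    {
        return -1;
    }
    if (flock(fd, LOCK_EX | LOCK_NB) == -1)
    {
        close(fd);
        return -1;
    }
    return fd;
}

// Monitor side: true if some process currently holds the lock.
bool isLockHeld(const std::string& path)
{
    const int fd = open(path.c_str(), O_RDONLY);
    if (fd == -1)
    {
        return false; // no lock file, so nothing is registered under this name
    }
    bool held = false;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1)
    {
        held = (errno == EWOULDBLOCK); // lock is owned by a live process
    }
    else
    {
        flock(fd, LOCK_UN); // we got it, so the owner is gone; release again
    }
    close(fd);
    return held;
}
```

The probe briefly takes the lock itself, so a real implementation would have to consider an application starting up at exactly that moment.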

qclzdh avatar Jun 20 '22 02:06 qclzdh