erpc
erpc copied to clipboard
Arbitrated client stucks -> dead lock
Why the kErpcStatus_Timeout check is needed for releasing the semaphore in case of any error during erpc call in TransportArbitrator::receive() function? Because I don't see any function returning kErpcStatus_Timeout value. This check needs to be reverted and should release the semaphore for any erpc related errors.
Hi @SASIDHARAN-dot , thank you for this notice. You are right that kErpcStatus_Timeout is never returned. This piece of code has been integrated within the #121 PR. @pgu-swir , was it intentional to check the kErpcStatus_Timeout in arbitrator receive function? Is there any other TCP transport layer update that has not been added by you into the PR (returning kErpcStatus_Timeout status)? Could you please comment? Thanks.
Sorry I missed that comment earlier. I don't think "reverted" is the right term b/c my change was not to restrict the release to timeout only, but instead was to prevent the deadlock that was happening all the time when the low layer was failing.
You don't see timeout used here b/c I've implemented it on a port or erpc to espressif's esp-idf and it's actually used for UART & SPI, but I've never done a PR, not fully sure what maintainer would think of an espressif port.
Now, from memory, I only did the check on timeout because the idea was that the erpc would unblock and notify a loop on the client or server from which you can recover, like
void rpc_server(void *args) {
int err;
ESP_LOGI(TAG, "starting erpc server");
erpc_mbf_t message_buffer_factory = erpc_mbf_dynamic_init();
erpc_service_t service = create_IO2_service();
erpc_server_init(transport, message_buffer_factory);
erpc_add_service_to_server(service);
do {
err = erpc_server_run();
ESP_LOGD(TAG, "server stopped, re-launching %d", err);
} while (err == kErpcStatus_Timeout);
erpc_server_deinit();
ESP_LOGW(TAG, "server error %d", err);
vTaskSuspend(NULL);
}
or on the server
erpc_server_init(transport, message_buffer_factory);
erpc_add_service_to_server(service);
while (1) {
err = erpc_server_run();
if (err != kErpcStatus_Timeout && err != kErpcStatus_InvalidArgument) break;
ESP_LOGW(TAG, "server stopped, re-launching %d", err);
}
ESP_LOGE(TAG, "server closed %d, rebooting in %u sec", err, REBOOT_DELAY);
erpc_server_deinit();
vTaskDelay((REBOOT_DELAY*1000) / portTICK_PERIOD_MS);
esp_restart();
If you exit on something else than timeout, bother to unlock was probably un-needed as there is no recovery possible. It has been a while so I don't remember all the motivation for just releasing on timeout, but that's what comes back to mind. But again, my PR here was not to limit the semaphores release to timeout, but to have the release just happen
Thank you, @pgu-swir , understand the motivation. Not necessarily the erpc recovery is needed each time an error is returned from the transport layer. What about to introduce the kErpcStatus_FatalError state and use it in this case instead of the kErpcStatus_Timeout? Once kErpcStatus_FatalError is returned from transport all pending clients would be unblocked and the application could restart the erpc. @Hadatko , do you see any weak point of this approach (possible memory leak or dead lock) or would it be OK? Thanks.
I'm a bit confused. The idea of using kErpcStatus_Timeout was to provide a way to reach & alter the outer control loop in a non-fatal way. If the PHY returns a timeout, the semaphores are released cleanly so that outer loop can restart if it wants to, as timeouts are not fatal, but the rest of the system is healthy. If you replace _Timeout by _FatalError, then the outerloop should be incited to reset (at least the erpc part), no?
Do you mean that we should always release semaphore to always give the outerloop a chance to continue but let the PHY says if it is Fatal or Timeout? I'd have to re-read the meaning of other errors. Remember that the deadlock only happens (AFAIR) when we are both client & server
@Hadatko , may I ask for your opinion? (1) would you use _Timeout or _FatalError for a non-destructive error from the transport layer? (2) Should we unblock all pending clients all the time any kind of error is returned from the transport? Thank you.