
WebSocket may cause a failure while the semaphore lock in esp_websocket_client_task is in place and esp_websocket_client_send_text is used elsewhere (IDFGH-13387)

Open filzek opened this issue 1 year ago • 10 comments

Answers checklist.

  • [X] I have read the documentation for esp-protocols components and the issue is not addressed there.
  • [X] I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • [X] I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

The WebSocket client has a problematic lock mechanism that hinders concurrent usage and the dispatching of received and sent data. While receiving data packets in a stream and simultaneously sending data, the lock taken in esp_websocket_client_task at the line:

if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {

can take an excessive amount of time to release, sometimes exceeding 10 seconds. This delay can cause the send operation to fail, which is not aligned with resilient transmission practices.

It appears that the task loops continuously without any delay, so it never yields. In a production environment, there should be pauses within the semaphore lock cycle to enable proper multitasking and ensure that user-side operations can function correctly.
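
Roughly, the pattern looks like this (a minimal sketch of the behaviour described above, not the actual esp_websocket_client code):

// Sketch: the task re-takes the recursive mutex immediately after giving it,
// so another task blocked on the same mutex rarely gets a chance to run.
while (client->run) {
    if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
        break;
    }
    // ... receive and dispatch data while holding the lock ...
    xSemaphoreGiveRecursive(client->lock);
    // No delay here: the next iteration takes the lock again right away,
    // starving esp_websocket_client_send_text() callers in other tasks.
}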

filzek avatar Aug 02 '24 18:08 filzek

It has been confirmed that the WebSocket task routine takes a lock that is not properly shared with concurrent multithreaded tasks: the lock is taken again by the task before other semaphore requests can be processed.

To solve this issue, add a minor delay of 100 ms in the task loop, as follows:

while (client->run) {
    ...
    xSemaphoreGiveRecursive(client->lock);
    vTaskDelay(100); // Tested with 100ms and it worked well
}

This small delay allows other tasks to acquire the semaphore, ensuring proper concurrency and improving the overall performance of the WebSocket communication.

I will also prepare this update so that a specific option can be set in the websocket config, enabling the user to adjust the delay as needed.
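
Something along these lines; the start_client() helper and the task_loop_delay_ms field are hypothetical and only illustrate the kind of option being proposed:

#include "esp_websocket_client.h"

// Hypothetical config field, shown only to illustrate the proposed option;
// it does not exist in the current esp_websocket_client API.
static esp_websocket_client_handle_t start_client(void)
{
    esp_websocket_client_config_t cfg = {
        .uri = "wss://example.com/ws",
        // .task_loop_delay_ms = 100,   // proposed: per-iteration delay of the client task
    };
    return esp_websocket_client_init(&cfg);
}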

filzek avatar Aug 02 '24 20:08 filzek

Hi @filzek, thanks for reporting. To help my investigation of the issue, could you please describe the scenario in which you observed this long time to release? If you could provide more details of your system, or perhaps a reproducer, it would be very helpful.

euripedesrocha avatar Aug 05 '24 12:08 euripedesrocha

We cannot provide a working example since we cannot detach the application, but the scenario where this issue occurs is quite simple: the internal task of the websocket cycles so quickly that the semaphore is effectively only ever taken by the task itself.

We use a queue to send messages to the websocket. This queue handles a large number of packets, and when multiple packets are received simultaneously from the server, the websocket task holds this lock. We are not sure why the websocket behaves this way, but we have added a deadlock detection mechanism to the code for debugging purposes:

if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
    ESP_LOGE(TAG, "Failed to lock ws-client tasks, exiting the task...");
    break;
}
printf("LOCK FROM esp_websocket_client_task - [%d]\n", __LINE__);

xSemaphoreGiveRecursive(client->lock);
printf("LOCK RELEASED FROM esp_websocket_client_task - [%d]\n", __LINE__);

This helped us identify the root cause. We found that using vTaskDelay solves the concurrency lock issue. With a 100ms delay, there is no lock, but with a shorter delay of 30ms or none at all, the lock occurs.

On our side, we have several places where we attempt to send messages to the websocket. We use our own semaphore to lock before calling esp_websocket_client_send_text. Despite having a timeout of more than 10 seconds, we still encounter timeouts while trying to send messages. We have split the process into several timeouts and retries without success.
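
In outline, our sender side looks like this (a simplified sketch of the pattern with illustrative names, not our production code):

#include <stdbool.h>
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "esp_websocket_client.h"

// Simplified sketch: our own mutex around esp_websocket_client_send_text(),
// with a generous timeout and a few retries. s_tx_lock is created elsewhere
// with xSemaphoreCreateMutex().
static SemaphoreHandle_t s_tx_lock;

static bool ws_send_with_retry(esp_websocket_client_handle_t client,
                               const char *data, int len)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        if (xSemaphoreTake(s_tx_lock, pdMS_TO_TICKS(10000)) != pdTRUE) {
            continue;   // could not even take our own lock, retry
        }
        int sent = esp_websocket_client_send_text(client, data, len,
                                                  pdMS_TO_TICKS(10000));
        xSemaphoreGive(s_tx_lock);
        if (sent >= 0) {
            return true;
        }
        // The send timed out inside the client: its internal mutex was still
        // held by esp_websocket_client_task.
    }
    return false;
}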

We moved from simple retries to identifying where the issue could occur in the esp_websocket_client_send_text routine. After analyzing this function, we found that the lock was being acquired in the esp_websocket_client_send_with_exact_opcode function, which led us to the following code:

if (xSemaphoreTakeRecursive(client->lock, timeout) != pdPASS) {
    ESP_LOGE(TAG, "Could not lock ws-client within %" PRIu32 " timeout", timeout);
    return -1;
}

The problem was that the lock was being acquired inside the internal task loop. The task was running so fast that the lock was re-acquired immediately after being released by the same routine.

The solution we found so far was to add a vTaskDelay to solve the sequential lock issue caused by the internal task.

filzek avatar Aug 05 '24 14:08 filzek

@filzek thanks for the clarification. The first point is that this mutex exists to allow the user-facing API to be called from multiple tasks without the user having to deal with the locks themselves. Have you tried adjusting the priority of the websocket client task, allowing your task to acquire the mutex when it is ready? From your description I believe that your extra locking is unnecessary if it is only used to access the websocket client.
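
For reference, the client task priority and stack can be set through the client configuration when initializing it; a sketch below (field names should be double-checked against the esp_websocket_client version in use):

#include "esp_websocket_client.h"

// Sketch: giving the websocket client task a lower priority than the sending
// task, so the sender can preempt it as soon as the mutex is released.
static esp_websocket_client_handle_t start_client(void)
{
    esp_websocket_client_config_t cfg = {
        .uri = "wss://example.com/ws",
        .task_prio  = 5,      // websocket client task priority
        .task_stack = 4096,   // websocket client task stack size
    };
    return esp_websocket_client_init(&cfg);
}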

I'll still look into the code to find any concurrency problems in the design.

euripedesrocha avatar Aug 06 '24 07:08 euripedesrocha

Hi, I'm encountering the same issue. My goal is to transfer two-way audio streaming in real time, but in practice I almost have to receive all incoming data before I can send any.

ho-229 avatar Sep 08 '24 16:09 ho-229

@euripedesrocha

Both scenarios show the same issue: the semaphore gets stuck for much more than 10 seconds. Something is happening with the transaction inside the websocket, so it seems this is not just a concurrency lock, but some logic that holds the semaphore for a long time.

Tasks set with priority 10, websocket with 20: same issue.

Tasks set with priority 20, websocket with 10: same issue.

filzek avatar Oct 14 '24 13:10 filzek

I've seen this also when trying to send a small message while downloading ~100 KB. In my case, the websocket RX thread sends messages to the main thread through a FreeRTOS message queue, but sometimes the main thread needs to write to the websocket itself. If the main thread tries esp_websocket_client_send_text(), it is blocked from doing so until the websocket RX thread releases the lock. In our system the main thread is responsible for processing messages from the websocket RX thread, so this causes a deadlock (at least until the queue times out and drops).
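
In outline, the arrangement looks like this (a simplified sketch of our setup, with illustrative names):

#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_event.h"
#include "esp_websocket_client.h"

// Sketch: the websocket event handler (running in the websocket RX task)
// forwards received data to the main task through a queue; the main task
// also calls esp_websocket_client_send_text() itself.
static QueueHandle_t s_rx_queue;   // created with xQueueCreate() at startup

static void ws_event_handler(void *arg, esp_event_base_t base,
                             int32_t event_id, void *event_data)
{
    if (event_id == WEBSOCKET_EVENT_DATA) {
        esp_websocket_event_data_t *data = (esp_websocket_event_data_t *)event_data;
        // Real code would copy the payload; a pointer is queued here only to
        // keep the sketch short. This send blocks when the queue is full,
        // while the RX task is still holding the client's internal lock.
        xQueueSend(s_rx_queue, &data, pdMS_TO_TICKS(5000));
    }
}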

General topic: websocket RX thread should not block other threads from websocket TX for too long.

Idea 1 (single mutex): the websocket RX thread could release client->lock during WEBSOCKET_EVENT_DATA callback processing. This would allow @filzek to add a vTaskDelay(100); into their handler, and my WEBSOCKET_EVENT_DATA handler writing to a FreeRTOS queue would block similarly only when the queue is full. Both would allow another thread to call esp_websocket_client_send_text() in this period.
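
In pseudocode, the client task could do something like this (a sketch of the idea only, not a patch):

// Idea 1 sketch: drop client->lock while dispatching WEBSOCKET_EVENT_DATA,
// so senders can run during callback processing, then re-take it afterwards.
if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
    break;
}
// ... read the incoming frame into a local buffer while holding the lock ...
xSemaphoreGiveRecursive(client->lock);   // release before the user callback
// ... dispatch WEBSOCKET_EVENT_DATA to the registered handler here ...
if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
    break;                               // re-take before touching client state
}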

Idea 2 (multi-mutex): a mutex for writes, a mutex for reads, and a mutex for state. If a reading or writing thread needs to update the websocket state, it also has to take the state mutex.
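
As a rough sketch of the locking structure (names are illustrative only):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

// Idea 2 sketch: separate mutexes for TX, RX, and shared client state.
// A reading or writing thread that needs to change the connection state
// additionally takes state_lock.
typedef struct {
    SemaphoreHandle_t tx_lock;      // held while sending frames
    SemaphoreHandle_t rx_lock;      // held while receiving frames
    SemaphoreHandle_t state_lock;   // held while reading/updating client state
} ws_client_locks_t;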

bryghtlabs-richard avatar Oct 30 '24 21:10 bryghtlabs-richard

A gross workaround: ensure any threads calling esp_websocket_client_send_text() are bound to the same core as the websocket client thread, and at a higher priority than the websocket client thread. This way the writer preempts the client reader thread as soon as it unlocks the mutex.
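
For example (a sketch of the workaround; core number, priority, and sender_task are illustrative):

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define WS_CORE        0     // core the websocket client task is pinned to
#define WS_CLIENT_PRIO 5     // priority given to the websocket client task (cfg.task_prio)

static void sender_task(void *arg);   // illustrative task that calls esp_websocket_client_send_text()

static void start_sender(void)
{
    // Pin the sending task to the same core as the websocket client task, at
    // a strictly higher priority, so it preempts the client as soon as the
    // mutex is released.
    xTaskCreatePinnedToCore(sender_task, "ws_sender", 4096, NULL,
                            WS_CLIENT_PRIO + 1, NULL, WS_CORE);
}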

bryghtlabs-richard avatar Oct 30 '24 22:10 bryghtlabs-richard

@euripedesrocha , I've extended the esp_websocket client example to demonstrate this problem: https://github.com/bryghtlabs-richard/esp-protocols/tree/issue625demonstrator

It's a little silly because it uses the echo server to trigger it, but I hope the general problem of mutex starvation between reader and writer threads in esp_websocket_client is clear. In my linked example all writer data should fit in the default TCP window, yet it is still blocked, often for many seconds, by reader-thread processing.

If the writer thread did not vTaskDelay(100 / portTICK_PERIOD_MS) between packets, it's also possible for writers to starve the reader thread.

bryghtlabs-richard avatar Nov 01 '24 22:11 bryghtlabs-richard

I checked with the FreeRTOS forum; there's currently nothing like a fair mutex that we could use here to limit thread starvation.

bryghtlabs-richard avatar Nov 22 '24 14:11 bryghtlabs-richard