WiFi connect fails permanently after reboot (IDFGH-12600)

Open tomasznowik opened this issue 1 year ago • 4 comments

Answers checklist.

  • [X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • [X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • [X] I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2.1

Espressif SoC revision.

ESP32

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

esp32-wroom-32

Power Supply used.

USB

What is the expected behavior?

WiFi connect always works, even if the device was reset with esp_restart() or esp_system_abort("") many times.

What is the actual behavior?

As reported in https://github.com/espressif/esp-idf/issues/11060, the WiFi stack can hang in a state that makes a WiFi connection impossible. I've managed to reproduce the problem using the bleprph_wifi_coex example with the following modifications:

  1. Turned on BLE scanning
  2. CONFIG_ESP_WIFI_TASK_PINNED_TO_CORE_1=y
  3. Add esp_system_abort 15 seconds after WiFi connection is established.

Points 1 & 2 are necessary. In the real system point 3 happens by chance.

After a few hours, or sometimes overnight, the device is no longer able to connect to WiFi. In such cases WiFi reconfiguration always yields: sw txq[0] state(1) is not idle, potential error!

Steps to reproduce.

Reset the device with esp_restart() or esp_system_abort("") many times, or just wait until WiFi connect hangs by chance.

Debug Logs.

W None | I (8878) wifi_prph_coex: retry to connect to the AP
 W None | I (8878) wifi_prph_coex: connect to the AP fail
 I 8888 | wifi             new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
 I 8898 | wifi             state: init -> auth (b0)
 I 9898 | wifi             state: auth -> init (200)
 I 9898 | wifi             new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
 W None | I (9898) wifi_prph_coex: retry to connect to the AP
 W None | I (9898) wifi_prph_coex: connect to the AP fail
 W 13288 | wifi             m f probe req l=0
 W None | I (13288) wifi_prph_coex: retry to connect to the AP
 W None | I (13288) wifi_prph_coex: connect to the AP fail
 I 13298 | wifi             new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
 I 13298 | wifi             state: init -> auth (b0)
 I 14298 | wifi             state: auth -> init (200)
 I 14298 | wifi             new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
 W None | I (14298) wifi_prph_coex: Retries failed. Reconfiguring wifi
 E 14308 | wifi             NAN WiFi stop
 W 19308 | wifi             TX Q not empty: 500, TXQ_BLOCK=17ff
 W 19308 | wifi             force witi stop
 I 19308 | wifi             flush txq
 I 19308 | wifi             stop sw txq
 I 19308 | wifi             lmac stop hw txq
 W 19308 | wifi             sw txq[0] state(1) is not idle, potential error!
 I 19318 | wifi             mode : sta (c8:f0:9e:4e:10:fc)
 I 19318 | wifi             enable tsf
 W None | I (19318) wifi_prph_coex: wifi_configure finished.
 W None | I (19328) wifi_prph_coex: connect to the AP fail

More Information.

Calling esp_restart() or esp_system_abort("") doesn't help. A power cycle, a reset with the button, or a hard_reset with esptool.py always helps.

Disabling CONFIG_ESP_PHY_CALIBRATION_AND_DATA_STORAGE or enabling CONFIG_ESP_PHY_RF_CAL_FULL doesn't change the behaviour. The same issue happens with bluedroid. The issue exists on v4.2.4, v5.1.2 and v5.2.1. Example code and logs are attached. I've removed the ping feature from the code to make it clearer.

example_code.zip sample_log.zip

tomasznowik avatar Apr 12 '24 05:04 tomasznowik

Hi @tomasznowik, thanks for your report and project!
We are able to reproduce your issue here and are looking into it. We will keep you updated ASAP.

Espressif-liuuuu avatar Apr 18 '24 07:04 Espressif-liuuuu

Hi @tomasznowik, could you please double-check whether the issue exists on v5.2.1 when you call esp_restart instead of esp_system_abort?

I saw you mentioned that it did not help, but on my side the issue definitely occurs if esp_system_abort is called, while it is gone if esp_restart is called instead. I've tested for 3 days and everything seems OK with esp_restart.

Actually, these two APIs behave differently. esp_wifi_stop is called when calling esp_restart, which is essential for a safe reboot.
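
A minimal sketch of that idea for code paths that abort on purpose: stop Wi-Fi first, then call esp_system_abort. The helper name app_abort_with_wifi_stop is hypothetical, and the #else branch holds host-side stand-ins (not ESP-IDF code) so the snippet can also be compiled and exercised off-target. It does not cover aborts raised inside the framework or by a hard fault.

```c
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_system.h"
#include "esp_wifi.h"
#else
/* Host-side stand-ins (assumption, for off-target builds only):
 * they just record the order in which the two calls happen. */
typedef int esp_err_t;
#define ESP_OK 0
static int wifi_stop_calls;
static int abort_calls;
static int abort_saw_wifi_stopped;

static esp_err_t esp_wifi_stop(void)
{
    wifi_stop_calls++;
    return ESP_OK;
}

static void esp_system_abort(const char *details)
{
    (void)details;
    abort_calls++;
    abort_saw_wifi_stopped = (wifi_stop_calls > 0);
}
#endif

/* Hypothetical helper: stop Wi-Fi before a deliberate abort, mirroring
 * what esp_restart is described as doing in this thread. */
void app_abort_with_wifi_stop(const char *details)
{
    esp_err_t err = esp_wifi_stop();
    if (err != ESP_OK) {
        /* Wi-Fi may not be initialized or started; report it and abort anyway. */
        printf("esp_wifi_stop failed: 0x%x, aborting anyway\n", (unsigned)err);
    }
    esp_system_abort(details);
}
```

Application code would then call app_abort_with_wifi_stop("reason") wherever it previously called esp_system_abort directly.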

Espressif-liuuuu avatar Apr 22 '24 11:04 Espressif-liuuuu

Hi @Espressif-liuuuu, I started a test yesterday evening and so far it looks good. We used esp_restart on v4.2.4 before switching to esp_system_abort for this very reason.

But note that due to unknown bugs or issues in user code or the framework, an abort may happen anyway from time to time. Is there any way to recover from this error state?

tomasznowik avatar Apr 23 '24 05:04 tomasznowik

Hi @Espressif-liuuuu, I confirm that calling stop_wifi before esp_system_abort prevents WiFi issues in the long term. But please provide a workaround in case an abort or hard fault happens and it hangs WiFi.

I found that calling ble_gap_disc to start BLE scanning when WiFi is hung sometimes (or after some number of tries) makes WiFi work again.

tomasznowik avatar Apr 30 '24 09:04 tomasznowik

Yes, for sure. We are focusing on whether some registers were not reset in that case.

Espressif-liuuuu avatar May 08 '24 06:05 Espressif-liuuuu

@Espressif-liuuuu Any update about this issue?

Hi, not yet. It's still under test and discussion. We will keep this issue updated.

Espressif-liuuuu avatar Jun 06 '24 08:06 Espressif-liuuuu

Hi @tomasznowik, we finally found the root cause of the issue. Here is the result.

There are several essential conditions to trigger the issue:

  1. The reset is SW_CPU_RESET. Any reset that includes a digital reset won't lead to the issue.
  2. BLE must be scanning before the reset and must NOT be scanning after the reset. If the BLE scan starts immediately instead of after Wi-Fi is connected, the issue is gone as well.
  3. Only ESP32 is affected.

The root cause is a pair of digital IO operations performed during coexistence switching while Wi-Fi coexists with a BLE scan. When the issue happens, the software reset occurs after the first operation is done but before the second operation (the restore) is executed. After the reset there is no further chance to execute the second operation, which leaves Wi-Fi Tx blocked.

To verify the fix, based on v5.2.1, you can replace the libs in IDF with the ones in fw_update.zip. There should be no issue on SW_CPU_RESET with these libs. You need to:

  1. copy fw_update/esp_wifi/esp32 into $IDF_PATH/components/esp_wifi/lib
  2. copy fw_update/esp_coex/esp32 into $IDF_PATH/components/esp_coex/lib
  3. run idf.py fullclean
  4. rebuild and test

Please check the initialization log and search for the firmware information below to make sure the libs were updated correctly:

wifi firmware version: 51e0778dc
coex firmware version: 3b39fc607
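
That check can be automated on a captured boot log. The sketch below is only an illustration: the function names are hypothetical, and it assumes the log may print an abbreviated hash, so it accepts any printed hash of at least 7 hex digits that agrees with the expected value over their common length.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Version strings quoted in the comment above. */
#define EXPECTED_WIFI_FW "51e0778dc"
#define EXPECTED_COEX_FW "3b39fc607"
#define MIN_HASH_LEN 7 /* assumption: shortest abbreviation worth trusting */

/* Finds the first occurrence of label in log and compares the hex hash
 * printed right after it with expected, over their common length. */
static bool hash_matches(const char *log, const char *label, const char *expected)
{
    const char *p = strstr(log, label);
    if (p == NULL) {
        return false;
    }
    p += strlen(label);

    size_t printed = 0;
    while (isxdigit((unsigned char)p[printed])) {
        printed++;
    }
    if (printed < MIN_HASH_LEN) {
        return false;
    }

    size_t wanted = strlen(expected);
    size_t common = printed < wanted ? printed : wanted;
    return strncmp(p, expected, common) == 0;
}

/* Returns true only if both patched libs are reported in the boot log. */
bool boot_log_has_patched_libs(const char *boot_log)
{
    if (boot_log == NULL) {
        return false;
    }
    return hash_matches(boot_log, "wifi firmware version: ", EXPECTED_WIFI_FW) &&
           hash_matches(boot_log, "coex firmware version: ", EXPECTED_COEX_FW);
}
```

Only the first occurrence of each label is inspected, which is enough for a single boot captured from reset.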

Furthermore, we will merge this fix to master-v5.0.

Espressif-liuuuu avatar Jun 25 '24 06:06 Espressif-liuuuu

Furthermore, we will merge this fix to master-v5.0.

@Espressif-liuuuu The fix is not yet available in v5.0 ~ v5.2.

AxelLin avatar Aug 20 '24 05:08 AxelLin

v5.2: 36e0c4898
v5.1: d049d69a
v5.0: a2434a844

Sorry for the missing commit information; the fix is described in the release notes.

Espressif-liuuuu avatar Aug 20 '24 06:08 Espressif-liuuuu