esp-idf icon indicating copy to clipboard operation
esp-idf copied to clipboard

HCI: Create Connection request is not properly terminated which blocks BlueZ from reconnecting (IDFGH-6082)

Open danergo opened this issue 4 years ago • 60 comments

Environment

Development Kit: NodeMCU-ESP-32S Module or chip used: ESP-WROOM-32 IDF version (run git describe --tags to find it): v4.4-dev-2533-g83956ebbae Build System: idf.py Compiler version (run xtensa-esp32-elf-gcc --version to find it): xtensa-esp32-elf-gcc (crosstool-NG esp-2021r1) 8.4.0 Operating System: Linux Using an IDE?: No Power Supply: external 5V

Problem Description

I'm using ESP32 as a Bluetooth controller with the ESP-IDF provided hciuart sample: https://github.com/espressif/esp-idf/tree/master/examples/bluetooth/hci/controller_hci_uart_esp32, and BlueZ as host.

Whenever my application tries to establish a connection by sending a Create Connection, and I abort it early by sending a Create Connection Cancel HCI request, ESP32 won't respond to the Create Connection, which blocks BlueZ from sending out any other new Create Connection (to the same peer).

Only way to escape from this is to completely reset the HCI device.

Expected Behavior && Actual Behavior && Steps to reproduce

I have collected logs from a correctly working (Cypress) device to make it easier for you to fix this. I'm using here a fake Bluetooth peer address to make 100% sure it is not pingable and for also making sure this problem doesn't depend at all on any particular peer device.

1st image demonstrates the Page Timeout in case I don't early-abort the connection trial. Left side is from Cypress, right side is ESP32: image

2nd image demonstrates the early-aborted connection trial after the 1st Page Timeout, without resetting the HCI device. Still left side is Cypress, right is ESP32. Problem starts manifesting here: while Cypress responds to the Create Connection Cancel request with also sending "Connect Complete", ESP32 does not completes the connection which confuses BlueZ (will think ESP is still trying to connect): image

3rd image demonstrates the complete mess, because you haven't closed the earlier connection by responding Connect Complete, BlueZ doesn't send the new Create Connection request at all (It still "thinks" ESP is trying to fulfill the first request): image

Due to this, whenever you try connecting to a device which accidentally or intentionally having its Bluetooth disabled, you will not be able to connect to it even after its Bluetooth gets re-enabled.

danergo avatar Oct 24 '21 08:10 danergo

Note: this is the root-cause of https://github.com/espressif/esp-idf/issues/7723.

danergo avatar Oct 24 '21 08:10 danergo

Anybody? Please help me guys. This bug is in your closed software part!

danergo avatar Oct 27 '21 06:10 danergo

Hi @danergo

Thank you very much for providing such detailed analysis. We will further analyze this scenario to confirm whether it is a problem and give you a reply as soon as possible.

xiewenxiang avatar Oct 27 '21 08:10 xiewenxiang

Hi @xiewenxiang: could you check this one, by any chance yet?

Thank you!

danergo avatar Oct 30 '21 18:10 danergo

Hi @danergo, a fix for this issue is done and waiting for approval.

BetterJincheng avatar Nov 01 '21 01:11 BetterJincheng

Hi @BetterJincheng: do you see by any chance how much time will this take? Or how much time an approval takes usually?

Thank you

danergo avatar Nov 09 '21 07:11 danergo

@danergo, it is hard to evaluate... If needed, I can provide you the fixed lib in advance. Let me know

BetterJincheng avatar Nov 09 '21 07:11 BetterJincheng

@BetterJincheng: that sounds perfect, please do that and I can continue the testing. Thank you! You can also send it to [email protected] if that suits better.

danergo avatar Nov 09 '21 09:11 danergo

libbtdm_app.zip @danergo

  1. Replace esp-idf/component/bt/controller/lib_esp32/esp32/libbtdm_app.a with the attachment.
  2. idf.py fullclean (or just rm -rf ./build)

Let me know if it works.

BetterJincheng avatar Nov 09 '21 09:11 BetterJincheng

@BetterJincheng:

Sorry for delays, I just had time to validate your fix now: it seems working, I'll run my usual testing method for a couple of more days and report back.

Thanks a lot!

danergo avatar Nov 14 '21 18:11 danergo

Bad news:

After several days of constant stress-test (HCI LE Advertisement scanning, and some GATT commands, and also some raw classic BT stuff) and ESP32 crashed:

[2021-11-19 16:19:05] Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Core  0 register dump:
[2021-11-19 16:19:05] PC      : 0x400def1d  PS      : 0x00060130  A0      : 0x80049b9c  A1      : 0x3ffcc750
[2021-11-19 16:19:05] A2      : 0x07ffffff  A3      : 0x02fa9bd0  A4      : 0x0000000b  A5      : 0x00000001
[2021-11-19 16:19:05] A6      : 0x3ffcc7b0  A7      : 0x00000000  A8      : 0x3ffc9324  A9      : 0x3ffcc730
[2021-11-19 16:19:05] A10     : 0x0030bb00  A11     : 0x00000001  A12     : 0x00000000  A13     : 0x3ffafd68
[2021-11-19 16:19:05] A14     : 0x00000000  A15     : 0x3ffb8360  SAR     : 0x00000016  EXCCAUSE: 0x0000001c
[2021-11-19 16:19:05] EXCVADDR: 0x00000008  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000
[2021-11-19 16:19:05]
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Backtrace:0x400def1a:0x3ffcc750 0x40049b99:0x3ffcc7b0 0x400457cd:0x3ffcc7e0 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

I decoded the backtrace but it is not too verbose:

0x400def1a:  r_lld_evt_update_create at ??:?
0x40049b99: ?? ??:0
0x400457cd: ?? ??:0
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

That line in port.c: pxCode(pvParameters);

Update After a short while another crash arrived:

[2021-11-19 16:39:25] ASSERT_ERR(lm_env.local_name), in lm.c at line 79
[2021-11-19 16:39:25] Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.
[2021-11-19 16:39:25] Memory dump at 0x4008ee00: f01d020c 00004136 f01d0000
[2021-11-19 16:39:25] Core  0 register dump:
[2021-11-19 16:39:25] PC      : 0x4008ee07  PS      : 0x00060130  A0      : 0x80083725  A1      : 0x3ffcc760
[2021-11-19 16:39:25] A2      : 0x3f40491c  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x3f404831
[2021-11-19 16:39:25] A6      : 0x0000004f  A7      : 0xfffffffd  A8      : 0x8000814b  A9      : 0x3ffcc6d0
[2021-11-19 16:39:25] A10     : 0x00000000  A11     : 0x3ffcc6f2  A12     : 0x3ffcc69f  A13     : 0x00000039
[2021-11-19 16:39:25] A14     : 0x00000000  A15     : 0x3ffcc6a3  SAR     : 0x00000004  EXCCAUSE: 0x00000000
[2021-11-19 16:39:25] EXCVADDR: 0x00000000  LBEG    : 0x40083699  LEND    : 0x400836a1  LCOUNT  : 0x00000000
[2021-11-19 16:39:25]
[2021-11-19 16:39:25]
[2021-11-19 16:39:25] Backtrace:0x4008ee04:0x3ffcc760 0x40083722:0x3ffcc780 0x400e556e:0x3ffcc7a0 0x4002509e:0x3ffcc7c0 0x400da89a:0x3ffcc800 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

Decoded:

0x4008ee04: r_assert at /home/user/esp/esp-idf/components/bt/controller/esp32/bt.c:1991
0x40083722: r_assert_err at ??:?
0x400e556e: r_LM_GetLocalNameSeg at ??:?
0x4002509e: ?? ??:0
0x400da89a: lc_lmp_rx_handler at ??:?
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

Rest is similar to previous.

Questions: 1.) Why are almost all lines not having line numbers? 2.) Can I change anything to have more info from this backtrace? (Or I can share my binary for you in case it helps).

Thank you!

danergo avatar Nov 19 '21 15:11 danergo

Can you have a look at it? It's very annoying and I have to do hciconfig hci0 reset once every 2-3 days constantly.

Thank you.

danergo avatar Nov 24 '21 14:11 danergo

Now ESP32 printed "scan_evt timeout", and I also did HAVE TO do hciconfig hci0 reset in order to use it. It's very unstable.

danergo avatar Nov 24 '21 16:11 danergo

Bad news:

After several days of constant stress-test (HCI LE Advertisement scanning, and some GATT commands, and also some raw classic BT stuff) and ESP32 crashed:

[2021-11-19 16:19:05] Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Core  0 register dump:
[2021-11-19 16:19:05] PC      : 0x400def1d  PS      : 0x00060130  A0      : 0x80049b9c  A1      : 0x3ffcc750
[2021-11-19 16:19:05] A2      : 0x07ffffff  A3      : 0x02fa9bd0  A4      : 0x0000000b  A5      : 0x00000001
[2021-11-19 16:19:05] A6      : 0x3ffcc7b0  A7      : 0x00000000  A8      : 0x3ffc9324  A9      : 0x3ffcc730
[2021-11-19 16:19:05] A10     : 0x0030bb00  A11     : 0x00000001  A12     : 0x00000000  A13     : 0x3ffafd68
[2021-11-19 16:19:05] A14     : 0x00000000  A15     : 0x3ffb8360  SAR     : 0x00000016  EXCCAUSE: 0x0000001c
[2021-11-19 16:19:05] EXCVADDR: 0x00000008  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000
[2021-11-19 16:19:05]
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Backtrace:0x400def1a:0x3ffcc750 0x40049b99:0x3ffcc7b0 0x400457cd:0x3ffcc7e0 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

I decoded the backtrace but it is not too verbose:

0x400def1a:  r_lld_evt_update_create at ??:?
0x40049b99: ?? ??:0
0x400457cd: ?? ??:0
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

That line in port.c: pxCode(pvParameters);

Update After a short while another crash arrived:

[2021-11-19 16:39:25] ASSERT_ERR(lm_env.local_name), in lm.c at line 79
[2021-11-19 16:39:25] Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.
[2021-11-19 16:39:25] Memory dump at 0x4008ee00: f01d020c 00004136 f01d0000
[2021-11-19 16:39:25] Core  0 register dump:
[2021-11-19 16:39:25] PC      : 0x4008ee07  PS      : 0x00060130  A0      : 0x80083725  A1      : 0x3ffcc760
[2021-11-19 16:39:25] A2      : 0x3f40491c  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x3f404831
[2021-11-19 16:39:25] A6      : 0x0000004f  A7      : 0xfffffffd  A8      : 0x8000814b  A9      : 0x3ffcc6d0
[2021-11-19 16:39:25] A10     : 0x00000000  A11     : 0x3ffcc6f2  A12     : 0x3ffcc69f  A13     : 0x00000039
[2021-11-19 16:39:25] A14     : 0x00000000  A15     : 0x3ffcc6a3  SAR     : 0x00000004  EXCCAUSE: 0x00000000
[2021-11-19 16:39:25] EXCVADDR: 0x00000000  LBEG    : 0x40083699  LEND    : 0x400836a1  LCOUNT  : 0x00000000
[2021-11-19 16:39:25]
[2021-11-19 16:39:25]
[2021-11-19 16:39:25] Backtrace:0x4008ee04:0x3ffcc760 0x40083722:0x3ffcc780 0x400e556e:0x3ffcc7a0 0x4002509e:0x3ffcc7c0 0x400da89a:0x3ffcc800 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

Decoded:

0x4008ee04: r_assert at /home/user/esp/esp-idf/components/bt/controller/esp32/bt.c:1991
0x40083722: r_assert_err at ??:?
0x400e556e: r_LM_GetLocalNameSeg at ??:?
0x4002509e: ?? ??:0
0x400da89a: lc_lmp_rx_handler at ??:?
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

Rest is similar to previous.

Questions: 1.) Why are almost all lines not having line numbers? 2.) Can I change anything to have more info from this backtrace? (Or I can share my binary for you in case it helps).

Thank you!

The elf file will do help. please attach here. Thanks

BetterJincheng avatar Nov 29 '21 03:11 BetterJincheng

@BetterJincheng:

Here is it (controller_hci_uart.elf): controller_hci_uart.zip

Thank you.

danergo avatar Nov 29 '21 05:11 danergo

libbtdm_app.zip @danergo Please try this lib to verify whether it fixed the second problem. Thanks

BetterJincheng avatar Nov 29 '21 09:11 BetterJincheng

@BetterJincheng: just to be on the same page, which is the second for you?

The one with "IllegalInstruction"?

Thank you!

danergo avatar Nov 29 '21 13:11 danergo

@BetterJincheng: I have not yet updated the libbtdm_app, but I get another crash, now with backtrace corrupted.

This is the only path which give some lights:

ASSERT_PARAM(2048 0), in rwbt.c at line 381

I hope you can take a look into this also.

danergo avatar Nov 30 '21 12:11 danergo

@BetterJincheng:

With your latest version, I'm getting a lot of error messages in dmesg:

[ 9392.795165] Bluetooth: hci0: command 0x200e tx timeout
[ 9414.395439] Bluetooth: hci0: command 0x200d tx timeout
[ 9420.251516] Bluetooth: hci0: command 0x1003 tx timeout
[ 9422.267543] Bluetooth: hci0: command 0x1001 tx timeout
[ 9422.386796] Bluetooth: hci0: unexpected event for opcode 0x200d
[ 9422.387385] Bluetooth: hci0: unexpected event for opcode 0x0c03
[ 9422.430117] Bluetooth: hci0: hardware error 0x00
[ 9422.432685] Bluetooth: hci0: unexpected event for opcode 0x0c0d
[ 9429.435631] Bluetooth: hci0: command 0x1003 tx timeout
[ 9431.451666] Bluetooth: hci0: command 0x1001 tx timeout
[ 9433.467680] Bluetooth: hci0: command 0x1009 tx timeout
[ 9442.555808] Bluetooth: hci0: command 0x1003 tx timeout
[ 9444.571824] Bluetooth: hci0: command 0x1001 tx timeout
[ 9446.587851] Bluetooth: hci0: command 0x1009 tx timeout
[ 9455.612041] Bluetooth: hci0: command 0x1003 tx timeout
[ 9457.627993] Bluetooth: hci0: command 0x1001 tx timeout
[ 9459.644091] Bluetooth: hci0: command 0x1009 tx timeout
[ 9468.668156] Bluetooth: hci0: command 0x1003 tx timeout
[ 9470.684187] Bluetooth: hci0: command 0x1001 tx timeout
[ 9472.700199] Bluetooth: hci0: command 0x1009 tx timeout
[ 9481.724306] Bluetooth: hci0: command 0x1003 tx timeout
[ 9483.740321] Bluetooth: hci0: command 0x1001 tx timeout
[ 9485.756347] Bluetooth: hci0: command 0x1009 tx timeout
[ 9502.108567] Bluetooth: hci0: command 0x1003 tx timeout
[ 9504.124582] Bluetooth: hci0: command 0x1001 tx timeout
[ 9506.140609] Bluetooth: hci0: command 0x1009 tx timeout
[ 9517.116769] Bluetooth: hci0: command 0x1003 tx timeout
[ 9519.132773] Bluetooth: hci0: command 0x1001 tx timeout
[ 9521.148808] Bluetooth: hci0: command 0x1009 tx timeout

danergo avatar Jan 18 '22 10:01 danergo

Can you please look into it soon? I'm waiting for this fix since last year. Thank you.

danergo avatar Jan 18 '22 10:01 danergo

Ping :)

danergo avatar Jan 29 '22 10:01 danergo

I have more logs now:

With your latest version of library you sent, I got this:

< HCI Command: Create Connection (0x01|0x0005) plen 13                     #4450 [hci0] 11:21:47.231569
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)
        Packet type: 0xcc18
          DM1 may be used
          DH1 may be used
          DM3 may be used
          DH3 may be used
          DM5 may be used
          DH5 may be used
        Page scan repetition mode: R2 (0x02)
        Page scan mode: Mandatory (0x00)
        Clock offset: 0x0000
        Role switch: Allow slave (0x01)
> HCI Event: Command Status (0x0f) plen 4                                  #4451 [hci0] 11:21:47.234813
      Create Connection (0x01|0x0005) ncmd 5
        Status: Success (0x00)
< HCI Command: Create Connection Cancel (0x01|0x0008) plen 6               #4452 [hci0] 11:21:50.236441
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)
> HCI Event: Command Complete (0x0e) plen 10                               #4453 [hci0] 11:21:50.238166
      Create Connection Cancel (0x01|0x0008) ncmd 5
        Status: ACL Connection Already Exists (0x0b)
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)

Now, the problem with this is when I want to cancel the connection in step #4452 (Create Connection Cancel), I got the reply in #4453 (Command Complete), that ACL Connection Already Exists.

This breaks Bluetooth's specification, as if "ACL Connection" is already existing, why haven't you sent any Connect Complete HCI event?

danergo avatar Feb 03 '22 09:02 danergo

It would be great if I can get any feedback or reply to this thread.

danergo avatar Feb 03 '22 09:02 danergo

ping

danergo avatar Feb 05 '22 05:02 danergo

< HCI Command: LE Create Connection (0x08|0x000d) plen 25                       #202072 [hci0] 13:57:10.737769
        Scan interval: 60.000 msec (0x0060)
        Scan window: 60.000 msec (0x0060)
        Filter policy: White list is not used (0x00)
        Peer address type: Public (0x00)
        Peer address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Own address type: Public (0x00)
        Min connection interval: 25.00 msec (0x0014)
        Max connection interval: 25.00 msec (0x0014)
        Connection latency: 0 (0x0000)
        Supervision timeout: 5000 msec (0x01f4)
        Min connection length: 0.000 msec (0x0000)
        Max connection length: 0.000 msec (0x0000)
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0                 #202073 [hci0] 13:57:14.604338
@ MGMT Event: Connect Failed (0x000d) plen 8                                   {0x0002} [hci0] 13:57:20.738171
        LE Address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Status: Not Connected (0x02)

Another violation of Bluetooth standard by your closed source stuff. No response to my "LE Create Connection Cancel".

And only after exactly 10 seconds comes the management event that connection is failed.

1. How can I change this 10 second time?

2. Please attach me a new build with debug log.

danergo avatar Feb 05 '22 13:02 danergo

@BetterJincheng: can you please look into this?

Thank you!

danergo avatar Feb 05 '22 15:02 danergo

< HCI Command: LE Create Connection (0x08|0x000d) plen 25                        #42191 [hci0] 06:35:04.599612
        Scan interval: 60.000 msec (0x0060)
        Scan window: 60.000 msec (0x0060)
        Filter policy: White list is not used (0x00)
        Peer address type: Public (0x00)
        Peer address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Own address type: Public (0x00)
        Min connection interval: 25.00 msec (0x0014)
        Max connection interval: 25.00 msec (0x0014)
        Connection latency: 0 (0x0000)
        Supervision timeout: 5000 msec (0x01f4)
        Min connection length: 0.000 msec (0x0000)
        Max connection length: 0.000 msec (0x0000)
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0                  #42192 [hci0] 06:35:07.272523
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0                  #42193 [hci0] 06:35:12.277259
> HCI Event: Command Status (0x0f) plen 4                                        #42194 [hci0] 06:35:14.598817
      LE Create Connection (0x08|0x000d) ncmd 3
        Status: Success (0x00)
> HCI Event: Command Complete (0x0e) plen 4                                      #42195 [hci0] 06:35:14.599115
      LE Create Connection Cancel (0x08|0x000e) ncmd 4
        Status: Success (0x00)

How can I cancel a(n) (LE) Create Connection before receiving its "Command Complete" event, 10 seconds later? @BetterJincheng: could you please help me?

danergo avatar Feb 06 '22 06:02 danergo

@xiewenxiang: I have provided many more logs for your modified closed-source library.

I haven't received any feedback or support from your side since last November which is now 3 months ago.

This is really annoying and unfair.

danergo avatar Feb 06 '22 06:02 danergo

ping

danergo avatar Feb 07 '22 18:02 danergo

ping

danergo avatar Feb 08 '22 16:02 danergo