Multiple clients crash the Bluetooth stack: Bluetooth: hci0: Opcode 0x200e failed: -16
- raspberry pi 4B
- debian bookworm
- bleak 0.20.2-1 from the python3-bleak package
- bluetoothctl: 5.66
Description
I have a simple Python script that polls a large number of devices (50 to 100). I want to speed up the process by not serializing all the connects + polls + disconnects; see below.
The result is that the Bluetooth stack on my rpi crashes: clients no longer disconnect and scanning never terminates.
kernel logs show:
[ 193.752537] Bluetooth: hci0: Opcode 0x200e failed: -16
[ 205.529410] Bluetooth: hci0: Opcode 0x200e failed: -16
[ 210.648622] Bluetooth: hci0: Opcode 0x200e failed: -16
Powering the adapter off/on does not work either:
[Biro 07]# power off
[Biro 07]# power on
Failed to set power on: org.bluez.Error.Busy
Also tested with an external USB Bluetooth dongle, same results.
What I Did
The procedure I have now is:
while true:
    start scanning
    scan for 30 seconds, collect devices in callback
    stop scanning
    for 1 minute:
        pick the oldest polled device
        await connect
        await read data
        await disconnect
This works pretty much OK but is rather slow, as there is only one active client connection at a time.
I understand that doing multiple connects in parallel is not recommended (yes, I have looked at the two-devices example), but having multiple connections established should be fine. This is what I changed my code to: effectively this serializes the connect()s but allows the polling and disconnect to run asynchronously:
while true:
    start scanning
    scan for 30 seconds, collect devices in callback
    stop scanning
    for 1 minute:
        pick the oldest polled device
        await connect
        create async task that does:
            await read_data
            await disconnect
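To make the intended change concrete, here is a minimal asyncio sketch of this revised procedure, with a stub standing in for BleakClient (StubClient, poll_one, and read_then_disconnect are hypothetical names, not bleak API):

```python
import asyncio

# A stub standing in for BleakClient; the method names mirror the
# pseudocode above, not the real bleak API.
class StubClient:
    async def connect(self):
        await asyncio.sleep(0.01)

    async def read_data(self):
        await asyncio.sleep(0.05)
        return b"data"

    async def disconnect(self):
        await asyncio.sleep(0.01)

async def poll_one(client, results):
    # the caller awaits this, so connects stay serialized
    await client.connect()

    async def read_then_disconnect():
        results.append(await client.read_data())
        await client.disconnect()

    # read + disconnect run concurrently with the next connect
    return asyncio.create_task(read_then_disconnect())

async def main():
    results = []
    tasks = []
    for _ in range(3):
        # each connect finishes before the next one begins
        tasks.append(await poll_one(StubClient(), results))
    await asyncio.gather(*tasks)  # wait for reads/disconnects to drain
    return results

print(len(asyncio.run(main())))  # → 3
```

The key point of the sketch is that `connect()` is awaited directly in the loop while only the read/disconnect tail is handed to a background task.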
As far as I know, this should abide by all the rules:
- do not connect while scanning
- do not perform connects in parallel
I also tried to postpone the scanning until all the active clients have disconnected, but to no avail.
A minimal workable example is not easy, since a large number of specific devices is involved.
Logs
BLEAK_LOGGING=1 log attached (sorry for the newlines, this was copy/pasted from a scrollback buffer)
Code
Below is the code as I run it now, albeit stripped of some error handling and the protocol details that are not relevant:
#!/usr/bin/python
from bleak import BleakScanner, BleakClient
import logging
import contextlib
import traceback
import os
import asyncio
import signal
import sys
import json
import time
import struct
from time import sleep

SERVICE_UUID = "0000ffe0-0000-1000-8000-00805f9b34fb"
CHAR_HANDLE = "0000ffe1-0000-1000-8000-00805f9b34fb"
COMMAND_DEVICE_INFO = 0x00  # placeholder; the real protocol constant was stripped from this example

class JK:
    def __init__(self, scanner, dev, address, name):
        self.scanner = scanner
        self.dev = dev
        self.client = BleakClient(self.dev)
        self.address = address
        self.name = name
        self.t_last_poll = time.time() - 60 * 10

    async def reg_wr(self, add, vals: bytearray, length: int):
        # write something to device
        pass

    def ncallback(self, sender: int, data: bytearray):
        # handle data
        pass

    async def read(self):
        await self.client.start_notify(CHAR_HANDLE, self.ncallback)
        await self.reg_wr(COMMAND_DEVICE_INFO, b"\0\0\0\0", 0x00)
        self.poll_ready = False
        while not self.poll_ready:
            await asyncio.sleep(0.1)
        await self.client.disconnect()
        self.scanner.decref()

    async def poll(self):
        self.scanner.incref()
        client = self.client
        await client.connect()
        # Awaiting works fine:
        # await self.read()
        # But running it in an async task does not:
        asyncio.create_task(self.read())

class JKviewer:
    def __init__(self):
        self.devices = {}
        self.npolling = 0
        self.busy = None

    def incref(self):
        self.npolling += 1

    def decref(self):
        self.npolling -= 1

    async def poll(self):
        # Find the JK with the oldest t_last_poll
        jk = None
        t_oldest = time.time()
        for address in self.devices.keys():
            jk2 = self.devices[address]
            if jk2.t_last_poll < t_oldest:
                t_oldest = jk2.t_last_poll
                jk = jk2
        if jk is None:
            print("nothing to do")
            return
        self.busy = jk.name
        # Poll oldest JK
        async def aux():
            jk.t_last_poll = time.time()
            await jk.poll()
            self.busy = None
        asyncio.create_task(aux())

    async def start(self):
        def on_scan(dev, data):
            if SERVICE_UUID in data.service_uuids:
                if dev.address not in self.devices:
                    jk = JK(self, dev, dev.address, dev.name)
                    self.devices[dev.address] = jk
            if dev.address in self.devices:
                self.devices[dev.address].t_last_seen = time.time()

        while True:
            # scan for a few seconds
            print("scanning")
            async with BleakScanner(on_scan) as scanner:
                await asyncio.sleep(3)
            # work the devices
            print("working")
            for _ in range(20):
                for _ in range(10):
                    await self.poll()
                    await asyncio.sleep(0.1)
            # wait for all the connections to be gone
            print("waiting")
            while self.npolling > 0:
                await asyncio.sleep(0.5)

jkviewer = JKviewer()
asyncio.run(jkviewer.start())
# vi: ft=python ts=4 sw=4 et
If things are going wrong in the kernel, there isn't much we can do in Bleak. I would suggest logging Bluetooth traffic with btmon and reporting the issue to the folks that wrote the kernel driver for whichever bluetooth adapter you are using.
As far as I know, this should abide by all the rules:
- do not connect while scanning
- do not perform connects in parallel
Where do these rules come from?
do not perform connects in parallel
Is this covered by always using await with client.connect()?
In your code it is possible that the event loop runs await client1.connect() followed by await client2.connect(), with each operation in a different task.
Where do these rules come from?
I think I gathered these from various comments and examples throughout the project. Please correct me if any of them are wrong.
Is this covered by always using await client.connect()?
Yes. As far as I understand my own code, await client1.connect() needs to finish before await client2.connect() can ever happen; the main poller keeps track of any in-progress connect with the self.busy flag. Only after await client1.connect() finishes will JK.poll return, and another client connect can be tried.
Anyway, I have abandoned any attempts to parallelize my polling, as I'm already having a hard time getting this to work reliably in a strictly serial way; see issue #1507...
What does self.busy do? It is only set to None in your snippet.
Your snippet does the following.
start = [await for poll] -> add aux to event loop -> repeat start
aux = [await for poll] -> await connect -> add read to event loop
read = [await notify, await write, await disconnect]
It's possible for aux to be in the event loop more than once, which means that multiple tasks can be waiting on await connect, and after they all connect, multiple connections can be open at the same time.
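The overlap is easy to demonstrate with plain asyncio (connect here is a stand-in counter, not a bleak call): two tasks scheduled back to back can both be parked inside the same awaited operation.

```python
import asyncio

in_connect = 0
max_concurrent = 0

async def connect():
    # counts how many tasks are inside "connect" at the same time
    global in_connect, max_concurrent
    in_connect += 1
    max_concurrent = max(max_concurrent, in_connect)
    await asyncio.sleep(0.05)  # yields to the event loop mid-connect
    in_connect -= 1

async def aux():
    await connect()

async def main():
    # two aux tasks scheduled back to back, as in the snippet above
    await asyncio.gather(aux(), aux())
    return max_concurrent

print(asyncio.run(main()))  # → 2: both tasks sat inside connect() together
```

Nothing in `await` itself serializes the two calls; only the structure of the calling code (or a lock) can do that.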
I apologize, I minimized my code to share the example in this issue and accidentally took out too much code; the busy flag was set to a value when creating the client poll task. I updated the above code.
Like I said, I should have simplified this example even more; the problem also occurs without running the asyncio task from the JKviewer.poll() function.
Having said all this: I'm still sure that no abuse or wrong API usage from userspace should be able to leave the kernel/driver/firmware confused beyond repair, but that's another story.
Thanks,
self.busy will show the name of the current device, but it can also be set multiple times without ever being reset to None: it can be set to a name while it already holds another name.
the problem also occurs without running the asyncio task from the JKviewer.poll() function.
Can you modify your code snippet to be the minimal example that shows the problem?
FYI, calling asyncio.create_task() in the poll() methods without keeping a handle to the returned task will allow the task to be garbage collected before the task completes. See https://docs.python.org/3.12/library/asyncio-task.html#asyncio.create_task
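The pattern the linked docs recommend keeps a strong reference to each task until it completes; a small sketch (spawn is a hypothetical helper name):

```python
import asyncio

background_tasks = set()

def spawn(coro):
    # keep a strong reference so the task cannot be garbage collected
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # drop the reference once the task finishes
    task.add_done_callback(background_tasks.discard)
    return task

async def work(n):
    await asyncio.sleep(0.01)
    return n

async def main():
    tasks = [spawn(work(i)) for i in range(3)]
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))  # → [0, 1, 2]
```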
Polling with await asyncio.sleep(0.1) also kind of defeats the purpose of asyncio. There are better alternatives like asyncio events, futures and queues depending on what you are trying to do.
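For instance, the poll_ready flag in the snippet above could become an asyncio.Event, so the reader waits for a wakeup instead of sleeping in a loop (a sketch with the bleak calls elided):

```python
import asyncio

class Poller:
    def __init__(self):
        self.poll_ready = asyncio.Event()

    def ncallback(self, data):
        # the notification handler signals completion instead of
        # flipping a flag that another coroutine polls
        self.poll_ready.set()

    async def read(self):
        self.poll_ready.clear()
        # ... start_notify / write request would go here ...
        await self.poll_ready.wait()  # no sleep(0.1) busy loop
        return "done"

async def main():
    p = Poller()
    # simulate a notification arriving shortly after read() starts
    asyncio.get_running_loop().call_later(0.05, p.ncallback, b"\x01")
    return await p.read()

print(asyncio.run(main()))  # → done
```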
FYI, calling asyncio.create_task() in the poll() methods without keeping a handle to the returned task will allow the task to be garbage collected before the task completes.
Right, I'll have to take care of that.
Polling with await asyncio.sleep(0.1) also kind of defeats the purpose of asyncio. There are better alternatives like asyncio events, futures and queues, depending on what you are trying to do.
Sure, I'm aware of that - my example is just lousy.
Shall I close this issue and reopen if/when I reproduce with a simpler and more compact program?
Shall I close this issue and reopen if/when I reproduce with a simpler and more compact program?
Sounds good. So far, this sounds more like a hardware issue than a Bleak issue.
Is it a problem if multiple tasks await client.connect()?
For example, event loop runs task1 which runs await client1.connect() which yields to the event loop and then the event loop runs tasks2 which runs await client2.connect()?
Is it a problem if multiple tasks await client.connect()?
Yes, it seems that is a reliable way to trigger this problem. My example code was messy, though, so if I find the time to get this running again I'll come up with a simpler snippet.
Closed for now.
Should that work though?
Why doesn't it work? Because the backend can't handle multiple connection requests at the same time???
If that's the case, can BleakClient ensure only one connect is requested to the backend at a time?
How do backends handle it when multiple processes call connect at the same time?
I'm not sure what causes the limitation, but the "two devices" example jumps through hoops to make sure only one connect() is running at a time:
https://github.com/hbldh/bleak/blob/develop/examples/two_devices.py#L40-L41
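The hoop in question is a single asyncio.Lock held across each connect; a stripped-down illustration with a stand-in connect (no bleak calls):

```python
import asyncio

async def connect(lock, state):
    # stand-in for client.connect(); records concurrent attempts
    async with lock:  # the same trick as in two_devices.py
        state["active"] += 1
        state["max"] = max(state["max"], state["active"])
        await asyncio.sleep(0.02)
        state["active"] -= 1

async def main():
    lock = asyncio.Lock()  # one lock shared by every connect
    state = {"active": 0, "max": 0}
    await asyncio.gather(*(connect(lock, state) for _ in range(5)))
    return state["max"]

print(asyncio.run(main()))  # → 1: connects never overlap
```

Note this only serializes connects within one process; another process using the same adapter is not covered by the lock.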
Why doesn't it work? Because the backend can't handle multiple connection requests at the same time???
It can be an OS limitation like BlueZ returning "operation already in progress" errors. Or it can be a hardware limitation like the issue here.
If that's the case, can BleakClient ensure only one connect is requested to the backend at a time?
I would rather leave it up to the user in case they want a different behavior, like getting a busy error instead of just waiting an indefinite amount of time. If we changed it now, I think it would break a lot of people's code that depends on the current behavior.
How do backends handle it when multiple processes call connect at the same time?
They probably don't handle it well, since they were never really designed to do that (it often doesn't work at the OS level anyway).
Thanks for explaining and sorry if I missed the documentation on this.
like getting a busy error
Is that what happens? Why not handle a busy error instead of using the lock in the two_devices.py example?
The busy error is specific to the BlueZ backend. The lock works cross-platform - and avoids the error in the first place.
The busy error is specific to the BlueZ backend. The lock works cross-platform - and avoids the error in the first place.
It only avoids the error if another process doesn't use Bluetooth.
If we changed it now, I think it would break a lot of people's code that depend on the current behavior.
There could be a new class with the new behaviour, and by using the new class you opt in to that behaviour.
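For illustration, that opt-in could be as small as a wrapper that shares one lock across all instances (SerializedClient is a hypothetical name, not something bleak provides):

```python
import asyncio

class SerializedClient:
    # Hypothetical opt-in wrapper, not part of bleak: all instances
    # share one lock, so only one connect() runs at a time.
    _lock = None

    def __init__(self, inner):
        self._inner = inner

    async def connect(self):
        # create the shared lock lazily, inside a running event loop
        if SerializedClient._lock is None:
            SerializedClient._lock = asyncio.Lock()
        async with SerializedClient._lock:
            await self._inner.connect()

    def __getattr__(self, name):
        # everything else passes straight through to the wrapped client
        return getattr(self._inner, name)

class StubClient:
    async def connect(self):
        await asyncio.sleep(0.01)

async def main():
    clients = [SerializedClient(StubClient()) for _ in range(3)]
    await asyncio.gather(*(c.connect() for c in clients))
    return "ok"

print(asyncio.run(main()))  # → ok
```

Existing code keeps the old behavior by continuing to use BleakClient directly.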