Multiple clients crash the Bluetooth stack: Bluetooth: hci0: Opcode 0x200e failed: -16
- raspberry pi 4B
- debian bookworm
- bleak 0.20.2-1 from the python3-bleak package
- bluetoothctl: 5.66
Description
I have a simple Python script that polls a large number of devices (50 to 100). I want to speed up the process by not serializing all the connects + polls + disconnects; see below.
The result is that the Bluetooth stack on my rpi crashes: clients no longer disconnect and scanning never terminates.
kernel logs show:
[ 193.752537] Bluetooth: hci0: Opcode 0x200e failed: -16
[ 205.529410] Bluetooth: hci0: Opcode 0x200e failed: -16
[ 210.648622] Bluetooth: hci0: Opcode 0x200e failed: -16
Powering the adapter off/on does not work either:
[Biro 07]# power off
[Biro 07]# power on
Failed to set power on: org.bluez.Error.Busy
Also tested with an external USB Bluetooth dongle, same results.
What I Did
The procedure I have now is:
while true:
    start scanning
    scan for 30 seconds, collect devices in callback
    stop scanning
    for 1 minute:
        pick the oldest polled device
        await connect
        await read data
        await disconnect
This works pretty much OK but is rather slow, as there is only one active client connection at a time.
I understand that doing multiple connects in parallel is not recommended (yes, I have looked at the two-devices example), but having multiple connections established should be fine. This is what I changed my code to: effectively this serializes the connect()s but allows the polling and disconnect to run asynchronously:
while true:
    start scanning
    scan for 30 seconds, collect devices in callback
    stop scanning
    for 1 minute:
        pick the oldest polled device
        await connect
        create async task that does:
            await read_data
            await disconnect
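To make the intended change concrete, here is a minimal asyncio sketch of this revised procedure, with a stub standing in for BleakClient (StubClient, poll_one, and read_then_disconnect are hypothetical names, not bleak API):

```python
import asyncio

# A stub standing in for BleakClient; the method names mirror the
# pseudocode above, not the real bleak API.
class StubClient:
    async def connect(self):
        await asyncio.sleep(0.01)

    async def read_data(self):
        await asyncio.sleep(0.05)
        return b"data"

    async def disconnect(self):
        await asyncio.sleep(0.01)

async def poll_one(client, results):
    # the caller awaits this, so connects stay serialized
    await client.connect()

    async def read_then_disconnect():
        results.append(await client.read_data())
        await client.disconnect()

    # read + disconnect run concurrently with the next connect
    return asyncio.create_task(read_then_disconnect())

async def main():
    results = []
    tasks = []
    for _ in range(3):
        # each connect finishes before the next one begins
        tasks.append(await poll_one(StubClient(), results))
    await asyncio.gather(*tasks)  # wait for reads/disconnects to drain
    return results

print(len(asyncio.run(main())))  # → 3
```

The key point of the sketch is that `connect()` is awaited directly in the loop while only the read/disconnect tail is handed to a background task.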
As far as I know, this should abide by all the rules:
- do not connect while scanning
- do not perform connects in parallel
I also tried to postpone the scanning until all the active clients have disconnected, but to no avail.
A minimal workable example is not easy, since a large number of specific devices is involved.
Logs
BLEAK_LOGGING=1 log attached (sorry for the newlines, this was copy/pasted from a scrollback buffer)
Code
Below is the code as I run it now, albeit stripped of some error handling and the protocol details that are not relevant:
#!/usr/bin/python
from bleak import BleakScanner, BleakClient
import logging
import contextlib
import traceback
import os
import asyncio
import signal
import sys
import json
import time
import struct
from time import sleep

SERVICE_UUID = "0000ffe0-0000-1000-8000-00805f9b34fb"
CHAR_HANDLE = "0000ffe1-0000-1000-8000-00805f9b34fb"
COMMAND_DEVICE_INFO = 0x00  # placeholder; the real protocol constant was stripped from this example

class JK:
    def __init__(self, scanner, dev, address, name):
        self.scanner = scanner
        self.dev = dev
        self.client = BleakClient(self.dev)
        self.address = address
        self.name = name
        self.t_last_poll = time.time() - 60 * 10

    async def reg_wr(self, add, vals: bytearray, length: int):
        # write something to device
        pass

    def ncallback(self, sender: int, data: bytearray):
        # handle data
        pass

    async def read(self):
        await self.client.start_notify(CHAR_HANDLE, self.ncallback)
        await self.reg_wr(COMMAND_DEVICE_INFO, b"\0\0\0\0", 0x00)
        self.poll_ready = False
        while not self.poll_ready:
            await asyncio.sleep(0.1)
        await self.client.disconnect()
        self.scanner.decref()

    async def poll(self):
        self.scanner.incref()
        client = self.client
        await client.connect()
        # Awaiting works fine:
        # await self.read()
        # But running it in an async task does not:
        asyncio.create_task(self.read())

class JKviewer:
    def __init__(self):
        self.devices = {}
        self.npolling = 0
        self.busy = None

    def incref(self):
        self.npolling += 1

    def decref(self):
        self.npolling -= 1

    async def poll(self):
        # Find the JK with the oldest t_last_poll
        jk = None
        t_oldest = time.time()
        for address in self.devices.keys():
            jk2 = self.devices[address]
            if jk2.t_last_poll < t_oldest:
                t_oldest = jk2.t_last_poll
                jk = jk2
        if jk is None:
            print("nothing to do")
            return
        self.busy = jk.name
        # Poll oldest JK
        async def aux():
            jk.t_last_poll = time.time()
            await jk.poll()
            self.busy = None
        asyncio.create_task(aux())

    async def start(self):
        def on_scan(dev, data):
            if SERVICE_UUID in data.service_uuids:
                if dev.address not in self.devices:
                    jk = JK(self, dev, dev.address, dev.name)
                    self.devices[dev.address] = jk
            if dev.address in self.devices:
                self.devices[dev.address].t_last_seen = time.time()

        while True:
            # scan for a few seconds
            print("scanning")
            async with BleakScanner(on_scan) as scanner:
                await asyncio.sleep(3)
            # work the devices
            print("working")
            for _ in range(20):
                for _ in range(10):
                    await self.poll()
                    await asyncio.sleep(0.1)
            # wait for all the connections to be gone
            print("waiting")
            while self.npolling > 0:
                await asyncio.sleep(0.5)

jkviewer = JKviewer()
asyncio.run(jkviewer.start())
# vi: ft=python ts=4 sw=4 et
If things are going wrong in the kernel, there isn't much we can do in Bleak. I would suggest logging Bluetooth traffic with btmon and reporting the issue to the folks that wrote the kernel driver for whichever bluetooth adapter you are using.
As far as I know, this should abide by all the rules:
- do not connect while scanning
- do not perform connects in parallel
Where do these rules come from?
do not perform connects in parallel
Is this covered by always using await with client.connect()?
In your code it is possible that the event loop runs await client1.connect() followed by await client2.connect(), with each operation in a different task.
Where do these rules come from?
I think I gathered these from various comments and examples throughout the project. Please correct me if any of them are wrong.
Is this covered by always using await client.connect()?
Yes. As far as I understand my own code, await client1.connect() needs to finish before await client2.connect() can ever happen; the main poller keeps track of any in-progress connect with the self.busy flag. Only after await client1.connect() finishes will JK.poll return, and another client connect can be tried.
Anyway, I have abandoned any attempts to parallelize my polling, as I'm already having a hard time getting this to work reliably in a strictly serial way; see issue #1507...
What does self.busy do? It is only set to None in your snippet.
Your snippet does the following.
start = [await for poll] -> add aux to event loop -> repeat start
aux = [await for poll] -> await connect -> add read to event loop
read = [await notify, await write, await disconnect]
It's possible for aux to be in the event loop more than once, which means that multiple tasks can be waiting on await connect, and after they all connect, multiple connections can be open at the same time.
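The overlap is easy to demonstrate with plain asyncio (connect here is a stand-in counter, not a bleak call): two tasks scheduled back to back can both be parked inside the same awaited operation.

```python
import asyncio

in_connect = 0
max_concurrent = 0

async def connect():
    # counts how many tasks are inside "connect" at the same time
    global in_connect, max_concurrent
    in_connect += 1
    max_concurrent = max(max_concurrent, in_connect)
    await asyncio.sleep(0.05)  # yields to the event loop mid-connect
    in_connect -= 1

async def aux():
    await connect()

async def main():
    # two aux tasks scheduled back to back, as in the snippet above
    await asyncio.gather(aux(), aux())
    return max_concurrent

print(asyncio.run(main()))  # → 2: both tasks sat inside connect() together
```

Nothing in `await` itself serializes the two calls; only the structure of the calling code (or a lock) can do that.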
I apologize, I minimized my code to share the example in this issue and accidentally took out too much code; the busy flag was set to a value when creating the client poll task. I updated the above code.
Like I said, I should have simplified this example even more; the problem also occurs without running the asyncio task from the JKviewer.poll() function.
Having said all this: I'm still sure that no abuse or wrong API usage from userspace should be able to leave the kernel/driver/firmware confused beyond repair, but that's another story.
Thanks,
self.busy will show the name of the current device, but it can also be set multiple times without ever being reset to None: it can be set to a name while it already holds another name.
the problem also occurs without running the asyncio task from the JKviewer.poll() function.
Can you modify your code snippet to be the minimal example that shows the problem?
FYI, calling asyncio.create_task() in the poll() methods without keeping a handle to the returned task will allow the task to be garbage collected before the task completes. See https://docs.python.org/3.12/library/asyncio-task.html#asyncio.create_task
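The pattern the linked docs recommend keeps a strong reference to each task until it completes; a small sketch (spawn is a hypothetical helper name):

```python
import asyncio

background_tasks = set()

def spawn(coro):
    # keep a strong reference so the task cannot be garbage collected
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # drop the reference once the task finishes
    task.add_done_callback(background_tasks.discard)
    return task

async def work(n):
    await asyncio.sleep(0.01)
    return n

async def main():
    tasks = [spawn(work(i)) for i in range(3)]
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))  # → [0, 1, 2]
```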
Polling with await asyncio.sleep(0.1) also kind of defeats the purpose of asyncio. There are better alternatives like asyncio events, futures and queues depending on what you are trying to do.
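For instance, the poll_ready flag in the snippet above could become an asyncio.Event, so the reader waits for a wakeup instead of sleeping in a loop (a sketch with the bleak calls elided):

```python
import asyncio

class Poller:
    def __init__(self):
        self.poll_ready = asyncio.Event()

    def ncallback(self, data):
        # the notification handler signals completion instead of
        # flipping a flag that another coroutine polls
        self.poll_ready.set()

    async def read(self):
        self.poll_ready.clear()
        # ... start_notify / write request would go here ...
        await self.poll_ready.wait()  # no sleep(0.1) busy loop
        return "done"

async def main():
    p = Poller()
    # simulate a notification arriving shortly after read() starts
    asyncio.get_running_loop().call_later(0.05, p.ncallback, b"\x01")
    return await p.read()

print(asyncio.run(main()))  # → done
```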
FYI, calling asyncio.create_task() in the poll() methods without keeping a handle to the returned task will allow the task to be garbage collected before the task completes.
Right, I'll have to take care of that.
Polling with await asyncio.sleep(0.1) also kind of defeats the purpose of asyncio. There are better alternatives like asyncio events, futures and queues, depending on what you are trying to do.
Sure, I'm aware of that - my example is just lousy.
Shall I close this issue and reopen if/when I reproduce with a simpler and more compact program?
Shall I close this issue and reopen if/when I reproduce with a simpler and more compact program?
Sounds good. So far, this sounds more like a hardware issue than a Bleak issue.
Is it a problem if multiple tasks await client.connect()?
For example, event loop runs task1 which runs await client1.connect() which yields to the event loop and then the event loop runs tasks2 which runs await client2.connect()?
Is it a problem if multiple tasks await client.connect()?
Yes, it seems that is a reliable way to trigger this problem. My example code was messy, though, so if I find the time to get this running again I'll come up with a simpler snippet.
Closed for now.
Should that work though?
Why doesn't it work? Because the backend can't handle multiple connection requests at the same time???
If that's the case, can BleakClient ensure only one connect is requested to the backend at a time?
How do backends handle it when multiple processes call connect at the same time?
I'm not sure what causes the limitation, but the "two devices" example jumps through hoops to make sure only one connect() is running at a time:
https://github.com/hbldh/bleak/blob/develop/examples/two_devices.py#L40-L41
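The hoop in question is a single asyncio.Lock held across each connect; a stripped-down illustration with a stand-in connect (no bleak calls):

```python
import asyncio

async def connect(lock, state):
    # stand-in for client.connect(); records concurrent attempts
    async with lock:  # the same trick as in two_devices.py
        state["active"] += 1
        state["max"] = max(state["max"], state["active"])
        await asyncio.sleep(0.02)
        state["active"] -= 1

async def main():
    lock = asyncio.Lock()  # one lock shared by every connect
    state = {"active": 0, "max": 0}
    await asyncio.gather(*(connect(lock, state) for _ in range(5)))
    return state["max"]

print(asyncio.run(main()))  # → 1: connects never overlap
```

Note this only serializes connects within one process; another process using the same adapter is not covered by the lock.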
Why doesn't it work? Because the backend can't handle multiple connection requests at the same time???
It can be an OS limitation like BlueZ returning "operation already in progress" errors. Or it can be a hardware limitation like the issue here.
If that's the case, can BleakClient ensure only one connect is requested to the backend at a time?
I would rather leave it up to the user in case they want a different behavior, like getting a busy error instead of just waiting an indefinite amount of time. If we changed it now, I think it would break a lot of people's code that depends on the current behavior.
How do backends handle it when multiple processes call connect at the same time?
They probably don't handle it well, since they were never really designed to do that (it often doesn't work at the OS level anyway).
Thanks for explaining and sorry if I missed the documentation on this.
like getting a busy error
Is that what happens? Why not handle a busy error instead of using the lock in the two_devices.py example?
The busy error is specific to the BlueZ backend. The lock works cross-platform - and avoids the error in the first place.
The busy error is specific to the BlueZ backend. The lock works cross-platform - and avoids the error in the first place.
It only avoids the error if another process doesn't use Bluetooth.
If we changed it now, I think it would break a lot of people's code that depend on the current behavior.
There could be a new class with the new behaviour, and by using the new class you opt in to that behaviour.
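For illustration, that opt-in could be as small as a wrapper that shares one lock across all instances (SerializedClient is a hypothetical name, not something bleak provides):

```python
import asyncio

class SerializedClient:
    # Hypothetical opt-in wrapper, not part of bleak: all instances
    # share one lock, so only one connect() runs at a time.
    _lock = None

    def __init__(self, inner):
        self._inner = inner

    async def connect(self):
        # create the shared lock lazily, inside a running event loop
        if SerializedClient._lock is None:
            SerializedClient._lock = asyncio.Lock()
        async with SerializedClient._lock:
            await self._inner.connect()

    def __getattr__(self, name):
        # everything else passes straight through to the wrapped client
        return getattr(self._inner, name)

class StubClient:
    async def connect(self):
        await asyncio.sleep(0.01)

async def main():
    clients = [SerializedClient(StubClient()) for _ in range(3)]
    await asyncio.gather(*(c.connect() for c in clients))
    return "ok"

print(asyncio.run(main()))  # → ok
```

Existing code keeps the old behavior by continuing to use BleakClient directly.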