azure-iot-sdk-python
failed reconnect behavior after connectivity outage
Context
- OS: Raspbian, kernel 5.15.32-v7+, on a Pi Zero 2 W
- Python 3.9.2
- pip 20.3.4 from /usr/lib/python3/dist-packages/pip (python 3.9)
- Packages installed (pip list):
azure-core 1.21.1
azure-iot-device 2.9.0
azure-iot-hub 2.6.0
azure-storage-blob 12.9.0
certifi 2020.6.20
cffi 1.15.0
chardet 4.0.0
colorzero 1.1
cryptography 36.0.0
deprecation 2.1.0
distro 1.5.0
gpiozero 1.6.2
idna 2.10
isodate 0.6.0
janus 0.7.0
msrest 0.6.21
numpy 1.21.4
oauthlib 3.1.1
opencv-python 4.5.4.60
packaging 21.3
paho-mqtt 1.6.1
picamera 1.13
pip 20.3.4
psutil 5.9.0
PyAudio 0.2.11
pycparser 2.21
pyftdi 0.53.3
pynmea2 1.18.0
pyparsing 3.0.6
pyserial 3.5
PySocks 1.7.1
python-apt 2.2.1
pyusb 1.2.1
requests 2.25.1
requests-oauthlib 1.3.0
requests-unixsocket 0.2.0
rpi-ws281x 4.3.1
RPi.GPIO 0.7.0
schedule 1.1.0
scipy 1.7.3
setuptools 52.0.0
SI1145 1.0.2
six 1.16.0
smbus 1.1.post2
smbus2 0.4.1
spidev 3.5
ssh-import-id 5.10
sysv-ipc 1.1.0
typing-extensions 4.0.1
uamqp 1.4.3
urllib3 1.26.5
VL53L0X 1.0.4
wheel 0.34.2
Description of the issue
The Azure IoT Hub device SDK does not reconnect properly after an internet/connectivity outage. See the final error(s) in the log below.
Code sample exhibiting the issue
import json
import logging

import schedule  # the 'schedule' job-scheduling package from the pip list above
from azure.iot.device import IoTHubDeviceClient, Message

# trollyID, trollyState, sched() and pause() are defined elsewhere in the application


def iotHubClientInit():  # gets the IoTHub client using the connection string
    try:
        connectionString = "xxx"  # this comes from Azure
        client = IoTHubDeviceClient.create_from_connection_string(connectionString)
        return client
    except Exception:
        logging.critical("Failed to connect to Azure IOTHub")


def requestSchedule():
    iothubMessageText = '{{"trollyID": "{trollyID}", "schedule_req": 1}}'
    iothubMessageText_formatted = iothubMessageText.format(trollyID=trollyID)
    message = Message(iothubMessageText_formatted)
    logging.info("Requested schedule from cloud")
    try:
        client.send_message(message)
        return True
    except Exception:
        logging.critical("Failed to request schedule to Azure IOTHub")


def messageFromAzure(message):
    incomingCommands = json.loads(message.data)
    # logging.info(f"Message received: {message.data}")
    if isinstance(incomingCommands, list):
        schedule.clear('cloud-scheduled')  # clear existing online scheduled tasks
        trollyState.heartBeat(True)  # reset our heartbeats
        for incomingList in incomingCommands:
            if 'freq_type' in incomingList.keys() and 'freq_interval' in incomingList.keys():  # this is a schedule list
                logging.info(f"Incoming schedule: {incomingList}")
                sched(incomingList)
        for job in schedule.get_jobs():  # let's log the jobs nicely
            # logging.info(f"Scheduled job: {job}")
            if job.tags:  # show additional detail for cloud-scheduled jobs
                logging.info(f"  {job.start_day} at {job.at_time}")
    elif incomingCommands["Type"] == "System" and incomingCommands["Message"] == "stop":
        pause()
        logging.info("Pausing due to stop message from cloud")
    else:
        logging.critical(f"Command from Azure I'm not sure what to do with :: {incomingCommands}")
Console log of the issue
#the request schedule is iot device sending a 'ping' message to iothub
#the return is iothub responding (after hitting microservice/sql backend)
2022-05-10 20:14:51, INFO - Requested schedule from cloud (line 1212)
2022-05-10 20:14:56, INFO - hearbeat returned in 4.761431694030762 seconds (line 425)
2022-05-10 20:44:56, INFO - Requested schedule from cloud (line 1212)
2022-05-10 20:45:00, INFO - hearbeat returned in 3.891179084777832 seconds (line 425)
2022-05-10 21:15:00, INFO - Requested schedule from cloud (line 1212)
2022-05-10 21:15:05, INFO - hearbeat returned in 4.768422365188599 seconds (line 425)
#wifi outage - while there is no internet connectivity, the code does not attempt Azure IoT Hub ping/heartbeats
2022-05-10 21:45:51, WARNING - ReconnectStage: DisconnectEvent received while in unexpected state - DISCONNECTED, Connected: False (line 1281)
2022-05-10 21:45:51, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:45:51, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection') caused by ProtocolClientError('Unknown CONNACK rc==16')\n"] (line 29)
2022-05-10 21:45:51, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:45:51, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection')\n"] (line 29)
2022-05-10 21:45:57, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:45:57, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 21:46:14, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:46:14, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 21:46:30, ERROR - Exception caught in background thread. Unable to handle. (line 28)
#repeat a whole bunch, snipped for brevity
2022-05-10 21:58:25, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 21:58:36, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:58:36, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 21:58:46, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:58:46, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 21:58:56, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 21:58:56, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 22:04:08, WARNING - ReconnectStage: DisconnectEvent received while in unexpected state - DISCONNECTED, Connected: False (line 1281)
2022-05-10 22:04:08, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:04:08, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection') caused by ProtocolClientError('Unknown CONNACK rc==16')\n"] (line 29)
2022-05-10 22:04:08, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:04:08, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection')\n"] (line 29)
2022-05-10 22:04:21, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:04:21, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 22:04:41, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:04:41, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 22:05:04, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:05:04, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 22:05:26, ERROR - Exception caught in background thread. Unable to handle. (line 28)
#repeat a whole bunch, snipped for brevity
2022-05-10 22:12:09, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 22:12:32, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:12:32, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
2022-05-10 22:12:52, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:12:52, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"] (line 29)
#wifi back - the "Internet connectivity has been restored" message is our python code noting a successful ping to a nameserver
2022-05-10 22:15:08, INFO - Internet connectivity has been restored (line 1450)
#back to pinging azure iothub (these don't happen when offline)
2022-05-10 22:15:08, INFO - Requested schedule from cloud (line 1212)
2022-05-10 22:15:12, INFO - hearbeat returned in 4.2701239585876465 seconds (line 425)
#received message from iPad (separate device) through iothub asking for telemetry, and this device successfully answers
2022-05-10 22:19:10, INFO - Sending message to Azure: {"xLocation": 0, "yLocation": 40, "zLocation": 1, "batLevel": 86, "CO2": -999, "temperature": -999,"humidity": -999, "UV": -999, "FS":-999, "IR": -999} (line 1185)
#believe there are two replies as I requested telemetry twice on the iPad
2022-05-10 22:19:20, INFO - Sending message to Azure: {"xLocation": 0, "yLocation": 40, "zLocation": 1, "batLevel": 86, "CO2": -999, "temperature": -999,"humidity": -999, "UV": -999, "FS":-999, "IR": -999} (line 1185)
2022-05-10 22:42:27, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:43:29, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection') caused by ConnectionDroppedError('The connection was lost.')\n"] (line 29)
2022-05-10 22:50:47, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 22:52:51, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1123)')\n"] (line 29)
2022-05-10 23:03:19, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 23:04:41, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1123)')\n"] (line 29)
2022-05-10 23:06:50, INFO - Requested schedule from cloud (line 1212)
2022-05-10 23:08:51, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 23:09:23, ERROR - ["azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: OperationTimeout('Transport timeout on connection operation')\n"] (line 29)
2022-05-10 23:15:10, CRITICAL - Failed to request schedule to Azure IOTHub (line 1217)
2022-05-10 23:15:55, ERROR - Exception caught in background thread. Unable to handle. (line 28)
2022-05-10 23:16:31, INFO - Requested schedule from cloud (line 1212)
2022-05-10 23:16:42, ERROR - ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection')\n"] (line 29)
2022-05-10 23:34:09, WARNING - ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False (line 1281)
2022-05-10 23:35:47, ERROR - Exception caught in background thread. Unable to handle. (line 28)
@robarnes
This seems to be the same issue as #989, but we have not yet been able to reproduce this on our end.
Can you be specific about the exact steps you take to reproduce this issue?
This issue occasionally happens on just one of my IoT Edge devices; sometimes I have to restart it manually, but this time it needed a few retries before it could connect to IoT Hub.
2022-05-24 01:19:52 [DEBUG] http://localhost:None "GET /trust-bundle?api-version=2019-01-30 HTTP/1.1" 200 1944
2022-05-24 01:19:52 [DEBUG] http://localhost:None "POST /modules/<redacted>/genid/637812807273287182/sign?api-version=2019-01-30 HTTP/1.1" 200 57
2022-05-24 01:19:52 [WARNING] ReconnectStage: DisconnectEvent received while in unexpected state - DISCONNECTED, Connected: False
2022-05-24 01:19:52 [ERROR] Exception caught in background thread. Unable to handle.
2022-05-24 01:19:52 [ERROR] ["azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Unexpected disconnection') caused by ConnectionFailedError('The connection was refused.')\n"]
[2022-05-24 01:19:52 +0000] [8] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/azure/iot/device/iothub/sync_clients.py", line 34, in handle_result
return callback.wait_for_completion()
File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/evented_callback.py", line 70, in wait_for_completion
raise self.exception
azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError('Connection Refused: broker unavailable.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.7/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/app/wsgi.py", line 4, in <module>
app = create_app()
File "/app/main.py", line 33, in create_app
aie_client.connect_to_iothub()
File "/app/models/AieClient.py", line 22, in connect_to_iothub
self.__module_client.connect()
File "/usr/local/lib/python3.7/site-packages/azure/iot/device/patch_documentation.py", line 146, in connect
return super(IoTHubModuleClient, self).connect()
File "/usr/local/lib/python3.7/site-packages/azure/iot/device/iothub/sync_clients.py", line 238, in connect
handle_result(callback)
File "/usr/local/lib/python3.7/site-packages/azure/iot/device/iothub/sync_clients.py", line 38, in handle_result
raise exceptions.ConnectionFailedError(message="Could not connect to IoTHub", cause=e)
azure.iot.device.exceptions.ConnectionFailedError: ConnectionFailedError('Could not connect to IoTHub') caused by ConnectionFailedError('Connection Refused: broker unavailable.')
@chewcw
Likely unrelated, but I wanted to know which version of Edge you are using. Is it 1.1 or 1.2? If it is 1.2, does this problem still occur if you switch back to 1.1?
I use v1.1. For me this case is closed. Since my device is not collecting critical data, my dirty workaround is to automatically restart the container once in a while; I have not seen the error since then.
Thanks, just checking. We'll leave this issue open until the underlying problem is addressed; we're working on it.
@cartertinney We recently also had this issue.
I've only seen this issue in one of the ~150 locations where we have an IoT device running. At the specific location where I've seen the issue, we have multiple IoT devices on that network. All devices had the same issue at the same time.
First we see them drop the connection, and a few minutes later they go back to "connected", but I don't think they are really connected, because we could no longer send any messages to the cloud and cloud-to-device calls were also failing.
Some timestamps from those IoT devices:
Device1:
- OFFLINE: 2022-06-14T06:52:08.371138Z
- Back connected: 2022-06-14T07:04:01.330314Z
Device2:
- OFFLINE: 2022-06-14T06:51:08.618656Z
- Back connected: 2022-06-14T07:27:51.403518Z
Device3:
- OFFLINE: 2022-06-14T06:51:10.831858Z
- Back connected: 2022-06-14T07:12:28.085804Z
Device4:
- OFFLINE: 2022-06-14T06:52:06.431400Z
- Back connected: 2022-06-14T07:04:22.240997Z
Device5:
- OFFLINE: 2022-06-14T06:52:06.305089Z
- Back connected: 2022-06-14T07:03:47.626239Z
Device6:
- OFFLINE: 2022-06-14T06:51:11.807573Z
- Back connected: 2022-06-14T07:04:13.902870Z
Device7:
- OFFLINE: 2022-06-14T06:51:12.630817Z
- Back connected: 2022-06-14T06:53:16.747793Z
We had to reboot all the devices in order to get the IoT Hub communication working again. Other calls to some public REST servers started working again on their own; it was only the IoT Hub communication that did not recover.
We have no idea what happened on that network, but it looks like some network infra went down.
Maybe the iot-hub-device-sdk doesn't handle a reboot of network infra very well?
In the past I've added this just before I call iot_hub_device_client.connect() when it got disconnected:
import ctypes

try:
    libc = ctypes.cdll.LoadLibrary('libc.so.6')
    res_init = getattr(libc, "__res_init")
except (OSError, AttributeError):
    res_init_found = False
    LOG.exception("Failed to fetch __res_init")
I had to do this to get the SDK working in the situation where the network infra AND our IoT device rebooted at the same time. Our IoT device got an IP from the network infra, but the DNS servers were not passed yet. That __res_init() call forces the application to refresh them. Some libraries do this themselves (like the C version of the IoT Hub device SDK, I think), but the Python SDK does not. I've already talked about this with @BertKleewein.
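For illustration, a minimal sketch of how that workaround can be wired together, assuming (as described above) that __res_init is invoked right before asking the SDK to reconnect; the function name reconnect_with_res_init is illustrative and not from our real code.

import ctypes
import logging


def reconnect_with_res_init(client):
    # Force glibc to re-read /etc/resolv.conf so DNS servers handed out after
    # startup are actually used, then let the SDK reconnect.
    try:
        libc = ctypes.cdll.LoadLibrary("libc.so.6")
        getattr(libc, "__res_init")()
    except (OSError, AttributeError):
        logging.exception("Failed to fetch/call __res_init; reconnecting anyway")
    client.connect()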
I don't know, maybe this issue is somehow related. In any case, it's very strange that they all went offline at the same time and none of them recovered without a reboot (while other internet communication recovered on its own).
Hope this info helps to find the issue. Attached: log.txt.
@robarnes @chewcw @robbinvandamme
Are any of you using a proxy by chance? Perhaps a VPN? Additionally, what version of OpenSSL are you using?
@cartertinney We are using libssl.so.1.1 in our application. I can't tell for sure whether there is a proxy or not; it is always the client's network that we are connected to.
But I would be surprised if there were a proxy at the location I talked about above.
I've also seen strange/similar iot-hub-device-client behavior at another client who had a standard home network. There too, I had the feeling it was related to starting our application at the same time as the network infra. We had a few different outgoing connections from our application, but there as well, only the IoT Hub connection failed.
Kind regards,
Robbin
@robbinvandamme
Since you are the only person who still seems to be experiencing this problem, would you mind opening a new GitHub issue focused on your situation and logs? We also just released version 2.12.0 of the azure-iot-device package, which has numerous stability and logging fixes. I can't say it will fix your issue, but it will at least remove some potential confounds as we investigate.
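As a quick sanity check after upgrading (a minimal sketch, assuming the device runs Python 3.8+ so importlib.metadata is available), you can confirm which azure-iot-device version is actually installed:

from importlib.metadata import version  # stdlib in Python 3.8+

print(version("azure-iot-device"))  # should print "2.12.0" after the upgrade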
OK, I'll update the devices to the latest version in one of the upcoming months. If we still run into connection issues, I'll open a new ticket.