Thespian
Thespian copied to clipboard
Logging differences based on startup network state
Hi again. Sorry to keep posting new issues for you.
Environment
Tested on Debian 10 (buster) with an ARM Cortex A8 (1 core) Also tested on an Ubuntu 20.04 LTS VM with 4 cores allocated to test if this was a threading issue.
Python Versions: 3.7.3 3.8.10, 3.9.5
Manually set the following environment variables:
export THESPLOG_FILE=./thespian.log
export THESPLOG_THRESHOLD=DEBUG
Problem
The Thespian logging system has unexpected behavior when network connectivity is lost on the host machine, specifically when network connectivity is lost in the form of the network adapter becoming disabled or disconnected. This is an issue since if WiFi connection is lost or an ethernet cable becomes unplugged, the network adapter is disabled.
There are two main scenarios that I've discovered when testing this issue. The two files below are used in both scenarios.
start.py
This is the startup script for this test. It sets up the actorsystem with logging then creates the top-level actor. It then enters a loop where it waits for a Ctrl+C (SIGINT) signal. When this signal is caught, the actor system is shut down and the program exits.
import signal
from thespian.actors import ActorSystem
import time
import top
import logging
class actorLogFilter(logging.Filter):
def filter(self, logrecord):
return 'actorAddress' in logrecord.__dict__
class notActorLogFilter(logging.Filter):
def filter(self, logrecord):
return 'actorAddress' not in logrecord.__dict__
logcfg = {
'version': 1,
'formatters': {
'normal': {'format': '%(levelname)-8s %(message)s'}, 'actor': {'format': '%(levelname)-8s %(actorAddress)s => %(message)s'}
},
'filters': {'isActorLog': {'()': actorLogFilter}, 'notActorLog': {'()': notActorLogFilter}},
'handlers': {
'h1': {
'class': 'logging.FileHandler',
'filename': 'actor_log.log',
'formatter': 'normal',
'filters': ['notActorLog'],
'level': logging.DEBUG
},
'h2': {
'class': 'logging.FileHandler',
'filename': 'actor_log.log',
'formatter': 'actor',
'filters': ['isActorLog'],
'level': logging.DEBUG}
},
'loggers' : {'': {'handlers': ['h1', 'h2'], 'level': logging.DEBUG}
}
}
def handler(signum: int, frame):
ActorSystem('multiprocTCPBase').shutdown()
exit(1)
if __name__ == "__main__":
asys = ActorSystem('multiprocTCPBase', logDefs=logcfg)
# Signal handling for stopping the actor system using Ctrl+C
signal.signal(signal.SIGINT, handler)
top_actor = asys.createActor(top.Top, globalName="top")
asys.tell(top_actor, top.Initialize())
while True:
time.sleep(1)
top.py
This is the top level actor. It has two wakeup loops that we are using to figure out what is happening in the logs. The first wakeup loop is print
which just prints out the number of times that it has woken up. We use this to ensure that the system is still running while network is disconnected. The second loop is network
which checks if we have network connectivity every second and prints out "CONNECTED" or "DISCONNECTED" whenever the network state changes.
import logging as log
import os
import http.client as httplib
from thespian.actors import ActorTypeDispatcher
class Initialize:
def __init__(self) -> None:
pass
def ping(ip, timeout=0.5):
conn = httplib.HTTPSConnection(ip, timeout=timeout)
try:
conn.request("HEAD", "/")
return True
except Exception:
return False
finally:
conn.close()
def is_online(ips=None):
ips = ["1.1.1.1", "8.8.8.8", "8.8.4.4", "4.2.2.4"] if ips is None else ips
return any([ping(ip) for ip in ips])
class Top(ActorTypeDispatcher):
def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.wakeup_count = 0
self.online = False
def receiveMsg_Initialize(self, msg, sender):
log.debug(f"Top {self.myAddress} (P|:{os.getpid()}) received Initialize command from {sender}")
self.wakeupAfter(timePeriod=1, payload='network')
self.wakeupAfter(timePeriod=5, payload='print')
def receiveMsg_WakeupMessage(self, msg, sender):
if msg.payload == 'network':
log.debug(f'CHECKING NETWORK STATUS')
prev_online = self.online
self.online = is_online()
if self.online != prev_online:
if self.online:
log.critical(f'CONNECTED!')
else:
log.critical(f'DISCONNECTED!')
self.wakeupAfter(timePeriod=1, payload='network')
return None
if msg.payload == 'print':
log.debug(f'WAKEUP COUNT: {self.wakeup_count}')
self.wakeup_count += 1
self.wakeupAfter(timePeriod=5, payload='print')
return None
Scenario 1: Startup Online
In this scenario, the Thespian ActorSystem is created and initialized while there is an active network connection on the host machine (connected both to the local network and able to ping out to the internet).
Log Files: actor_log, thesplog
- Ensure network is connected and we can ping out to the internet (like
8.8.8.8
) - Run
python start.py
- After logs start coming in to
actor_log.log
, disable network (disconnect from wifi, unplug ethernet cable, disable network adapter, etc.)- At this point, we notice that
actor_log.log
is no longer receiving any updates whilethespian.log
continues showing thetop
actor continuing execution as expected.
- At this point, we notice that
- Enable/reconnect network whenever you want.
- After a short time after reconnecting to the network,
actor_log.log
will receive all of the queued messages that it missed while the network was disconnected.
- After a short time after reconnecting to the network,
In this scenario, it seems that the ActorSystem is queueing all messages to the outward-facing logging actor and waiting for the network connection to reestablish. However this seems odd to me since I would have thought that thespian would use the localhost (127.0.0.1) for all internal operations such as allowing the logging actor to write to its output file. I suppose I don't fully understand the inner workings of the TCP/UDP base.
My main concern with this is that the thespian internal logging may continue queueing these messages during long periods of network loss and it will cause a system crash. I noticed that the longer I waited before reconnecting the network, the longer it took for the queued messages to get written to the actor_log.log
file.
Scenario 2: Startup Offline
In this scenario, the Thespian ActorSystem is created and initialized while there is not an active network connection on the host machine (network cable unplugged, wifi not connected, adapter disabled, etc.).
Log Files: actor_log, thesplog
- Ensure network is not connected so that we cannot ping out to the internet.
- Run
python start.py
- We notice that logs will start coming in to
actor_log.log
. - Connect to network.
- We will see a "CONNECTED" message in
actor_log.log
.
- We will see a "CONNECTED" message in
- Disable/disconnect network connection.
- We will see a "DISCONNECTED" message in
actor_log.log
.
- We will see a "DISCONNECTED" message in
This scenario is how I was expecting the ActorSystem's logging to operate in the event of network loss (such as in scenario 1's case).
Additional Notes
With multiprocUDPBase
, Scenario 1 just errored out when I lost network connectivity. actor_log, thesplog
With multiprocUDPBase
, Scenario 2 operated the same as with multiprocTCPBase
.
Gists
Here is a Gist containing the log files that were mentioned in this problem: https://gist.github.com/bryang-spindance/9bfbe442f04aa588f5af20305c187d61