Thespian icon indicating copy to clipboard operation
Thespian copied to clipboard

Logging differences based on startup network state

Open bryang-spindance opened this issue 2 years ago • 2 comments

Hi again. Sorry to keep posting new issues for you.

Environment

Tested on Debian 10 (buster) with an ARM Cortex A8 (1 core) Also tested on an Ubuntu 20.04 LTS VM with 4 cores allocated to test if this was a threading issue.

Python Versions: 3.7.3 3.8.10, 3.9.5

Manually set the following environment variables: export THESPLOG_FILE=./thespian.log export THESPLOG_THRESHOLD=DEBUG

Problem

The Thespian logging system has unexpected behavior when network connectivity is lost on the host machine, specifically when network connectivity is lost in the form of the network adapter becoming disabled or disconnected. This is an issue since if WiFi connection is lost or an ethernet cable becomes unplugged, the network adapter is disabled.

There are two main scenarios that I've discovered when testing this issue. The two files below are used in both scenarios.

start.py

This is the startup script for this test. It sets up the actorsystem with logging then creates the top-level actor. It then enters a loop where it waits for a Ctrl+C (SIGINT) signal. When this signal is caught, the actor system is shut down and the program exits.

import signal
from thespian.actors import ActorSystem
import time
import top

import logging
class actorLogFilter(logging.Filter):
    def filter(self, logrecord):
        return 'actorAddress' in logrecord.__dict__

class notActorLogFilter(logging.Filter):
    def filter(self, logrecord):
        return 'actorAddress' not in logrecord.__dict__

logcfg = {
    'version': 1,
    'formatters': {
        'normal': {'format': '%(levelname)-8s %(message)s'}, 'actor': {'format': '%(levelname)-8s %(actorAddress)s => %(message)s'}
    },
    'filters': {'isActorLog': {'()': actorLogFilter}, 'notActorLog': {'()': notActorLogFilter}},
    'handlers': {
        'h1': {
            'class': 'logging.FileHandler',
            'filename': 'actor_log.log',
            'formatter': 'normal',
            'filters': ['notActorLog'],
            'level': logging.DEBUG
        },
        'h2': {
            'class': 'logging.FileHandler',
            'filename': 'actor_log.log',
            'formatter': 'actor',
            'filters': ['isActorLog'],
            'level': logging.DEBUG}
        },
    'loggers' : {'': {'handlers': ['h1', 'h2'], 'level': logging.DEBUG}
    }
}

def handler(signum: int, frame):
    ActorSystem('multiprocTCPBase').shutdown()
    exit(1)

if __name__ == "__main__":
    asys = ActorSystem('multiprocTCPBase', logDefs=logcfg)
    # Signal handling for stopping the actor system using Ctrl+C
    signal.signal(signal.SIGINT, handler)

    top_actor = asys.createActor(top.Top, globalName="top")
    asys.tell(top_actor, top.Initialize())

    while True:
        time.sleep(1)

top.py

This is the top level actor. It has two wakeup loops that we are using to figure out what is happening in the logs. The first wakeup loop is print which just prints out the number of times that it has woken up. We use this to ensure that the system is still running while network is disconnected. The second loop is network which checks if we have network connectivity every second and prints out "CONNECTED" or "DISCONNECTED" whenever the network state changes.

import logging as log
import os
import http.client as httplib
from thespian.actors import ActorTypeDispatcher

class Initialize:
    def __init__(self) -> None:
        pass

def ping(ip, timeout=0.5):
    conn = httplib.HTTPSConnection(ip, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return True
    except Exception:
        return False
    finally:
        conn.close()

def is_online(ips=None):
    ips = ["1.1.1.1", "8.8.8.8", "8.8.4.4", "4.2.2.4"] if ips is None else ips
    return any([ping(ip) for ip in ips])

class Top(ActorTypeDispatcher):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.wakeup_count = 0
        self.online = False

    def receiveMsg_Initialize(self, msg, sender):
        log.debug(f"Top {self.myAddress} (P|:{os.getpid()}) received Initialize command from {sender}")
        self.wakeupAfter(timePeriod=1, payload='network')
        self.wakeupAfter(timePeriod=5, payload='print')

    def receiveMsg_WakeupMessage(self, msg, sender):
        if msg.payload == 'network':
            log.debug(f'CHECKING NETWORK STATUS')
            prev_online = self.online
            self.online = is_online()
            if self.online != prev_online:
                if self.online:
                    log.critical(f'CONNECTED!')
                else:
                    log.critical(f'DISCONNECTED!')
            self.wakeupAfter(timePeriod=1, payload='network')
            return None

        if msg.payload == 'print':
            log.debug(f'WAKEUP COUNT: {self.wakeup_count}')
            self.wakeup_count += 1
            self.wakeupAfter(timePeriod=5, payload='print')
            return None

Scenario 1: Startup Online

In this scenario, the Thespian ActorSystem is created and initialized while there is an active network connection on the host machine (connected both to the local network and able to ping out to the internet).

Log Files: actor_log, thesplog

  1. Ensure network is connected and we can ping out to the internet (like 8.8.8.8)
  2. Run python start.py
  3. After logs start coming in to actor_log.log, disable network (disconnect from wifi, unplug ethernet cable, disable network adapter, etc.)
    • At this point, we notice that actor_log.log is no longer receiving any updates while thespian.log continues showing the top actor continuing execution as expected.
  4. Enable/reconnect network whenever you want.
    • After a short time after reconnecting to the network, actor_log.log will receive all of the queued messages that it missed while the network was disconnected.

In this scenario, it seems that the ActorSystem is queueing all messages to the outward-facing logging actor and waiting for the network connection to reestablish. However this seems odd to me since I would have thought that thespian would use the localhost (127.0.0.1) for all internal operations such as allowing the logging actor to write to its output file. I suppose I don't fully understand the inner workings of the TCP/UDP base.

My main concern with this is that the thespian internal logging may continue queueing these messages during long periods of network loss and it will cause a system crash. I noticed that the longer I waited before reconnecting the network, the longer it took for the queued messages to get written to the actor_log.log file.

Scenario 2: Startup Offline

In this scenario, the Thespian ActorSystem is created and initialized while there is not an active network connection on the host machine (network cable unplugged, wifi not connected, adapter disabled, etc.).

Log Files: actor_log, thesplog

  1. Ensure network is not connected so that we cannot ping out to the internet.
  2. Run python start.py
  3. We notice that logs will start coming in to actor_log.log.
  4. Connect to network.
    • We will see a "CONNECTED" message in actor_log.log.
  5. Disable/disconnect network connection.
    • We will see a "DISCONNECTED" message in actor_log.log.

This scenario is how I was expecting the ActorSystem's logging to operate in the event of network loss (such as in scenario 1's case).

Additional Notes

With multiprocUDPBase, Scenario 1 just errored out when I lost network connectivity. actor_log, thesplog

With multiprocUDPBase, Scenario 2 operated the same as with multiprocTCPBase.

Gists

Here is a Gist containing the log files that were mentioned in this problem: https://gist.github.com/bryang-spindance/9bfbe442f04aa588f5af20305c187d61

bryang-spindance avatar Mar 15 '22 17:03 bryang-spindance