rally icon indicating copy to clipboard operation
rally copied to clipboard

Upgrade thespian to fix hang on macOS

Open pquentin opened this issue 2 years ago • 7 comments

Closes #1575

See https://github.com/thespianpy/Thespian/issues/70 for more context.

pquentin avatar Jan 16 '23 13:01 pquentin

FWIW, I tested with "thespian==3.10.6" and I can't reproduce the thespian.actors.InvalidActorAddress exception.

b-deam avatar Jan 17 '23 05:01 b-deam

Thanks! Interesting. If it's not too much work, would you mind trying again with thespian debug logs activated and share the output?

pquentin avatar Jan 17 '23 06:01 pquentin

Thanks! Interesting. If it's not too much work, would you mind trying again with thespian debug logs activated and share the output?

Sure!

I think the key line is here, but I haven't dug into why my system is no longer meeting the capabilities requirements. In addition to testing with 3.10.6 I took a look through the Release Notes to see if there were any changes to the constructor that we hadn't adopted, but the fact that I can only repro this on 3.10.7 leads to to believe that Fix Issue #70: setting TCP_NODELAY on MacOS sometimes fails is probably to blame.

$ tail -f actor-system-internal.log
2023-01-17 17:24:23.715134 p11971 dbg  ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-17 17:24:23.715477 p11971 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 10)
2023-01-17 17:24:28.757419 p12025 dbg  ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-17 17:24:28.761218 p12025 I    ++++ Admin started @ ActorAddr-Q.ThespianQ / gen (3, 10)
2023-01-17 17:24:28.770882 p12025 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.ThespianQ.a, <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x106bd5dc0>)
2023-01-17 17:24:28.771869 p12025 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999780-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x106bd5f10>-quit_0:04:59.999757)
2023-01-17 17:24:28.774415 p11963 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999684-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x106bcaf70>-quit_0:04:59.999662)
2023-01-17 17:24:28.775267 p12025 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x106be6040>)
2023-01-17 17:24:28.775951 p12025 dbg  Attempting intent TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999735-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x106bd5d60>-quit_0:04:59.999715)
2023-01-17 17:24:28.776373 p12025 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999303-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x106bd5d60>-quit_0:04:59.999289)
2023-01-17 17:24:28.779637 p11963 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999913-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_None-quit_0:04:59.999898)
2023-01-17 17:24:28.780059 p12025 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~1, <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_None)
2023-01-17 17:24:28.780399 p12025 I    Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-Q.~1
2023-01-17 17:24:28.781272 p12025 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999898-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 10, /Users/bradleydeam/perf/github.com/b-deam/rally/esrally/actor.py, 122, "Checking capabilities [{'Thespian ActorSystem N...-quit_0:04:59.999882)
- 2023-01-17 17:24:28.782002 p12025 Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor
2023-01-17 17:24:28.782706 p12025 dbg  Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999777-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999756)
2023-01-17 17:24:28.783091 p12025 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999380-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999366)
2023-01-17 17:24:31.786774 p11963 I    ActorSystem shutdown requested.

Also worth adding a stripped down repro using esrallyd:

$ esrallyd start --coordinator-ip 192.168.20.6 --node-ip 192.168.20.6
zsh: /usr/local/bin/esrallyd: bad interpreter: /usr/local/opt/python/bin/python3.7: no such file or directory
multiprocTCPBase {'coordinator': True, 'ip': '192.168.20.6', 'Convention Address.IPv4': '192.168.20.6:1900'}
Traceback (most recent call last):
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/bin/esrallyd", line 8, in <module>
    sys.exit(main())
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rallyd.py", line 107, in main
    start(args)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rallyd.py", line 39, in start
    actor.bootstrap_actor_system(local_ip=args.node_ip, coordinator_ip=args.coordinator_ip)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/actor.py", line 264, in bootstrap_actor_system
    return thespian.actors.ActorSystem(system_base, logDefs=log.load_configuration(), capabilities=capabilities)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 637, in __init__
    systemBase = self._startupActorSys(
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 678, in _startupActorSys
    systemBase = sbc(self, logDefs=logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocTCPBase.py", line 28, in __init__
    super(ActorSystemBase, self).__init__(system, logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 83, in __init__
    super(multiprocessCommon, self).__init__(system, logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/systemBase.py", line 326, in __init__
    self._startAdmin(self.adminAddr,
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 115, in _startAdmin
    raise InvalidActorAddress(adminAddr,
thespian.actors.InvalidActorAddress: ActorAddr-(T|:1900) is not a valid ActorSystem admin

b-deam avatar Jan 17 '23 06:01 b-deam

Thanks a lot! The only information left I need to report the bug to Thespian would be the same logs but on thespian 3.10.6.

Also, why specifically 192.168.20.6? Is it you local IP address?

pquentin avatar Jan 17 '23 08:01 pquentin

I tried it against a non-loopback interface and got the same error. I wonder if @inqueue has the same issue?

I see the same issue and I also saw it with a patched 3.10.6 during testing. Unfortunately, I actually validated the changes from https://github.com/thespianpy/Thespian/tree/issue70 by patching the version currently in use by Rally (3.10.1). The error has been referenced at https://github.com/thespianpy/Thespian/issues/70#issuecomment-1385832797.

inqueue avatar Jan 17 '23 18:01 inqueue

For what it's worth I'm seeing Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor on Linux too. Comparing the logs between Thespian 3.10.6 and 3.10.7 would help a lot.

pquentin avatar Jan 18 '23 07:01 pquentin

3.10.6

Patch:

diff --git a/pyproject.toml b/pyproject.toml
index 77469e44fceb228f02c9407bbc1f0709c351c1bf..4aee4ecf020899ce2dd0c0e02e2aa45d6674a706 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -58,7 +58,7 @@ dependencies = [
     "Jinja2==2.11.3",
     "markupsafe==2.0.1",
     # License: MIT
-    "thespian==3.10.7",
+    "thespian==3.10.6",
     # recommended library for thespian to identify actors more easily with `ps`
     # "setproctitle==1.1.10",
     # always use the latest version, these are certificate files...

Invocation:

$ make install
[...]
$ export THESPLOG_FILE="${THESPLOG_FILE:-${HOME}/.rally/logs/actor-system-internal.log}"
$ export THESPLOG_FILE_MAXSIZE=${THESPLOG_FILE_MAXSIZE:-204800}
$ export THESPLOG_THRESHOLD="DEBUG"
$ esrallyd start --coordinator-ip 192.168.20.6 --node-ip 192.168.20.6
[INFO] Successfully started actor system on node [192.168.20.6] with coordinator node IP [192.168.20.6].

Actor logs:

2023-01-19 14:18:05.685305 p6863 dbg  ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-19 14:18:05.685794 p6863 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 10)
2023-01-19 14:18:05.691125 p6863 dbg  Admin of ReceiveEnvelope(from: ActorAddr-(T|:64880), <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x1026c5610>)
2023-01-19 14:18:05.691638 p6863 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:64879)-pending-ExpiresIn_0:04:59.999885-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x1026c5a00>-quit_0:04:59.999870)
2023-01-19 14:18:05.693301 p6844 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.999882-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x1026a4c40>-quit_0:04:59.999870)
2023-01-19 14:18:05.693816 p6863 dbg  Admin of ReceiveEnvelope(from: ActorAddr-(T|:64879), <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x1026c5eb0>)
2023-01-19 14:18:05.694198 p6863 dbg  Attempting intent TransportIntent(ActorAddr-(T|:64879)-pending-ExpiresIn_0:04:59.999883-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x1026c5040>-quit_0:04:59.999874)
2023-01-19 14:18:05.694414 p6863 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:64879)-pending-ExpiresIn_0:04:59.999663-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x1026c5040>-quit_0:04:59.999656)

Rally logs:

2023-01-19 03:48:05,472 -not-actor-/PID:6844 esrally.utils.process DEBUG Running subprocess [git -C /Users/bradleydeam/perf/github.com/b-deam/rally/esrally --version] with logging.
2023-01-19 03:48:05,483 -not-actor-/PID:6844 esrally.utils.process DEBUG git version 2.37.1 (Apple Git-137.1)

2023-01-19 03:48:05,483 -not-actor-/PID:6844 esrally.utils.process DEBUG Subprocess [git -C /Users/bradleydeam/perf/github.com/b-deam/rally/esrally --version] finished with return code [0].
2023-01-19 03:48:05,484 -not-actor-/PID:6844 esrally.utils.process DEBUG Running subprocess [git -C /Users/bradleydeam/perf/github.com/b-deam/rally/esrally rev-parse HEAD] with output.
2023-01-19 03:48:05,499 -not-actor-/PID:6844 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': True, 'ip': '192.168.20.6', 'Convention Address.IPv4': '192.168.20.6:1900'}].
2023-01-19 03:48:05,686 -not-actor-/PID:6863 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2023-01-19 03:48:05,686 -not-actor-/PID:6863 root DEBUG Thespian source: /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py

3.10.7

Invocation:

$ export THESPLOG_FILE="${THESPLOG_FILE:-${HOME}/.rally/logs/actor-system-internal.log}"
$ export THESPLOG_FILE_MAXSIZE=${THESPLOG_FILE_MAXSIZE:-204800}
$ export THESPLOG_THRESHOLD="DEBUG"
$ esrallyd start --coordinator-ip 192.168.20.6 --node-ip 192.168.20.6
Traceback (most recent call last):
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/bin/esrallyd", line 8, in <module>
    sys.exit(main())
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rallyd.py", line 107, in main
    start(args)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rallyd.py", line 39, in start
    actor.bootstrap_actor_system(local_ip=args.node_ip, coordinator_ip=args.coordinator_ip)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/actor.py", line 263, in bootstrap_actor_system
    return thespian.actors.ActorSystem(system_base, logDefs=log.load_configuration(), capabilities=capabilities)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 637, in __init__
    systemBase = self._startupActorSys(
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 678, in _startupActorSys
    systemBase = sbc(self, logDefs=logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocTCPBase.py", line 28, in __init__
    super(ActorSystemBase, self).__init__(system, logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 83, in __init__
    super(multiprocessCommon, self).__init__(system, logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/systemBase.py", line 326, in __init__
    self._startAdmin(self.adminAddr,
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 115, in _startAdmin
    raise InvalidActorAddress(adminAddr,
thespian.actors.InvalidActorAddress: ActorAddr-(T|:1900) is not a valid ActorSystem admin

Actor logs:

2023-01-19 14:10:37.967074 p91216 dbg  ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-19 14:10:37.967405 p91216 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 10)

Rally logs:

2023-01-19 03:40:37,917 -not-actor-/PID:91212 esrally.utils.process DEBUG Running subprocess [git -C /Users/bradleydeam/perf/github.com/b-deam/rally/esrally --version] with logging.
2023-01-19 03:40:37,932 -not-actor-/PID:91212 esrally.utils.process DEBUG git version 2.37.1 (Apple Git-137.1)

2023-01-19 03:40:37,932 -not-actor-/PID:91212 esrally.utils.process DEBUG Subprocess [git -C /Users/bradleydeam/perf/github.com/b-deam/rally/esrally --version] finished with return code [0].
2023-01-19 03:40:37,932 -not-actor-/PID:91212 esrally.utils.process DEBUG Running subprocess [git -C /Users/bradleydeam/perf/github.com/b-deam/rally/esrally rev-parse HEAD] with output.
2023-01-19 03:40:37,951 -not-actor-/PID:91212 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': True, 'ip': '192.168.20.6', 'Convention Address.IPv4': '192.168.20.6:1900'}].
2023-01-19 03:40:37,967 -not-actor-/PID:91216 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2023-01-19 03:40:37,967 -not-actor-/PID:91216 root DEBUG Thespian source: /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-19 03:40:42,963 -not-actor-/PID:91212 esrally.actor ERROR Could not initialize internal actor system.
Traceback (most recent call last):
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/actor.py", line 263, in bootstrap_actor_system
    return thespian.actors.ActorSystem(system_base, logDefs=log.load_configuration(), capabilities=capabilities)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 637, in __init__
    systemBase = self._startupActorSys(
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 678, in _startupActorSys
    systemBase = sbc(self, logDefs=logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocTCPBase.py", line 28, in __init__
    super(ActorSystemBase, self).__init__(system, logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 83, in __init__
    super(multiprocessCommon, self).__init__(system, logDefs)
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/systemBase.py", line 326, in __init__
    self._startAdmin(self.adminAddr,
  File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 115, in _startAdmin
    raise InvalidActorAddress(adminAddr,
thespian.actors.InvalidActorAddress: ActorAddr-(T|:1900) is not a valid ActorSystem admin

b-deam avatar Jan 19 '23 03:01 b-deam