multiprocTCPBase: Example keeps breaking on `InvalidActorSpecification` on my machine

Open s-t-a-n opened this issue 2 years ago • 4 comments

Hi!

I've been playing with multiprocTCPBase all evening for a project, but I have the weirdest issue with it. It seems to be tracking state across runs somehow.

I started out with simpleSystemBase and, for timing reasons, switched to multiprocTCPBase. I got weird stack traces from simple actors running some basic wakeupAfter logic. I gave up early, as it seemed I was somehow completely borking this.

After a cup of coffee I picked up the example hellogoodbye.py from this repo and everything works.

I refactor code back to goal state. Sort of works. I keep cycling through normal write and build cycles. Then poof, same issue as before.

Tracing back to the example, I found that it is now no longer working either, even across reboots.

Killing Python processes should do away with all state, no? I nuke 'em with pkill -9 and check that there are no Python processes stuck.

Trace from running hellogoodbye.py:

Stacktrace

stan@host ~ % python3.9 thespian_multitcp.py multiprocTCPBase 
Traceback (most recent call last):
  File "/home/stan/thespian_multitcp.py", line 43, in <module>
    run_example(sys.argv[1] if len(sys.argv) > 1 else None)
  File "/home/stan/thespian_multitcp.py", line 34, in run_example
    hello = ActorSystem().createActor(Hello)
  File "/home/stan/.local/lib/python3.9/site-packages/thespian/actors.py", line 704, in createActor
    return self._systemBase.newPrimaryActor(actorClass,
  File "/home/stan/.local/lib/python3.9/site-packages/thespian/system/systemBase.py", line 195, in newPrimaryActor
    raise InvalidActorSpecification(actorClass,
thespian.actors.InvalidActorSpecification: Invalid Actor Specification: <class '__main__.Hello'> (module '__main__' has no attribute 'Hello')

It seems hello_callable = ActorSystem().createActor(HelloCallable) comes back with an InvalidActorSpecification for not finding the Hello attribute on __main__? Am I completely missing something here? Please tell me it's an 'aha' since otherwise this library is solid!

Details

OS: Ubuntu 20.04
Python: Python 3.9.12
Pip: 22.0.4
Package: thespian==3.10.6

s-t-a-n avatar Apr 11 '22 23:04 s-t-a-n

Thanks for the detailed information. I don't know for sure, but what I expect is happening is that an Admin is still running.

In general, the multiproc bases will start an Admin as a separate process (and also a Logger) to manage the actor system; you can think of this more like a daemon or system process. This helps provide persistent actor system functionality for the more ephemeral actors, but it does have the side effect that the information in memory when the Admin was started will persist. In this case, the InvalidActorSpecification about not knowing the Hello actor is likely because there is still a MultiProcAdmin process running from your other experiments... one which was started from different sources and didn't have a class Hello(...):... defined.
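One rough way to check for a leftover Admin is to see whether anything is still accepting TCP connections on the Admin port. The sketch below uses only the standard library. It assumes the Admin listens on port 1900 (believed to be the default "Admin Port" for multiprocTCPBase, but worth verifying for your setup), and the helper name `admin_port_in_use` is made up for illustration. A positive result only means that some process is listening there, not necessarily a Thespian Admin.

```python
import socket


def admin_port_in_use(port=1900, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port.

    Port 1900 is an assumption about the default Admin port; pass the
    port your actor system actually uses if it differs.
    """
    try:
        # A successful connect means a listener exists; close it right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


print("listener on assumed Admin port:", admin_port_in_use())
```

If this reports a listener before your script has created any ActorSystem, an Admin from an earlier run is a plausible suspect.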

I will also note that if you have the setproctitle package available in your environment, Thespian will automatically use it to give the actors more meaningful names (and the admin will show up in a ps listing as MultiProcAdmin). Without this package being present, they will be python processes and it's a bit more difficult to determine what each one is doing (and they will be mixed with any other python processes running on your system).
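With setproctitle installed, spotting a leftover Admin can be scripted by filtering a process listing for the MultiProcAdmin title mentioned above. This is a minimal sketch for Unix-like systems: the function names are hypothetical, and the `ps -A -o pid,args` invocation is an assumption about the local `ps`, not something Thespian provides.

```python
import subprocess


def admin_lines(ps_output, marker="MultiProcAdmin"):
    """Return the lines of a ps listing that mention the Admin process title."""
    return [line for line in ps_output.splitlines() if marker in line]


def list_running_admins():
    """Run ps and keep only the lines that look like a Thespian Admin.

    Assumes a ps that accepts `-A -o pid,args`; adjust for your platform.
    """
    listing = subprocess.run(
        ["ps", "-A", "-o", "pid,args"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return admin_lines(listing)
```

Calling `list_running_admins()` before starting a new run shows whether an older Admin is still around; without setproctitle the titles are plain python processes and this filter will find nothing.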

There are a couple of directions you can go from here:

  1. Use a try/finally block to ensure the MultiProcAdmin is not left running when you don't want it to be:
       asys = ActorSystem(...)
       try:
           actor_stuff_here
       finally:
           asys.shutdown()
  2. Start up the actor system as ActorSystem(..., transientUnique=True), which will ensure you get a fresh copy each time (see https://thespianpy.com/doc/using.html#hH-f71b7bfa-c57b-4716-a2c7-ad83a2ed3582). You will probably still want the try/finally technique above to ensure they are cleaned up, but older actor systems should not interfere with new ones.
  3. Use loadable sources (see https://thespianpy.com/doc/using.html#hH-73b564d2-faa3-4fc8-bfdf-5362626c03be). This is a more complicated approach, but it allows fresh Actor code to be loaded to a running system. The real advantage of this is the ability to run multiple versions of your Actor definitions in parallel, allowing for a hot-upgrade of production environments. The Director (https://thespianpy.com/doc/director.html) can help with creating and managing loadable sources, but you may want to use techniques 1 and 2 until you need the capabilities provided by loadable sources.
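The first two techniques can be combined in a small context manager, sketched below. The names `actor_system` and `run_hello` are hypothetical and not part of Thespian; the only Thespian calls used are `ActorSystem(..., transientUnique=True)`, `createActor`, `ask`, and `shutdown`. Thespian is imported lazily inside `run_hello`, so the helper itself has no hard dependency on it.

```python
from contextlib import contextmanager


@contextmanager
def actor_system(factory):
    """Yield the actor system built by `factory` and always shut it down.

    `factory` is any zero-argument callable returning an object with a
    shutdown() method, for example:
        lambda: ActorSystem("multiprocTCPBase", transientUnique=True)
    """
    asys = factory()
    try:
        yield asys
    finally:
        # Runs on normal exit and on exceptions (though not after kill -9).
        asys.shutdown()


def run_hello(hello_class):
    """Hypothetical driver: create one actor, ask it something, clean up."""
    from thespian.actors import ActorSystem  # lazy import, needs thespian

    def make():
        return ActorSystem("multiprocTCPBase", transientUnique=True)

    with actor_system(make) as asys:
        hello = asys.createActor(hello_class)
        return asys.ask(hello, "hi", 1)
```

Wrapping the shutdown this way keeps the cleanup in one place, so an exception in the middle of an experiment is less likely to leave an Admin running with stale class definitions.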

kquick avatar Apr 12 '22 01:04 kquick

Amazing! I will have at it with your suggestions and reply when I know more!

s-t-a-n avatar Apr 12 '22 14:04 s-t-a-n

All right, so I figured out what was going wrong:

Indeed, I was not properly shutting down the actor framework. It seems that killing all Python processes wasn't enough: even with no processes running, it would still choke with the same error message. This was likely because of a hard limit of 1024 open files on my system (I found the doc section about that) and/or too many TCP connections left half open.

After some rebooting and applying suggestion 1, the example ran snappily. The secondary problem was that I had miscalculated the overhead of dumping 100K messages to actors on startup as a way to bootstrap them with a dataset. After reducing the number of actors to those strictly required, with proper initialization (and frankly, rethinking my setup was fair in itself), everything is snappy. The speed and effortless parallelism/distribution is stellar!

With the constant testing I am doing right now, suggestion 2 would obscure problems, but it might come in handy for certain deployment scenarios.
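For anyone wanting to check the open-files limit mentioned here, the standard library's `resource` module (Unix only) reports it for the current process. This is a minimal sketch; the function name `open_file_limits` is made up, and reading the limit does not by itself confirm that it caused the error above.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"open files: soft={soft}, hard={hard}")
```

Processes started from this one normally inherit these limits, so a low soft limit is worth knowing about before creating many actors.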

I can't wait to start using suggestion 3, this is outstanding functionality! Out of scope for now.

It might be nice to add a common pitfalls section to the documentation. If you point me to the place this could live, I'll gladly help with that. The examples provided are very clear. However, they say what you should do and omit what you shouldn't do, like starting thousands of actors or sending thousands of messages at once. I've also come to realize that I really should debug and profile my code outside the actor framework. This is fully reasonable in any scenario, but it could be nice to add to the quickstart guidelines (for hotheads like me who dive in without measuring the water height). The documentation generally is amazing!

I thank you a lot for your answer and contribution to this project! If you like the ideas I mentioned about common pitfalls I'll gladly put in a pr, otherwise, this issue can be closed.

s-t-a-n avatar Apr 19 '22 17:04 s-t-a-n

I'm glad to hear you got things working, and I really appreciate the feedback! It's nice to know things work well and the documentation is useful, and I'm definitely interested in your suggestions for improvement.

Would https://thespianpy.com/doc/using.html#outline-container-hH-bb3655d6-66df-42d5-9486-e81c8687e9d6 (or a subsection of that) be a good place to document the "common pitfalls"? If so, I'd welcome PR submissions to that section (https://github.com/kquick/Thespian/blob/master/doc/using.org#guidelines-for-actors-and-their-messages), or if you feel they belong elsewhere just let me know.

kquick avatar Apr 19 '22 18:04 kquick