piker icon indicating copy to clipboard operation
piker copied to clipboard

`tractor` bug: serialzation error isn't propagated?!

Open goodboy opened this issue 3 years ago • 1 comments
trafficstars

Guilty line of code when using msgpack in tractor.

Super odd, and pretty sure @iamzoltan found this before but, basically a serialization error due to using msgpack (instead of msgspec) and sending a set causes a crash in tractor.Context.started() but this error isn't propagated upward to the calling actor (emsd in this case).

Instead we get the msg error but the contents of that message aren't an error and instead contain a return?

Dec 14 11:39:42 (emsd, 8459, _emsd_main)) [PDB] tractor._debug _debug.py:529
Attaching to pdb in crashed actor: ('emsd', 'e371c0eb-bf2f-4e79-942f-a5ed5a5eff9a')

Traceback (most recent call last):
  File "/home/goodboy/repos/tractor/tractor/trionics/_mngrs.py", line 172, in maybe_open_context
    value = cache.values[key]
KeyError: ('binance', 'ethusdt')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/goodboy/repos/tractor/tractor/_actor.py", line 186, in _invoke
    await chan.send({'return': await coro, 'cid': cid})
  File "/home/goodboy/repos/piker/piker/clearing/_ems.py", line 1023, in _emsd_main
    async with (
  File "/usr/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/goodboy/repos/piker/piker/clearing/_ems.py", line 339, in maybe_open_brokerd_trades_dialogue
    relay = await self.nursery.start(
  File "/home/goodboy/repos/piker/39/lib/python3.9/site-packages/trio/_core/_run.py", line 1041, in start
    task._eventual_parent_nursery = self
  File "/home/goodboy/repos/piker/39/lib/python3.9/site-packages/trio/_core/_run.py", line 815, in __aexit__
    raise combined_error_from_nursery
  File "/home/goodboy/repos/piker/piker/clearing/_ems.py", line 430, in open_brokerd_trades_dialogue
    async with (
  File "/usr/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/goodboy/repos/piker/piker/clearing/_paper_engine.py", line 519, in open_paperboi
    async with portal.open_context(
  File "/home/goodboy/repos/piker/39/lib/python3.9/site-packages/async_generator/_util.py", line 34, in __aenter__
    return await self._agen.asend(None)
  File "/home/goodboy/repos/tractor/tractor/_portal.py", line 428, in open_context
    first = msg['started']
KeyError: 'started'

And here is the msg contents:

(Pdb++) msg
{'return': None, 'cid': '19c78c86-4ded-4de0-9cb5-9905ffd74325'}

Not sure what the heck is going on but simple tests introduced to tractor don't seem to expose this problem which makes me think it's something to do with a race in ContextCancelled being raised in the target task and/or the wrong error being capture and set as Context._errror? it's that or some kinda wacky task-context id mixup is happening.

Needs some deep debugging inside _invoke() likely...

goodboy avatar Dec 14 '21 17:12 goodboy

Boom got it.

It's because we call await ctx.cancel() inside a finally:...from the containing open_feed(): block...

Removing this makes the serialization error propagate. Need to make a test for this in tractor..

Still no idea how to reproduce with unit test 😂

It seems to be some kind of race where the .started() call errors but the error is rewrapped and discarded during cancellation processing triggered by the calling actor? I'm lost how this is possible atm..

goodboy avatar Dec 14 '21 17:12 goodboy