Flakey test: Run SDK-Python integration tests / Features integration test alwaysSuspending => Combinators => Test the first successful awakeable should be returned
At https://github.com/restatedev/restate/pull/3166 , CI report:
✅ alwaysSuspending => SleepWithFailures => sleepAndKillServiceEndpoint(Client, ContainerHandle)
❌ alwaysSuspending => Combinators => Test the first successful awakeable should be returned
✅ alwaysSuspending => Combinators => Test awakeable or timeout using AwaitAwakeableOrTimeout
✅ alwaysSuspending => Combinators => Test awakeable or timeout using Await any
==== alwaysSuspending results
* Succeeded tests: 38 / 39
logs at:
sdk-python-integration-test-report.zip
dig into the log, find:
2025-04-16T08:06:02.410331Z WARN restate_invoker_impl
Invocation error, retrying in 122ms 112us 144ns.
error: Got a (transient) error from the service while processing invocation.
[500 Internal] TypeError('to_durable_future.<locals>.side_effect() takes 1 positional argument but 4 were given').
> This error originated after executing Command: Run(named "run should fail command")[command index 6]
restate.invocation.id: inv_1hgVUmTueTje6jP81ATNyhhahOn4QAEODA
restate.invocation.target: VirtualObjectCommandInterpreter/168fbfe6-666d-4a2f-bb5c-0b2747a6b211/interpretCommands
restate.error.code: RT0007
Visit https://docs.restate.dev/references/errors#RT0007 for more info and suggestions.
restate.invocation.error.stacktrace:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/restate/server_context.py", line 424, in create_run_coroutine
action_result = typing.cast(T, await asyncio.to_thread(action))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: to_durable_future.<locals>.side_effect() takes 1 positional argument but 4 were given
Thanks for reporting this issue @lsytj0413. There is another test run where the same test fails: https://github.com/restatedev/restate/actions/runs/14488849773/job/40642088210 with the same error.
What's interesting is that the test fails not in all profiles and also not deterministically which I would have expected if a TypeError occurs.
I think the TypeError is red herring. The underlying problem seems to be that in some cases, we take roughly 10s to start processing the incoming invocations. If this happens, then we hit the 10s default timeout defined by the tests. The problem seems to be that in some cases we fail to fetch the metadata on the first attempt. If this happens, then we wait for the next tick which is 10s later to refresh the metadata. Only then the cluster becomes operational. I think that the first attempt can fail if the read request arrives at the replicated metadata store before the commit index has been retrieved. That's why the problem doesn't occur always. A simple solution can be to also update the metadata in the MetadataManager when provisioning the cluster.
For fixing the test code, I've created https://github.com/restatedev/sdk-python/issues/78.
The failed KafkaIngress test here seems to show the same symptoms: https://github.com/restatedev/restate/actions/runs/14511275278/job/40711599370.
Looks like another instance of this https://github.com/restatedev/sdk-python/actions/runs/14578535130/job/40889927890