restate icon indicating copy to clipboard operation
restate copied to clipboard

Flakey test: Run SDK-Python integration tests / Features integration test alwaysSuspending => Combinators => Test the first successful awakeable should be returned

Open lsytj0413 opened this issue 8 months ago • 6 comments

At https://github.com/restatedev/restate/pull/3166 , CI report:

✅ alwaysSuspending => SleepWithFailures => sleepAndKillServiceEndpoint(Client, ContainerHandle)
❌ alwaysSuspending => Combinators => Test the first successful awakeable should be returned
✅ alwaysSuspending => Combinators => Test awakeable or timeout using AwaitAwakeableOrTimeout
✅ alwaysSuspending => Combinators => Test awakeable or timeout using Await any
==== alwaysSuspending results
* Succeeded tests: 38 / 39

logs at:

sdk-python-integration-test-report.zip

dig into the log, find:

2025-04-16T08:06:02.410331Z WARN restate_invoker_impl
  Invocation error, retrying in 122ms 112us 144ns.
    error: Got a (transient) error from the service while processing invocation.
      [500 Internal] TypeError('to_durable_future.<locals>.side_effect() takes 1 positional argument but 4 were given').
      > This error originated after executing Command: Run(named "run should fail command")[command index 6]
    restate.invocation.id: inv_1hgVUmTueTje6jP81ATNyhhahOn4QAEODA
    restate.invocation.target: VirtualObjectCommandInterpreter/168fbfe6-666d-4a2f-bb5c-0b2747a6b211/interpretCommands
    restate.error.code: RT0007
    Visit https://docs.restate.dev/references/errors#RT0007 for more info and suggestions.
    restate.invocation.error.stacktrace:
      Traceback (most recent call last):
        File "/usr/local/lib/python3.12/site-packages/restate/server_context.py", line 424, in create_run_coroutine
          action_result = typing.cast(T, await asyncio.to_thread(action))
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/asyncio/threads.py", line 25, in to_thread
          return await loop.run_in_executor(None, func_call)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 59, in run
          result = self.fn(*self.args, **self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      TypeError: to_durable_future.<locals>.side_effect() takes 1 positional argument but 4 were given

lsytj0413 avatar Apr 17 '25 01:04 lsytj0413

Thanks for reporting this issue @lsytj0413. There is another test run where the same test fails: https://github.com/restatedev/restate/actions/runs/14488849773/job/40642088210 with the same error.

tillrohrmann avatar Apr 17 '25 09:04 tillrohrmann

What's interesting is that the test fails not in all profiles and also not deterministically which I would have expected if a TypeError occurs.

tillrohrmann avatar Apr 17 '25 09:04 tillrohrmann

I think the TypeError is red herring. The underlying problem seems to be that in some cases, we take roughly 10s to start processing the incoming invocations. If this happens, then we hit the 10s default timeout defined by the tests. The problem seems to be that in some cases we fail to fetch the metadata on the first attempt. If this happens, then we wait for the next tick which is 10s later to refresh the metadata. Only then the cluster becomes operational. I think that the first attempt can fail if the read request arrives at the replicated metadata store before the commit index has been retrieved. That's why the problem doesn't occur always. A simple solution can be to also update the metadata in the MetadataManager when provisioning the cluster.

tillrohrmann avatar Apr 17 '25 13:04 tillrohrmann

For fixing the test code, I've created https://github.com/restatedev/sdk-python/issues/78.

tillrohrmann avatar Apr 17 '25 13:04 tillrohrmann

The failed KafkaIngress test here seems to show the same symptoms: https://github.com/restatedev/restate/actions/runs/14511275278/job/40711599370.

tillrohrmann avatar Apr 17 '25 13:04 tillrohrmann

Looks like another instance of this https://github.com/restatedev/sdk-python/actions/runs/14578535130/job/40889927890

slinkydeveloper avatar Apr 22 '25 07:04 slinkydeveloper