dagger icon indicating copy to clipboard operation
dagger copied to clipboard

Flake from `returnError` failed to get requester session: session not found

Open jedevc opened this issue 10 months ago • 4 comments

Upon occasion, since we merged https://github.com/dagger/dagger/pull/8442, with returnError, it looks like we've started occasionally flaking:

https://github.com/dagger/dagger/actions/runs/13261220656/job/37018061021?pr=9530#step:5:4951

1   : check --targets=sdk/go
1   : [1m8s] | Error: failed to serve module: input: moduleSource.withContextDirectory.asModule failed to create module: select: failed to update module dependencies: failed to initialize dependency modules: failed to initialize dependency module: select: failed to create module: select: failed to update codegen and runtime: failed to generate code: failed to call sdk module codegen: select: process "/runtime" did not complete successfully: exit code: 2
1   : [1m8s] | 
1   : [1m8s] | Stdout:
1   : [1m8s] | failed to return error: input: currentFunctionCall.returnError failed to get requester session: session for "4582nob2pkiberrrwoc8ph6p0" not found
1   : check --targets=sdk/go ERROR [1m8s]
1   : ! exit code 2

Not really 100% sure what's happening here, I'm not really sure how the session could have been cleaned up? Maybe somehow we're in a state where the job has been cancelled, and the session has been torn down, but the WithExec is still running? (kind of a guess based on the fact that I see this only on jobs that have a tendency to get cancelled anyways). If that's the case, an error makes sense, but this doesn't seem like the right one.

cc @vito any other ideas for this?

jedevc avatar Feb 11 '25 11:02 jedevc

@vito mentions:

Maybe the ReturnError call used a different ctx so the ReturnError request itself doesn't get interrupted, but the session still got torn down anyway

jedevc avatar Feb 13 '25 14:02 jedevc

Couple more recent occurrences:

  • https://github.com/dagger/dagger/actions/runs/13427524763/job/37513152438#step:5:7108
  • https://github.com/dagger/dagger/actions/runs/13447234632/job/37575134036#step:5:7239

This could be 100% illusory, but for whatever reason it feels like this happens much more often on main runs than PRs. I rarely see it in my PRs but fairly often on main. If that's actually true, I have no explanation for how that could be right now, but seems worth noting.

sipsma avatar Feb 21 '25 00:02 sipsma

It looks like the session is disappearing and then that also causes the returnError call to fail with the same error:

Image

  • https://v3.dagger.cloud/dagger/traces/b80ff1a6f9958981f72db19c72cc0191?span=ec1484c685e45b00

The way returnError works is we first construct an Error value via error(message: ...) and then pass it to currentFunction.returnError(error: ErrorID!). So for us to be constructing error with that string, that error must have happened already during the regular function call.

Side note: returnError as an implementation detail works by streaming files around, which @sipsma suggested cleaning up in https://github.com/dagger/dagger/pull/8442#discussion_r1797353309 - but I think even with that cleanup we'd still be hitting that session not found error (just returned more cleanly).

So, still not sure why the session is disappearing in the first place.

vito avatar Feb 21 '25 19:02 vito

Something interesting just happened locally possibly related to this.

I just had a compile error in my engine code, but the end of the progress output for building the engine looked like this:

│ ∅ Container.asTarball(forcedCompression: Zstd): File! 0.1s
│ ! container image to tarball file conversion failed: failed to export: no active session for 6gx5eyq4mk6a2frp8vc9t0hai: context canceled
│ │ ✔ export layers 0.1s

Error logs:

✘ Container.withExec(args: ["go", "build", "-o", "bin/engine", "-ldflags", "-s -w -X github.com/dagger/dagger/engine.Version=v0.16.2-010101000000-dev-a61626cdb9f9 -X github.c
om/dagger/dagger/engine.Tag=6a2c1b9bf9aa4ba124d5f7fdb3517e1683c26259", "./cmd/engine"]): Container! 2.6s
# github.com/dagger/dagger/core                                                                                                                                               
core/interface.go:331:4: unknown field ImpurityReason in struct literal of type dagql.FieldSpec                                                                               
! process "go build -o bin/engine -ldflags -s -w -X github.com/dagger/dagger/engine.Version=v0.16.2-010101000000-dev-a61626cdb9f9 -X github.com/dagger/dagger/engine.Tag=6a2c1
b9bf9aa4ba124d5f7fdb3517e1683c26259 ./cmd/engine" did not complete successfully: exit code: 1

That Container.asTarball step got canceled just because of the build error, but it reported it's error as no active session... Not the exact same as failed to get requester session, but similar enough to raise an eyebrow.

cc @vito, not totally sure what to make of this but wondering if context cancellation somewhere else is resulting in the failed to get requester session to show up, and maybe for whatever reason we end up only seeing that error? Or the "real" error is somewhere else but getting hidden from progress?

sipsma avatar Feb 22 '25 02:02 sipsma

This happened to us today with engine v0.16.1 in our CI

Dagger Cloud URL: https://v3.dagger.cloud/CASTAI/traces/bdef6ab72fd64bab36fd8e4ddd90768a?span=5c7d2aaf8f5fad21

failed to return error: input: currentFunctionCall.returnError failed to get requester session: session for "jexqvofz6dv7gracrefxgfv7t" not found

logs from engine for the session:

2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.343647956Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="finished exporting cache ref 7eohatd7p70fdl1xl4xjjftyy in 15.957308ms" |  
-- | -- | --
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.343654836Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting cache ref schqjyoscg756hghjj9n1t4wl" |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.343684696Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="getting remotes for cache ref schqjyoscg756hghjj9n1t4wl" |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.346465636Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting metrics to clients" clients=1 |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.434431659Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="GetContentHashKey reusing ref wlfuup1scox5p07uy2udljbot with digest sha256:e83c3740652a83d9732dedb7f9e7cff3ebaa2258dee22e4e493c329bbb9c67f7" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t session_id=upj8tq47ycru3l30e2wl3brxp spanID=730dce48b1468ff4 traceID=bdef6ab72fd64bab36fd8e4ddd90768a |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.436474728Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.438060528Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.440570248Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.499 | 2025-04-02T08:17:09.442875108Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.444941818Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.446763518Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.448803747Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.450165687Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting bytes" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t export_path=/.daggermod/output.json session_id=upj8tq47ycru3l30e2wl3brxp spanID=4e333bfd3d16159e traceID=bdef6ab72fd64bab36fd8e4ddd90768a |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.450766017Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.452290437Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t contentType=application/json method=POST path=/query session_id=upj8tq47ycru3l30e2wl3brxp span=5c7d2aaf8f5fad21 spanID=5c7d2aaf8f5fad21 trace=bdef6ab72fd64bab36fd8e4ddd90768a traceID=bdef6ab72fd64bab36fd8e4ddd90768a upgradeHeader= |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.453613457Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting bytes" client_hostname=buildkitsandbox client_id=jexqvofz6dv7gracrefxgfv7t export_path=/.daggermod/error session_id=upj8tq47ycru3l30e2wl3brxp spanID=23566e8d042c0b56 traceID=bdef6ab72fd64bab36fd8e4ddd90768a |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.458392437Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="container done" args="[/.init /runtime]" call_id="xxh3:807291e9fa3afd32" caller_client_id=rw8rezhl2letrzls0w32q0kn3 error="exit code: 2" id=dby4f56jys329wm2temkn90c0 nested_client_id=jexqvofz6dv7gracrefxgfv7t span=/runtime spanID=44d1798e7197f9c5 traceID=bdef6ab72fd64bab36fd8e4ddd90768a |  
  |   | 2025-04-02 08:17:09.500 | 2025-04-02T08:17:09.458424486Z stderr F time="2025-04-02T08:17:09Z" level=error msg="executor run" error="exit code: 2" span=/runtime spanID=44d1798e7197f9c5 traceID=bdef6ab72fd64bab36fd8e4ddd90768a |  
  |   | 2025-04-02 08:17:09.749 | 2025-04-02T08:17:09.500719113Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting metrics to clients" clients=1 |  
  |   | 2025-04-02 08:17:09.749 | 2025-04-02T08:17:09.555587928Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="handling http request" client_hostname=dagger-engine-ttltr client_id=kq3qmhl03j7zffn4xm4t509rk contentType= method=GET path=/sessionAttachables session_id=i5zzy0i661u0k5w8epa8j0g0g span=6463b059c40d97c2 spanID=6463b059c40d97c2 trace=f2258af756a74f5f359726080d7c7f37 traceID=f2258af756a74f5f359726080d7c7f37 upgradeHeader= |  
  |   | 2025-04-02 08:17:09.749 | 2025-04-02T08:17:09.558224258Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting metrics to clients" clients=1 |  
  |   | 2025-04-02 08:17:09.749 | 2025-04-02T08:17:09.612470333Z stderr F time="2025-04-02T08:17:09Z" level=debug msg="exporting metrics to clients" clients=1

aweris avatar Apr 02 '25 08:04 aweris

Have a fix that's reasonably likely to be the main root cause for this here: https://github.com/dagger/dagger/pull/10162

sipsma avatar Apr 15 '25 01:04 sipsma

This does seem to have done it, I'm no longer seeing this flake!

jedevc avatar May 08 '25 11:05 jedevc