dagger Fix secret transfer issues between modules

This PR contains a few fixes, extracted out of #8149, and others.

Added two changelogs, since there are two distinct fixes here, but not worth splitting them out into separate PRs, since they're both part of the same effort here :laughing:

Both issues were introduced in https://github.com/dagger/dagger/pull/7804.

There's a rather nasty bug with // +private fields (and the equivalent in python/typescript). If an ID is stored in there, we have no way of knowing it's there.

As discussed in discord, it's not feasible to expose each private field to the API - there may be data in there that cannot be exposed to the API (e.g. today maps are a big case), but there may definitely be others. For backwards-compat reasons, it's probably way too much effort to try and expose the structure of those private fields - we've made our bed, now we have to lie in it.

This PR is a draft idea to do a "best-effort" detection - we look through the returned JSON objects, and scan for things that look enough like IDs - then we try and grab secrets and sockets from there. This resolves the issue, but I'm also not entirely convinced it's the right approach - it feels very hacky, and I'm somewhat worried we might misinterpret something as an ID when it shouldn't be.

Some alternative ideas I've considered (ranging from worse->better, easier->harder):

Update all IDs to have a prefix - something like id: - this hopefully reduces false positives, but also strings could legitimately have this prefix.
Don't encode IDs passed to modules as strings, encode them as a special object, something like { "_dagger.id": "<id>" }. Again, we're just reducing false positives, modules could definitely still produce these.
Dump all the IDs that we know about to a side-channel (any time we get an ID, record it, then when we're traversing fields, we know exactly what we're looking for)
Roll our own module interchange format (or find one) that allows for typed data. This is very tricky, since it needs to be supported in each language natively.

Also, we have a caching bug, described in more detail here, which we also fix here as well.

Sep 06 '24 11:09 jedevc

Hm, even the // +private removal doesn't solve everything.

Turns out, we can very easily hit this case: https://github.com/dagger/dagger/pull/7804#issuecomment-2216493990 (and I've pushed a test for it in https://github.com/dagger/dagger/pull/8358/commits/1b8c1758faea4171b0d44c02b8313d6d1dfcb88b). We incorrectly assume that if a call is cached (and so setSecret never ran), then it must already be in our secret store.

This isn't the case in the following scenario (which I do in the test):

Client 1 calls Foo and Bar functions in module Test (in parallel, in serial, it doesn't matter)
Client 2 (during evaluation of Foo) calls function Mount in module Dep, which in turn creates Client 3, calls setSecret and returns the Secret. After evaluation, we inspect the ID, and load the secret from Client 3 into Client 2.
Client 4 (during evaluation of Bar) calls function Mount again, but this time it's cached - no setSecret happens. We still get the same ID though, but when we try and load the secret, it doesn't exist in Mounts client (Client 5, which never gets used).

This is very fiddly, and I'm not 100% sure what the fix to this looks like.

Even if we actually perform the ID load as suggested in https://github.com/dagger/dagger/pull/7804#issuecomment-2234090051 (using a different dagql server, and correctly mock a fake client so that we can actually correctly load the secret into Client 5, we hit the issue that setSecret isn't even in the ID, so we'd never actually insert it into Client 5 (we'd just be trying to access it, which would now fail, since it hadn't been created in this secret store). I think as long as we keep this separation where we return a secret instance from setSecret, this isn't fixable using that technique.

Another option (this one might work :tada:), could be to have a Call additionally return the client ID that it actually used. This is neat, because now when we cached, we would ignore the randomly generated ID for Client 5, and could instead get the ID of Client 4 and load secrets from there. This still doesn't solve the generalized caching of function calls #7428, but gets us over this hurdle temporarily.

However, it feels to me that if we want to solve the general caching for #7428, we're going to need to make setSecret "super-impure" (name tbd :laughing:) - if a function calls it, then we cannot cache that result. This is a bit trickier than impure currently is, because it passes on it's impurity, and we don't know if a function did call this until after it ran. This makes some amount of sense to me - e.g. if a Secret gets cached without a corresponding setSecret anywhere, we can't re-evaluate anything that depends on it - you then get into this weird state where re-builds may work without a valid secret, or may not, entirely dependent.

Gonna have a bit more of a think about this, I'm leaning towards the final option, since I think we're gonna need to do it anyways one day.

Sep 12 '24 11:09 jedevc

Another option (this one might work 🎉), could be to have a Call additionally return the client ID that it actually used. This is neat, because now when we cached, we would ignore the randomly generated ID for Client 5, and could instead get the ID of Client 4 and load secrets from there.

This turns out to be very easy to implement, with our new worker setup - we just write a special meta file to record this.

We also need to disable dagql-level caching of function calls, which is sad, but we need more complex logic to handle the secret transfers, and if we keep this enabled, then we have no-where to put it.

This is a bit trickier than impure currently is, because it passes on it's impurity, and we don't know if a function did call this until after it ran.

This is fiddly. I don't really have any good ideas for how we would even do this - so deferred this to some generic point in the future.

Sep 13 '24 11:09 jedevc