opa
opa copied to clipboard
decision log should include values from non-deterministic built-in functions like time.now_ns
Decision log events should include all of the information necessary to re-execute policy evaluation. Today if policies depend on time.now_ns() or other (custom) non-deterministic built-in functions, it is not easy to accurately re-execute policy evaluation. We should extend the decision log event structure to include additional values supplied to or computed during policy evaluation that are not contained in the input, policy, or contextual data (e.g., time.now_ns() result, opa.runtime() result, http.send() result, etc.)
Built-in Functions with non-deterministic outputs
-
rand.intn
(relies on RNG) - ~
io.jwt.encode_sign
,io.jwt.encode_sign_raw
(relies on RNG internally)~ - ~
io.jwt.decode_verify
(relies on time internally)~ -
time.now_ns
(relies on time directly) -
http.send
(unpredictable HTTP request results) -
net.lookup_ip_addr
(unpredictable DNS resolution results) -
opa.runtime
(unpredictable config/environment-dependent results) -
uuid.rfc4122
(relies on RNG)
Based on a cursory look through the built-ins list, these are the functions I found that have obvious non-deterministic behavior. The rest should be deterministic.
FWIW the RNG source should come from the BuiltinContext. We've had to bend io.jwt.encode_sign
and _raw
to use it, though: https://github.com/open-policy-agent/opa/commit/e732b0b4823490a171f347560f1f4e13c509bd44. But controlling that alone doesn't give us completely reproducible signatures, since there's an uncontrollable-random step that will either discard or use the first byte. (See this comment) The code path for this is Sign -> randutil.MaybeReadByte
Interesting. I guess it makes sense for consistency, but do users really want reproducible token signing? Would that not entail having their private keys logged with each decision? 🤔
Hah, good point. I guess it would make sense to not worry about the io.jwt.encode_sign[_raw]
builtins too much.
But thinking about this again, we also can't ever control http.send
and net.lookup_ip_addr
's inputs (those being outside servers) -- so, reading the top issue again, the proper way to go here might be to record the outcomes, and provide ways to stub those out for decision log replay. For that, the RNG doesn't matter as much -- we need to know what the result was -- and then io.jwt.encode_sign[_raw]
aren't so different from the other builtins @philipaconrad has collected there.
Agreed, persisting the output of those rather than the input seems like a better approach 👍
I'm considering a simple K/V cache for implementing aspects of this. Do we want a max size restriction on the cache? I'd assume that if someone enables recording non-deterministic builtin results that memory consumption isn't exactly a great concern, since you wouldn't turn it on unless you wanted to get a potentially large pile of results. :thinking:
Also, because this feature will potentially result in massive additional memory consumption (and equally massive decision logs in some cases!), I am planning to gate this feature of decision logging behind a config file option, like decision.record_non_deterministic_builtin_results = true/false
, for example.
Would it be that much, though? One time ns integer if time us used, one string for each encode_sign call... I suppose it would go into the decision logs entry, so we don't need to hold onto it anywhere else. 🤔
IIRC decision logs are flushed when their buffer is full, and that would then occur more often...?
Hmm http.send is different, though. Those response payloads can be very large.
Hmm http.send is different, though. Those response payloads can be very large.
Right, I'm thinking of the cases where a user hits several unique webpages.
IIRC decision logs are flushed when their buffer is full, and that would then occur more often...?
This would definitely increase how often those buffers flush, to such a degree that I think it's worth feature-gating behind a config option, with the default being to leave it turned off. :cold_sweat:
My current plan (I'm copying this over from the design doc) is to only cache the first result from a non-deterministic builtin. That at least bounds how much memory we're burning to cache stuff. The main builtins that might be affected by this behavior are probably http.send
, net.lookup_ip_addr
, and possibly opa.runtime()
, since those all could potentially have non-idempotent results.
I think this is a reasonable, scope-limiting behavior decision because if a single webpage gives back different results each time it's queried with http.send
, there's no way to replicate that in a decision replay down the road unless you're recording every result from http.send
, and then we're basically just doing selective tracing with more steps, and will still have to account for cases where the user queries that webpage one more time than we have recorded results for. :slightly_frowning_face:
because if a single webpage gives back different results each time it's queried with http.send,
I don't think we'd notice 🙃 We shouldn't do two requests for
p {
obj := {"method": "GET", "url": "https://example.com"}
http.send(obj)
http.send(obj)
}
so intra-query caching should make us not send two requests here (even if it was different rules etc; same input to http.send
implies only one result, only one request sent. -- here's the code, I think
Same for net.lookup_ip_addr
, see this.
Well, that's perfect then! Since we already have caching in place for these things, it's mostly going to be an exercise in plumbing work to get those cached values up to where they need to go for decision logging.
Is this fixed already? I thought the merged PR was the first step....
Ah, nuts. I accidentally left in the Fixes #1514
in the PR, because when it was first written, the plan had been to go all the way to decision logging integration. Yeah, let's reopen this one.
We'll put this zombie issue back to rest when decision logging integrates the non-deterministic builtin cache work from #4926. :zombie: