opa decision log should include values from non-deterministic built-in functions like time.now

decision log should include values from non-deterministic built-in functions like time.now_ns

Open tsandall opened this issue 5 years ago • 16 comments

Decision log events should include all of the information necessary to re-execute policy evaluation. Today if policies depend on time.now_ns() or other (custom) non-deterministic built-in functions, it is not easy to accurately re-execute policy evaluation. We should extend the decision log event structure to include additional values supplied to or computed during policy evaluation that are not contained in the input, policy, or contextual data (e.g., time.now_ns() result, opa.runtime() result, http.send() result, etc.)

Jun 21 '19 15:06 tsandall

Built-in Functions with non-deterministic outputs

rand.intn (relies on RNG)
~io.jwt.encode_sign, io.jwt.encode_sign_raw (relies on RNG internally)~
~io.jwt.decode_verify (relies on time internally)~
time.now_ns (relies on time directly)
http.send (unpredictable HTTP request results)
net.lookup_ip_addr (unpredictable DNS resolution results)
opa.runtime (unpredictable config/environment-dependent results)
uuid.rfc4122 (relies on RNG)

Based on a cursory look through the built-ins list, these are the functions I found that have obvious non-deterministic behavior. The rest should be deterministic.

Jul 18 '22 23:07 philipaconrad

FWIW the RNG source should come from the BuiltinContext. We've had to bend io.jwt.encode_sign and _raw to use it, though: https://github.com/open-policy-agent/opa/commit/e732b0b4823490a171f347560f1f4e13c509bd44. But controlling that alone doesn't give us completely reproducible signatures, since there's an uncontrollable-random step that will either discard or use the first byte. (See this comment) The code path for this is Sign -> randutil.MaybeReadByte

Jul 19 '22 07:07 srenatus

Interesting. I guess it makes sense for consistency, but do users really want reproducible token signing? Would that not entail having their private keys logged with each decision? 🤔

Jul 19 '22 08:07 anderseknert

Hah, good point. I guess it would make sense to not worry about the io.jwt.encode_sign[_raw] builtins too much.

But thinking about this again, we also can't ever control http.send and net.lookup_ip_addr's inputs (those being outside servers) -- so, reading the top issue again, the proper way to go here might be to record the outcomes, and provide ways to stub those out for decision log replay. For that, the RNG doesn't matter as much -- we need to know what the result was -- and then io.jwt.encode_sign[_raw] aren't so different from the other builtins @philipaconrad has collected there.

Jul 19 '22 09:07 srenatus

Agreed, persisting the output of those rather than the input seems like a better approach 👍

Jul 19 '22 12:07 anderseknert

I'm considering a simple K/V cache for implementing aspects of this. Do we want a max size restriction on the cache? I'd assume that if someone enables recording non-deterministic builtin results that memory consumption isn't exactly a great concern, since you wouldn't turn it on unless you wanted to get a potentially large pile of results. :thinking:

Jul 21 '22 16:07 philipaconrad

Also, because this feature will potentially result in massive additional memory consumption (and equally massive decision logs in some cases!), I am planning to gate this feature of decision logging behind a config file option, like decision.record_non_deterministic_builtin_results = true/false, for example.

Jul 21 '22 16:07 philipaconrad

Would it be that much, though? One time ns integer if time us used, one string for each encode_sign call... I suppose it would go into the decision logs entry, so we don't need to hold onto it anywhere else. 🤔

IIRC decision logs are flushed when their buffer is full, and that would then occur more often...?

Jul 21 '22 17:07 srenatus

Hmm http.send is different, though. Those response payloads can be very large.

Jul 21 '22 17:07 srenatus

Hmm http.send is different, though. Those response payloads can be very large.

Right, I'm thinking of the cases where a user hits several unique webpages.

IIRC decision logs are flushed when their buffer is full, and that would then occur more often...?

This would definitely increase how often those buffers flush, to such a degree that I think it's worth feature-gating behind a config option, with the default being to leave it turned off. :cold_sweat:

Jul 21 '22 18:07 philipaconrad

My current plan (I'm copying this over from the design doc) is to only cache the first result from a non-deterministic builtin. That at least bounds how much memory we're burning to cache stuff. The main builtins that might be affected by this behavior are probably http.send, net.lookup_ip_addr, and possibly opa.runtime(), since those all could potentially have non-idempotent results.

I think this is a reasonable, scope-limiting behavior decision because if a single webpage gives back different results each time it's queried with http.send, there's no way to replicate that in a decision replay down the road unless you're recording every result from http.send, and then we're basically just doing selective tracing with more steps, and will still have to account for cases where the user queries that webpage one more time than we have recorded results for. :slightly_frowning_face:

Jul 21 '22 18:07 philipaconrad

because if a single webpage gives back different results each time it's queried with http.send,

I don't think we'd notice 🙃 We shouldn't do two requests for

p {
    obj := {"method": "GET", "url": "https://example.com"}
    http.send(obj)
    http.send(obj)
}

so intra-query caching should make us not send two requests here (even if it was different rules etc; same input to http.send implies only one result, only one request sent. -- here's the code, I think

Same for net.lookup_ip_addr, see this.

Jul 21 '22 19:07 srenatus

Well, that's perfect then! Since we already have caching in place for these things, it's mostly going to be an exercise in plumbing work to get those cached values up to where they need to go for decision logging.

Jul 22 '22 19:07 philipaconrad

Is this fixed already? I thought the merged PR was the first step....

Sep 01 '22 18:09 srenatus

Ah, nuts. I accidentally left in the Fixes #1514 in the PR, because when it was first written, the plan had been to go all the way to decision logging integration. Yeah, let's reopen this one.

Sep 01 '22 18:09 philipaconrad

We'll put this zombie issue back to rest when decision logging integrates the non-deterministic builtin cache work from #4926. :zombie:

Sep 01 '22 18:09 philipaconrad

opa opa copied to clipboard

decision log should include values from non-deterministic built-in functions like time.now_ns

Built-in Functions with non-deterministic outputs

opa
opa copied to clipboard