envoy
Add the sample() convenience function in CEL expression builder
Commit Message: add random convenience function in CEL expression builder
Additional Description: Adds the random() function, which allows CEL expressions to call Random::RandomGenerator::random() to generate a random 64-bit unsigned integer.
Risk Level: Medium
Testing: unit tests
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #34805
Fixes istio/istio#50404
Might be more convenient to pass the Bernoulli value as a float instead of providing the uint64 value.
Hmm, there was a bug: if constant folding is enabled, it will inline the function. Thinking of mitigations. Some things that could possibly be done:
- Close this PR and simply provide the random number through an attribute.
- Change the function to be lazily evaluated (I tried this initially, but passing the Random::RandomGenerator at multiple locations became cumbersome; see my other branch).
- Push a change upstream to allow marking functions as not to be inlined.

Also, I'm not sure this function really aligns well with the cel-spec philosophy of functions being "side-effect free".
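To make the constant-folding hazard concrete, here is a language-neutral toy sketch (not Envoy or cel-cpp code; all names are illustrative): a folding pass that assumes every function is pure will evaluate a nullary call once at "compile" time and substitute the result as a literal, freezing the "random" value for all subsequent evaluations.

```python
import random

def constant_fold(expr):
    """Naively fold a call to a supposedly pure nullary function into a literal.

    This mirrors what a CEL optimizer is allowed to do when functions are
    assumed side-effect free: evaluate once, substitute the result.
    """
    fn = expr["fn"]
    return {"lit": fn()}  # evaluated exactly once, at "compile" time

# The expression the user wrote: a fresh random number per evaluation.
expr = {"fn": lambda: random.getrandbits(64)}
folded = constant_fold(expr)

# After folding, every "evaluation" yields the same frozen value.
first = folded["lit"]
second = folded["lit"]
assert first == second
```

This is why lazy evaluation inside Envoy alone is not enough: any compiler earlier in the pipeline may perform the same transformation.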
/retest
/assign @tyxia as codeowner / maintainer
neither of as, codeowner, /, maintainer can be assigned to this issue.
/assign @tyxia
as codeowner / maintainer
I don't think we should disable constant folding (which is a performance optimization) just for this random() feature. If there is a conflict here, then this is the first thing about this PR that we should discuss.
Adding @kyessenov; Kuat's insights would be appreciated here.
FWIW, I do have another branch with lazy loading of functions that doesn't require disabling constant folding.
I think either option (2) or (3) is reasonable. A new attribute only solves a subset of cases: when you want the same consistent random value for all evaluations of a request.
@jcking might have a wider context on how randomness is added to CEL.
It may still be useful to have (1), especially if it is based off of x-request-id in which case you can make a decision consistently across multiple envoy hops. Though, this is not related to this PR or the parent issue.
I'm looking into CEL solutions in other contexts, but I tend to agree that a Bernoulli value sample(p) that returns true with probability p might be easier to use. That solves some constant-folding issues but does not entirely resolve non-deterministic evaluation. We can also make it integer-based using https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/v3/percent.proto#envoy-v3-api-msg-type-v3-fractionalpercent, e.g. sample(p, 10000) where p is between 0 and 10000.
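A minimal sketch of the integer-based variant, assuming FractionalPercent-style numerator/denominator semantics (the function name and signature are the ones proposed above; the Python PRNG stands in for Envoy's Random::RandomGenerator):

```python
import random

def sample(p, denominator=10000):
    """Return True with probability p/denominator.

    Illustrative only: in Envoy this would draw from
    Random::RandomGenerator rather than Python's PRNG.
    """
    if not 0 <= p <= denominator:
        raise ValueError("p must be between 0 and denominator")
    # A uniform draw in [0, denominator) is below p exactly
    # p out of denominator times.
    return random.randrange(denominator) < p

# Edge cases: p == 0 never samples, p == denominator always samples.
assert sample(0) is False
assert sample(10000) is True
```

Note this variant still has the determinism problem discussed below: the draw itself is a side effect, so it only sidesteps folding of the probability arithmetic, not of the call.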
I had an internal discussion with some CEL folks. We think a pure random() function is likely to cause issues in general that won't be solved by lazy functions. CEL expressions generally are pre-processed in multiple places, not just Envoy, and a compiler at any place in the pipeline is free to perform transformations to improve performance. CEL functions are pure and deterministic, so if you have multiple calls to random() they can be elided and/or combined in unexpected ways. So even if you make evaluation assume lazy function in Envoy, the control plane can still make a dangerous transformation.
What we think would work best is to turn this into a true pure function by adding parameters. One example would be a bool function sample(r, p, i) where:
- r is an opaque attribute mapping to the random generator. The user is expected to pass an attribute that is not known ahead of time, e.g. request.
- p is a Bernoulli ratio. For simplicity we can make this a numerator over 10000.
- i is a sampling count. This makes repeated sampling, e.g. in a for loop, explicit. If we only sample once per expression, we can elide this parameter.
Ya, I was thinking something similar. To avoid being too computationally expensive, under the hood we could generate one random number per invocation of a CEL expression (which can be stored in the r attribute) and use it to seed some PRNG to which we pass i.
Edit: I ultimately lack the expertise to do this securely; if I go ahead and try to implement this, I will need some assistance auditing it.
Thinking of an implementation where we generate a random number as a seed before the execution of a CEL expression, apply a cryptographic hash (e.g. sha256) to that seed concatenated with an index, then convert the hash into a probability. So
sample(r, p, i) = (sha256(r.seed + i)[..2] / u64max) < p
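A sketch of this construction (hypothetical names, not Envoy code; the seed bytes stand in for r.seed, and p is taken as a probability in [0, 1]). One assumption to flag: the formula above truncates the digest with "[..2]" but compares against u64max, so this sketch takes an 8-byte prefix, which is what a u64 comparison needs:

```python
import hashlib
import os
import struct

U64_MAX = 2**64 - 1

def sample(seed: bytes, p: float, i: int) -> bool:
    """sample(r, p, i) ~= (sha256(r.seed + i) prefix / u64max) < p.

    seed: per-invocation random bytes standing in for r.seed.
    p:    Bernoulli probability in [0, 1].
    i:    sampling index, so repeated draws within one expression
          evaluation are explicit and deterministic given the seed.
    """
    digest = hashlib.sha256(seed + struct.pack(">Q", i)).digest()
    (value,) = struct.unpack(">Q", digest[:8])  # 8-byte prefix as a u64
    return (value / U64_MAX) < p

seed = os.urandom(16)  # generated once, before the expression executes
# Deterministic within one evaluation: same seed and index, same answer,
# which is exactly what makes the function pure from CEL's point of view.
assert sample(seed, 0.5, 0) == sample(seed, 0.5, 0)
assert sample(seed, 0.0, 1) is False
```

Because the result is a pure function of (seed, p, i), an optimizer folding two identical sample(r, p, i) calls into one is now a correct transformation rather than a bug.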
From a user's point of view, I think this is a good proposal for implementing percentage-based log sampling.
/wait
Probably some optimizations should be considered? E.g.
- [ ] Lazily generating a random number to serve as a seed, to avoid generating one for every CEL expression, even those that don't need it?
- [ ] Keeping a small buffer for a small range of indices, to avoid using the SHA technique when not necessary?
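The first item on the checklist can be sketched as follows (hypothetical structure, not Envoy code): defer seed generation to the first sample() call in an evaluation, so expressions that never sample pay nothing.

```python
import os

class EvaluationContext:
    """Per-evaluation state; the seed is generated lazily on first use."""

    def __init__(self):
        self._seed = None
        self.seeds_generated = 0  # instrumentation for this example only

    @property
    def seed(self) -> bytes:
        if self._seed is None:
            # Only expressions that actually sample pay this cost.
            self._seed = os.urandom(16)
            self.seeds_generated += 1
        return self._seed

ctx = EvaluationContext()
assert ctx.seeds_generated == 0   # no sampling yet, no seed generated
first = ctx.seed                  # a first sample() call would trigger this
second = ctx.seed                 # subsequent calls reuse the cached seed
assert ctx.seeds_generated == 1   # generated exactly once per evaluation
assert first == second
```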
/retest
/retest
Please merge main to fix conflicts. /wait
/retest
@wtzhang23 coverage fail looks real
Yep, need to get the coverage report locally to check. Unless it is already uploaded and I couldn't find the artifact.
/coverage
Coverage for this Pull Request will be rendered here:
https://storage.googleapis.com/envoy-pr/35180/coverage/index.html
The coverage results are (re-)rendered each time the CI envoy-presubmit (check linux_x64 coverage) job completes.
Bump on this? I no longer need this feature and I'd like to get this off my hands. I'm ok with closing this if it is not a good approach (I'm slightly uncomfortable with it to be honest, though I don't know of another way to conform to the CEL spec).
Sorry about the delay. I'm not confident this is the right approach, TBH. I would be more comfortable if this were an extension, but we haven't added an ability to extend the CEL vocabulary. I am worried that adding this without a concrete use case would hurt us long term, especially since you added functions to the top-level request attribute. But I also don't want this work to go to waste. Should we first work out a custom vocabulary in CEL and then provide this as a custom extension instead?
Yes, I agree with your assessment.
/wait
Need some way to make this opt-in.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!