
chore: reapply reverted orjson changes

Open · anonrig opened this issue 1 month ago · 10 comments

Reapplies the orjson changes and squashes them into a single PR.

No rush to merge this. Opening now to keep track of changes.

anonrig · May 20 '24 19:05

Codecov Report

Attention: Patch coverage is 87.58170% with 19 lines in your changes missing coverage. Please review.

Project coverage is 77.85%. Comparing base (f413f1b) to head (ae20c79). Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #71185   +/-   ##
=======================================
  Coverage   77.85%   77.85%           
=======================================
  Files        6532     6532           
  Lines      291038   291056   +18     
  Branches    50359    50362    +3     
=======================================
+ Hits       226584   226601   +17     
+ Misses      58222    58219    -3     
- Partials     6232     6236    +4     
Files Coverage Δ
src/sentry/backup/crypto.py 91.02% <100.00%> (-0.06%) ↓
src/sentry/backup/imports.py 92.69% <100.00%> (-0.04%) ↓
src/sentry/backup/sanitize.py 90.90% <100.00%> (ø)
src/sentry/buffer/redis.py 88.23% <100.00%> (ø)
src/sentry/cache/redis.py 100.00% <100.00%> (ø)
src/sentry/charts/chartcuterie.py 80.76% <100.00%> (ø)
src/sentry/db/models/fields/array.py 100.00% <100.00%> (ø)
src/sentry/db/models/fields/gzippeddict.py 70.00% <100.00%> (ø)
src/sentry/db/models/fields/jsonfield.py 87.77% <100.00%> (ø)
src/sentry/eventstream/snuba.py 85.36% <100.00%> (+0.11%) ↑
... and 26 more

... and 5 files with indirect coverage changes

codecov[bot] · May 20 '24 20:05

I don't believe it's possible to merge this in Sentry's present state. This change passes CI but caused a crashloop in S4S. That seems to be the worst of it, but there could be more subtle misbehavior.

A better approach would be to find hot code paths, upgrade the JSON parser to orjson, add lots of test coverage, and observe it running in S4S.

I don't think these sweeping migrations are safe or wise.

Hi @cmanallen,

Our deployment pipeline did what it was designed to do and caught this bug on S4S without impacting the production environment. Given the size of the migration, I opened several pull requests to land orjson, but due to the lack of test coverage in specific areas, this bug wasn't caught in CI.

We followed a similar pattern when migrating hot paths over the past month (starting April 15; see https://github.com/getsentry/sentry/issues/68903). We first landed the orjson changes behind a feature flag with a slow rollout, which gave us confidence that the changes were safe to land. Only then did we start applying changes directly, which is how this issue was caught.

In this situation, or really in any pull request, I don't think it's possible to guarantee there are no breaking changes. Code coverage even shows that the line that caused this breakage is covered by tests, and the file has 98% test coverage. Reference: https://app.codecov.io/gh/getsentry/sentry/blob/master/src%2Fsentry%2Futils%2Foutcomes.py#L106

FYI, this pull request already includes the fix for the breakage caused by the datetime parser. Evan already landed a fix to the master branch to properly parse ISO-8601 timestamps, and I've opened another pull request to improve it further (ref: https://github.com/getsentry/sentry/pull/71187).
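For illustration, here is a minimal example of the kind of divergence involved (this shows orjson's documented datetime behavior, not the actual fix that landed):

```python
import datetime

import orjson

ts = datetime.datetime(2024, 5, 20, 19, 5, tzinfo=datetime.timezone.utc)

# orjson's default RFC 3339 output uses a numeric UTC offset...
print(orjson.dumps(ts))                           # b'"2024-05-20T19:05:00+00:00"'

# ...while OPT_UTC_Z produces the trailing "Z" that a strict downstream
# parser may expect instead.
print(orjson.dumps(ts, option=orjson.OPT_UTC_Z))  # b'"2024-05-20T19:05:00Z"'
```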

Would you mind helping me understand what the ideal next step is from your point of view?

anonrig · May 20 '24 20:05

@anonrig

My big fear is this:

due to the lack of test coverage in specific areas, this bug wasn't caught in CI.

To compensate, we should focus on changing the smallest amount of code possible that achieves the desired effect. That may mean only updating one file (plus adding test coverage), but that one file may handle millions of requests per minute (or whatever), so the impact is quite large. We can't catch everything (the date issue might not have been caught, for example), but we can at least limit the number of ways we can fail.

I haven't blocked this PR, so you're welcome to override me 😄. I don't plan on reviewing this, simply because I believe the surface area is too large. I can't reasonably be sure it will work as expected, and so I can't vouch for it.

It sounds like you're being diligent in the deploys and I'm glad this was caught in S4S. I'm sure we can get this orjson work out there one way or another 💪

cmanallen · May 20 '24 20:05

To compensate, we should focus on changing the smallest amount of code possible that achieves the desired effect. That may mean only updating one file, but that one file may handle millions of requests per minute (or whatever), so the impact is quite large. We can't catch everything (the date issue might not have been caught, for example), but we can at least limit the number of ways we can fail.

@cmanallen

There are currently 283 occurrences of from sentry.utils import json in the master branch. It is not possible to open 283 pull requests in a timely manner. If we want to roll this out with feature flags, we need a minimum of 7 pull requests per change, which makes a total of 1,981 pull requests. On top of this, we also need to allow 2 days on S4S and a day on production to "100% be sure it doesn't introduce a bug".

FYI: Here's the safest flow:

  1. make the change in the sentry repo
  2. add a feature flag set to 0 in options-automator
  3. set the feature flag to 100 on S4S
  4. set the feature flag to 100 on prod
  5. remove the experiment from sentry
  6. remove the feature flag from options-automator
  7. remove the feature flag from sentry
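To make the experiment step concrete, here is a minimal sketch of a flag-gated code path (the option name and the options API usage are illustrative assumptions, not the actual implementation):

```python
import random

import orjson

from sentry import options  # assumes the sentry.options API
from sentry.utils import json as legacy_json


def dumps(value) -> str:
    # "json.orjson-rollout-rate" is a hypothetical option managed via
    # options-automator, ramped 0 -> 100 following the steps above.
    rate = options.get("json.orjson-rollout-rate")
    if random.random() * 100 < rate:
        # orjson returns bytes; decode to preserve the existing str API.
        return orjson.dumps(value).decode("utf-8")
    return legacy_json.dumps(value)
```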

To see how we ended up with this flow, I recommend looking through the history of https://github.com/getsentry/sentry/issues/68903. (We ran 15+ experiments before having the confidence to stop experimenting.)

To summarize my opinion on this, it is simply not possible to make a change of this size with 100% safety.

anonrig · May 20 '24 20:05

It is not possible to open 283 pull requests in a timely manner

maybe you can chunk it. instead of opening a PR with 283 changes, or opening 283 PRs with 1 change, open N PRs with 283 / N changes. If those chunks are aligned with module structure, it at least reduces the likelihood that you have to deal with 2 orjson-related incidents at the same time.
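As a rough sketch of the module-aligned chunking idea (the helper name and the path layout are illustrative assumptions):

```python
from collections import defaultdict


def chunk_by_module(paths: list[str]) -> dict[str, list[str]]:
    # Group the affected files by top-level sentry module so that each
    # chunk maps to roughly one codeowner and one review request.
    chunks: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        parts = path.split("/")
        # e.g. "src/sentry/monitors/models.py" -> "monitors"
        module = parts[2] if path.startswith("src/sentry/") and len(parts) > 3 else "misc"
        chunks[module].append(path)
    return dict(chunks)


# chunk_by_module(["src/sentry/monitors/models.py", "src/sentry/cache/redis.py"])
# -> {"monitors": ["src/sentry/monitors/models.py"], "cache": ["src/sentry/cache/redis.py"]}
```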

untitaker · May 21 '24 11:05

I'm with @untitaker on this one

would definitely prefer individual PRs, so that at least each stakeholder can individually approve/roll back their own changes?

nhsiehgit · May 21 '24 19:05

I'm with @untitaker on this one

would definitely prefer individual PRs, so that at least each stakeholder can individually approve/roll back their own changes?

How do you recommend splitting the changes? This PR originally combined 3 reverted PRs.

anonrig · May 21 '24 19:05

to be clear, I am only presenting this as an idea; I do not want to dig my heels in and declare that this PR is too risky. ultimately I think the PR author owns the outcome and therefore should also get to decide.

How do you recommend splitting the changes? This PR originally combined 3 reverted PRs.

roughly by folder or codeowner. this also avoids PRs with a lot of review requests, and lots of review requests invite a lot of discussion 😅

untitaker · May 21 '24 19:05

I want to acknowledge that it costs the PR author more effort/time to split up a low-level library change.

However, it is good operational practice to communicate with each product team, merge each PR in slowly, and monitor for errors or unexpected user bug reports over a time window.

#INC-779 is a direct example where a similar change broke existing integrations in a subtle manner. The effort/time cost was offloaded to the product teams, as we had to commit people in the morning to debug and definitively prove that it was the parser causing the problem.

It will build a lot of goodwill if you give each team a heads-up that you're merging a low-level change in their area, so they are aware and can help monitor the rollout.

leedongwei · May 21 '24 20:05

Code-wise, this looks good to me. Chiming in on the general discussion, here is my feedback:

  • You mentioned in a call, and it's also visible in the commits, that you originally had 3 separate PRs, and only one of those was rejected by the S4S staging environment. As most people are complaining about the size of this PR (rightly so, as it touches 60 files), splitting it up into at least the original PRs makes sense, and it should be easier to re-land the PRs that (presumably) did not break anything.
  • Just from intuition, the picklefield might warrant more thorough review / testing.
  • I see you sprinkling the options related to UTC dates and non-str keys in a rather whack-a-mole style.
  • There are both pros and cons to replacing sentry.utils.json with a direct import. The pro is that you can do it one caller at a time. But there is also an advantage to having one central location: you can change the implementation without having to touch 60+ files.

In general, with the advantage of hindsight, I would have approached this whole initiative differently:

  • Test your assumption on an isolated piece of code, which you did, and it provided great improvement! 🎉
  • Change the abstraction to output (json.dumps) fully spec compliant JSON with string keys, and properly formatted dates, keeping it API compatible (return str instead of bytes).
  • Change the abstraction to use orjson under the hood in compatibility mode when reading (json.loads), meaning it should accept non-spec compliant JSON, and keep the API as-is (though I believe the reader side does not care about str vs bytes anyway?)
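A minimal sketch of what that central abstraction could look like (OPT_NON_STR_KEYS and OPT_UTC_Z are real orjson options; the Decimal fallback and the lenient-read fallback are assumptions about what the legacy encoder and existing payloads require):

```python
import json as stdlib_json
from decimal import Decimal

import orjson


def _default(obj):
    # Assumed fallback for types the legacy encoder handled but orjson
    # does not serialize natively (orjson already covers datetime, date,
    # time and uuid itself).
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"{type(obj)} is not JSON serializable")


def dumps(value) -> str:
    # OPT_NON_STR_KEYS coerces int/float/datetime dict keys to strings;
    # OPT_UTC_Z emits UTC datetimes with a trailing "Z". Decoding keeps
    # the existing str-returning API.
    return orjson.dumps(
        value,
        default=_default,
        option=orjson.OPT_NON_STR_KEYS | orjson.OPT_UTC_Z,
    ).decode("utf-8")


def loads(data):
    # orjson.loads accepts str and bytes alike, so the reader side is
    # indifferent to str vs bytes.
    try:
        return orjson.loads(data)
    except orjson.JSONDecodeError:
        # "Compatibility mode": fall back to the lenient stdlib parser for
        # legacy payloads (e.g. NaN/Infinity) that strict orjson rejects.
        return stdlib_json.loads(data)
```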

Assuming that the compatibility mode works properly, we have the following advantages:

  • No need to codemod the whole codebase, as it's going through a central abstraction.
  • No API changes (str vs bytes).
  • No change when reading existing JSON (assuming the compatibility mode works properly).
  • Outputs more spec compliant JSON.
  • Perf should still improve across the board, and the additional wins from using bytes or accepting more JSON should be negligible.

If you can really prove that avoiding str or compatibility mode helps in certain scenarios, you can still opt into that for specific pieces of code.

Swatinem · May 22 '24 08:05

@anonrig, as we discussed, here is a suggestion on how to split this PR logically into a finite number of PRs:

  • Everything in sentry/backup, which will be reviewed by @getsentry/open-source; they will tell you how to validate.
  • /src/sentry/buffer/redis.py. The buffers feature is actually quite high volume and critical. Worth testing in isolation. I'd even consider a feature flag for this.
  • /src/sentry/cache/redis.py. This is high volume as well.
  • src/sentry/charts/chartcuterie.py. There should be a code owner you can get help to review.
  • One PR per file under src/sentry/db/models/fields/. Those populate fields in the DB; we can break a lot of things, especially when writing. You may look into how often those fields are used to see how critical this is. If they are not used a lot, then it is easier.
  • Everything in sentry/monitors. One team owns that.
  • One for sentry/shared_integrations.
  • One for src/sentry/plugins.
  • src/sentry/utils/assets.py. This is heavily used.
  • src/sentry/utils/codecs.py. This is heavily used.
  • src/sentry/utils/outcomes.py. This is heavily used.
  • src/sentry/utils/email. This is heavily used.
  • Everything else

What do you think about this?

fpacifici · May 22 '24 22:05