grist-core
grist-core copied to clipboard
Stop using Redis to track the expected version of a document in s3
This part of the code https://github.com/gristlabs/grist-core/blob/9d00ee9ba7443b0d4c62d3f17627dfdecec46af1/app/server/lib/ExternalStorage.ts#L224-L237
makes a document crash if the checksum stored in redis doesn't match the one retrieved from s3.
Since most s3 implementations now have strong consistency guarantees, @paulfitz suggests that maybe this behaviour is not needed anymore.
If we want to keep using redis to store document versions, my suggestion would be to turn this error log into a warning and maybe let the download happen anyways and trust s3 by default ?
Hi @paulfitz!
As I understand, we must be able to remove the part of the above check while downloading.
I was wondering if we could also skip the checksum computation while uploading. But it looks like the checksum is used to prevent useless uploads (when the file to be uploaded is unchanged): https://github.com/gristlabs/grist-core/blob/9d00ee9ba7443b0d4c62d3f17627dfdecec46af1/app/server/lib/ExternalStorage.ts#L158-L166
Do you have any thoughts about this?
If we want to keep using redis to store document versions, my suggestion would be to turn this error log into a warning and maybe let the download happen anyways and trust s3 by default ?
That alternative seems also fair :).
My guess would be that removing the checksum entirely would be a bigger change, and that it may still have utility. Converting the error to a warning seems less disruptive.
I'd also plan to do a fair bit of testing after the change. I'm confident the original motivation for this behavior is gone, but that doesn't guarantee that removing it will be trouble-free (like removing someone's appendix).
@paulfitz I opened this PR: #767
However, the more I think about it, the more I have doubts about the usefulness of this flag I proposed in this PR (GRIST_DISCARD_REDIS_CHECKSUM_MISMATCH): I don't feel like letting some administrators enable it would let us be more confident to remove it at some point and by default only warn the user. Probably it would be better letting you do your own tests.
I close the PR for now, unless you tell me that's has some interest. I wonder what are the tests you intend to do and if I can help somehow.
Happy thanksgiving!
@paulfitz would you like us to re-open Florent's PR so we can run some tests ? I still believe that simplifying the architecture around version consistency check would be useful
Sorry for missing the pings on this thread.
I think it would be worth running the tests in test/server/lib/HostedStorageManager.ts with Florent's flag set. There may be an early test in it that needs to be skipped since it specifically probes redis consistency checking, but I think the bulk should pass.
For our SaaS, we run a battery of the nbrowser tests against an AWS S3 configuration. I'd also like to confirm they pass with Florent's flag set. We could do that check.
Then I'd want to go through a period where we set the variable on the staging deployment we use for all Grist Labs work, so it gets a good work-out on real workloads against AWS S3.
At that point, as far I'm concerned, the new behavior could be made the default and the check could just be deleted.
Thanks @paulfitz for your feedback! I reopened the PR https://github.com/gristlabs/grist-core/pull/767 and run the tests as you suggested. The first test failed.
I adapted the test so it checks with and without the flag, and it passes now.
Thanks @fflorent for pushing this forward. I'm working through the steps mentioned in https://github.com/gristlabs/grist-core/issues/751#issuecomment-1889557946
- [x] support GRIST_SKIP_REDIS_CHECKSUM_MISMATCH flag
- [x] check full battery of Grist Lab's SaaS tests with flag set
- [x] run Grist Lab's staging servers with flag set for a while
Once through that, we at GL will be fine with the default behavior changing.
So ... Can we close this issue ? :smile:
So ... Can we close this issue ? 😄
Yesterday, I misread the last Paul's todo item:
run Grist Lab's staging servers with flag set for a while
I don't know how much time we should wait, but probably a month is not enough. Also this issue will be closed once the default behavior is changed as Paul stated.
This PR is just a step toward this goal but not the goal itself.
Thanks for reminder about this, working now on getting that last checkbox ticked off.
@fflorent testing in https://github.com/gristlabs/grist-core/issues/751#issuecomment-1998095415 looks fine. It would be ok by Grist Labs for the default behavior to change.
@paulfitz We would be happy to open a new PR to make this the default behavior.
Edit: would it be acceptable for you a PR which removes the GRIST_DISCARD_REDIS_CHECKSUM_MISMATCH env variable btw and which discards the redis checksum in all the cases?
Edit: would it be acceptable for you a PR which removes the GRIST_DISCARD_REDIS_CHECKSUM_MISMATCH env variable btw and which discards the redis checksum in all the cases?
This is acceptable, yes. Thanks for following up on this @fflorent