grist-core icon indicating copy to clipboard operation
grist-core copied to clipboard

Stop using Redis to track the expected version of a document in s3

Open vviers opened this issue 2 years ago • 13 comments

This part of the code https://github.com/gristlabs/grist-core/blob/9d00ee9ba7443b0d4c62d3f17627dfdecec46af1/app/server/lib/ExternalStorage.ts#L224-L237

makes a document crash if the checksum stored in redis doesn't match the one retrieved from s3.

Since most s3 implementations now have strong consistency guarantees, @paulfitz suggests that maybe this behaviour is not needed anymore.

If we want to keep using redis to store document versions, my suggestion would be to turn this error log into a warning and maybe let the download happen anyways and trust s3 by default ?

vviers avatar Nov 15 '23 09:11 vviers

Hi @paulfitz!

As I understand, we must be able to remove the part of the above check while downloading.

I was wondering if we could also skip the checksum computation while uploading. But it looks like the checksum is used to prevent useless uploads (when the file to be uploaded is unchanged): https://github.com/gristlabs/grist-core/blob/9d00ee9ba7443b0d4c62d3f17627dfdecec46af1/app/server/lib/ExternalStorage.ts#L158-L166

Do you have any thoughts about this?

If we want to keep using redis to store document versions, my suggestion would be to turn this error log into a warning and maybe let the download happen anyways and trust s3 by default ?

That alternative seems also fair :).

fflorent avatar Nov 16 '23 10:11 fflorent

My guess would be that removing the checksum entirely would be a bigger change, and that it may still have utility. Converting the error to a warning seems less disruptive.

I'd also plan to do a fair bit of testing after the change. I'm confident the original motivation for this behavior is gone, but that doesn't guarantee that removing it will be trouble-free (like removing someone's appendix).

paulfitz avatar Nov 16 '23 18:11 paulfitz

@paulfitz I opened this PR: #767

However, the more I think about it, the more I have doubts about the usefulness of this flag I proposed in this PR (GRIST_DISCARD_REDIS_CHECKSUM_MISMATCH): I don't feel like letting some administrators enable it would let us be more confident to remove it at some point and by default only warn the user. Probably it would be better letting you do your own tests.

I close the PR for now, unless you tell me that's has some interest. I wonder what are the tests you intend to do and if I can help somehow.

Happy thanksgiving!

fflorent avatar Nov 22 '23 17:11 fflorent

@paulfitz would you like us to re-open Florent's PR so we can run some tests ? I still believe that simplifying the architecture around version consistency check would be useful

vviers avatar Jan 04 '24 10:01 vviers

Sorry for missing the pings on this thread.

I think it would be worth running the tests in test/server/lib/HostedStorageManager.ts with Florent's flag set. There may be an early test in it that needs to be skipped since it specifically probes redis consistency checking, but I think the bulk should pass.

For our SaaS, we run a battery of the nbrowser tests against an AWS S3 configuration. I'd also like to confirm they pass with Florent's flag set. We could do that check.

Then I'd want to go through a period where we set the variable on the staging deployment we use for all Grist Labs work, so it gets a good work-out on real workloads against AWS S3.

At that point, as far I'm concerned, the new behavior could be made the default and the check could just be deleted.

paulfitz avatar Jan 12 '24 15:01 paulfitz

Thanks @paulfitz for your feedback! I reopened the PR https://github.com/gristlabs/grist-core/pull/767 and run the tests as you suggested. The first test failed.

I adapted the test so it checks with and without the flag, and it passes now.

fflorent avatar Feb 13 '24 11:02 fflorent

Thanks @fflorent for pushing this forward. I'm working through the steps mentioned in https://github.com/gristlabs/grist-core/issues/751#issuecomment-1889557946

  • [x] support GRIST_SKIP_REDIS_CHECKSUM_MISMATCH flag
  • [x] check full battery of Grist Lab's SaaS tests with flag set
  • [x] run Grist Lab's staging servers with flag set for a while

Once through that, we at GL will be fine with the default behavior changing.

paulfitz avatar Mar 14 '24 18:03 paulfitz

So ... Can we close this issue ? :smile:

CamilleLegeron avatar Apr 11 '24 12:04 CamilleLegeron

So ... Can we close this issue ? 😄

Yesterday, I misread the last Paul's todo item:

run Grist Lab's staging servers with flag set for a while

I don't know how much time we should wait, but probably a month is not enough. Also this issue will be closed once the default behavior is changed as Paul stated.

This PR is just a step toward this goal but not the goal itself.

fflorent avatar Apr 12 '24 07:04 fflorent

Thanks for reminder about this, working now on getting that last checkbox ticked off.

paulfitz avatar Apr 16 '24 13:04 paulfitz

@fflorent testing in https://github.com/gristlabs/grist-core/issues/751#issuecomment-1998095415 looks fine. It would be ok by Grist Labs for the default behavior to change.

paulfitz avatar Apr 22 '24 19:04 paulfitz

@paulfitz We would be happy to open a new PR to make this the default behavior.

Edit: would it be acceptable for you a PR which removes the GRIST_DISCARD_REDIS_CHECKSUM_MISMATCH env variable btw and which discards the redis checksum in all the cases?

fflorent avatar Jun 13 '24 12:06 fflorent

Edit: would it be acceptable for you a PR which removes the GRIST_DISCARD_REDIS_CHECKSUM_MISMATCH env variable btw and which discards the redis checksum in all the cases?

This is acceptable, yes. Thanks for following up on this @fflorent

paulfitz avatar Jun 20 '24 14:06 paulfitz