grist-core
grist-core copied to clipboard
Vacuum active document regularly
Context
We have noticed that some document can become very heavy, despite having reduced the history size.
Proposed solution
Requesting a vacuum significantly reduces their size. ~~This commit introduces a timer to run this job regularly and optimize the disk usage on the storage.~~ UPDATE: This commit runs a VACUUM before closing the document for inactivity (if no shutdown is already in progress)
Has this been tested?
- [ ] 👍 yes, I added tests to the test suite
- [ ] 💭 no, because this PR is a draft and still needs work
- [x] 🙅 no, because this is not relevant here
- [ ] 🙋 no, because I need help
There is a related TODO over here: https://github.com/gristlabs/grist-core/blob/cac05de159a99e6cece1a48b31748d3cfdf39fbe/app/server/lib/ActionHistoryImpl.ts#L712-L717
The strategy proposed here feels a bit aggressive? If a vacuum happens to be in progress, might a document become non-responsive for some seconds, or 10s of seconds for a large document? That's not the end of the world, but it does feel a little unpredictable?
The strategy proposed here feels a bit aggressive? If a vacuum happens to be in progress, might a document become non-responsive for some seconds, or 10s of seconds for a large document? That's not the end of the world, but it does feel a little unpredictable?
@paulfitz That's a very good point. We had initially evaluated the risk of a VACUUM taking long, but on a fast disk, and eliminating the risks too quickly. On a disk with a (very) bad write speed, this can indeed take even more than a minute.
We propose another strategy: run the VACUUM when a document is being closed for inactivity and if no shutdown is in progress. This way, we hope that's much less agressive. Also we manage to only mark a change if the document size as been reduced above 10% (and push the changes when we meet this condition in case we use S3).
Do you agree with the global idea?
In the next few days, we will work on implementing the tests.
Thinking of tests using MinIO, check whether a new version is uploaded if and only if VACUUMED version has significantly reduced size when shutting down.
We propose another strategy: run the VACUUM when a document is being closed for inactivity and if no shutdown is in progress. This way, we hope that's much less aggressive. Also we manage to only mark a change if the document size as been reduced above 10% (and push the changes when we meet this condition in case we use S3).
Do you agree with the global idea?
In the next few days, we will work on implementing the tests.
If that is sufficient for you, it certainly feels like a safer default. There are already some cleanup operations that are done more or less this way I believe. More aggressive strategies are possible, but would be less worrisome if configurable by individual operators based on their hardware.
@paulfitz I think in the very most cases, that's very sufficient :)
Thanks for your reply!
I need to test that again with a MinIO bucket. I'll come back when my tests are conclusive.
Works pretty well!
Errors seems to be related to our work:
1) ActiveDocShutdown
"after each" hook for "should not close ActiveDoc while loading":
Error: Timeout of 10000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/runner/work/grist-core/grist-core/_build/test/server/lib/ActiveDocShutdown.js)
at listOnTimeout (node:internal/timers:588:17)
at processTimers (node:internal/timers:523:7)
I can reproduce it locally.
This seems to be due to the fact that this line loads a document and waits for the formulas: https://github.com/gristlabs/grist-core/blob/3b66e0625027c10800f96ef9fe12cee6a7da6367/test/server/lib/ActiveDocShutdown.ts#L139C57-L139C64
Sooo… Can be reviewed, but it still has issues to be fixed first ⇒ converting it to draft
Errors seems to be related to our work:
1) ActiveDocShutdown "after each" hook for "should not close ActiveDoc while loading": Error: Timeout of 10000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/runner/work/grist-core/grist-core/_build/test/server/lib/ActiveDocShutdown.js) at listOnTimeout (node:internal/timers:588:17) at processTimers (node:internal/timers:523:7)I can reproduce it locally.
This seems to be due to the fact that this line loads a document and waits for the formulas: https://github.com/gristlabs/grist-core/blob/3b66e0625027c10800f96ef9fe12cee6a7da6367/test/server/lib/ActiveDocShutdown.ts#L139C57-L139C64
Sooo… Can be reviewed, but it still has issues to be fixed first ⇒ converting it to draft
Maybe a collision with initializeDocUsage? Some chance it could be doing a query when the vacuum is issued. Not sure why that would be a problem though.
Maybe a collision with initializeDocUsage? Some chance it could be doing a query when the vacuum is issued. Not sure why that would be a problem though.
@paulfitz I'll take a closer look next Monday, but something interesting I saw (IIRC): executing SELECT 1 right before the VACUUM seems to somehow fix the issue, but it is a unsatisfying hack
@paulfitz It is indeed _initializeDocUsage. But that's not because of the nature of the queries run inside, but rather because they are run without being awaited.
If you switch your workspace to this commit: e46a8af9883dc9780e2e4f8c9d23fdd231c3cd59
And if you run this command: GREP_TESTS="should not close ActiveDoc while loading" LANGUAGE=en_US yarn test:server.
It works somehow.
But if you apply this patch:
diff --git a/app/server/lib/ActiveDoc.ts b/app/server/lib/ActiveDoc.ts
index fa5e2260..9b73bebd 100644
--- a/app/server/lib/ActiveDoc.ts
+++ b/app/server/lib/ActiveDoc.ts
@@ -3283,6 +3283,7 @@ export class ActiveDoc extends EventEmitter {
// corruption.
try {
await this._initDocUsagePromise?.catch(() => {});
+ void (this.docStorage.getDB().exec("SELECT 42;")); // Same result if you replace call to `exec()` by `run()` or `get()`
await this.docStorage.vacuum();
} catch (err) {
if (err.code === "ENOENT") {
It does not work anymore, just because you ran a SELECT query before that you don't await.
Removing the limitAttach() calls makes this test pass, but of course it fails to effectively vacuum.
So there's something unsatisfying with this patch.
I now tend to think just running a SELECT 1; query that we await is a satisfying hack 😬.
I am adding comments and a specific test for that case, with the hope it will make sense.