Re-enable storage maintenance call
In https://github.com/mozilla-mobile/fenix/issues/2765 we added periodic storage maintenance calls that run once a day (or so), during startup.
In https://github.com/mozilla-mobile/fenix/issues/6937 we saw that this interferes with our fennec migration code, which also runs on startup. So, maintenance was disabled.
In https://github.com/mozilla-mobile/fenix/pull/6938 I proposed a solution for this problem, but it doesn't apply now that we're running migration code in a background service.
This issue is to re-enable the maintenance call in a way that doesn't interfere with other parts of the browser that may need to acquire higher levels of SQL locks (reserved+).
@grigoryk are we tracking this anywhere? Or is this a nice-to-have in the future?
Perf self triage: this doesn't seem like it'll land in GA so moving to low impact so we don't have to track it.
This isn't inherently a performance problem so removing it from our board. However, the proposed solution may have perf implications so if that's the case, please let us know. Thanks!
From the original issue:
@grigoryk indicated that it does not [run maintenance], which will result in worse perf, more disk usage, etc.
Re-adding to our board because this may impact long term app performance.
I'd guess the solution should happen via WorkManager when the app isn't running. If it does happen while the app is running, please don't put it into the startup critical path: it should probably live in the visualCompletenessQueue if it happens during startup.
See: https://github.com/mozilla-mobile/fenix/issues/17373
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@grigoryk I noticed stalebot closed this but this seems important. What do you think the urgency/priority is on this issue? If high, do you think it's something the team would have cycles to work on?
Triage: let's see if we can remove the migration code to solve this problem.
Nowadays, places::run_maintenance does the following:

    conn.execute_all(&[
        "VACUUM",
        "PRAGMA optimize",
        "PRAGMA wal_checkpoint(PASSIVE)",
    ])?;
For larger databases, VACUUM-ing could be quite expensive, both in terms of I/O and the free disk space required:
The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction. This means that when VACUUMing a database, as much as twice the size of the original database file is required in free disk space.
From the PRAGMA optimize docs, it seems that while it's usually fast, there are cases where it could be slow/expensive:
To achieve the best long-term query performance without the need to do a detailed engineering analysis of the application schema and SQL, it is recommended that applications run "PRAGMA optimize" (with no arguments) just before closing each database connection. Long-running applications might also benefit from setting a timer to run "PRAGMA optimize" every few hours. This pragma is usually a no-op or nearly so and is very fast. However if SQLite feels that performing database optimizations (such as running ANALYZE or creating new indexes) will improve the performance of future queries, then some database I/O may be done. Applications that want to limit the amount of work performed can set a timer that will invoke sqlite3_interrupt() if the pragma goes on for too long. Or, since SQLite 3.32.0, the application can use PRAGMA analysis_limit=N for some small value of N (a few hundred or a few thousand) to limit the depth of analyze.
Finally, PRAGMA wal_checkpoint(PASSIVE). Checkpointing is the process of moving the contents of the WAL back into the main database. Our WAL is limited to 2MB, so we know this won't be a very expensive operation, but it's still not free.
In PASSIVE mode it won't obtain a writer lock, so this should not trigger SQLITE_BUSY errors.
So, to summarize, I don't think any of these should run on startup. In most cases it's probably fine, but in some cases these operations involve expensive I/O and may fail (and cause other operations that would otherwise pass to fail) if we're running low on disk space.
An appropriate time to run this call, I think, is when the device is idle. We could schedule a WorkManager task for this. Ideally, how often this operation is performed should correlate with app usage - e.g. there's no point running a runMaintenance task every night if the app hasn't been used in weeks; that's just a waste of resources.
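As a rough sketch of that WorkManager idea (the worker class name, unique work name, and interval below are illustrative assumptions, not an existing Fenix implementation):

```kotlin
import android.content.Context
import android.os.Build
import androidx.annotation.RequiresApi
import androidx.work.Constraints
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import java.util.concurrent.TimeUnit

// Illustrative only: schedule a maintenance worker roughly once a day, and only
// while the device is idle, so it never competes with foreground browsing.
@RequiresApi(Build.VERSION_CODES.M) // setRequiresDeviceIdle requires API 23+
fun scheduleStorageMaintenance(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiresDeviceIdle(true)
        .build()

    // StorageMaintenanceWorker is a hypothetical worker that would wrap the
    // places maintenance call (see the sketch further down in this thread).
    val request = PeriodicWorkRequestBuilder<StorageMaintenanceWorker>(1, TimeUnit.DAYS)
        .setConstraints(constraints)
        .build()

    // KEEP means re-enqueueing on every startup doesn't reset the schedule.
    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "storage-maintenance",
        ExistingPeriodicWorkPolicy.KEEP,
        request
    )
}
```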
Also, when this call was enabled, we got the following crashes: https://sentry.prod.mozaws.net/operations/firefox-nightly/issues/6463190/events/71639081/ - Error executing SQL: UNIQUE constraint failed: moz_places.id. This is probably happening during a VACUUM... but maybe it's possible during checkpointing as well?
I agree with all of the above. In addition, we should prune history and history metadata. Ideally that could take some hints from the free storage space.
We already have a "prune history metadata" call - https://github.com/mozilla-mobile/fenix/blob/main/app/src/main/java/org/mozilla/fenix/components/Core.kt#L253. But it doesn't seem like we call pruneDestructively for history itself.
@mhammond @grigoryk Is this something we should do? If we don't know, how can we find out?
My only concern here is when to run it. Places only has 1 write connection, so whenever we do this, other operations which write to places will either be blocked, or be delayed, until it completes. It's also going to be fairly IO intensive, so startup doesn't seem right, because we want the browser to be as snappy as possible at startup. Doing it at idle time seems right - ideally we would also try and measure any impact, but I'm not sure if suitable metrics already exist or not for that.
But ultimately we probably can't defer this forever - it's in no-one's interest for this data to grow without bounds.
Triage: P1 – at the very least, we should understand the impact of doing this change or not making this change, both in storage cost and performance impact.
fwiw, there's a bug open where we observed a 2 GB app directory in fenix: https://github.com/mozilla-mobile/fenix/issues/14646
Update for this issue (mostly for my own memory) after speaking with Christian: the app services team is working on a better API for this feature so that we can cancel long-running maintenance tasks if the app is killed, in order to avoid putting the storage layer into a bad state.
We landed an improved version in https://github.com/mozilla/application-services/issues/5011 and https://github.com/mozilla/application-services/pull/5057 - the CHANGES addition in that PR has:
--8<--
Added an optional db_size_limit parameter to run_maintenance. This can be used to set a target size for the places DB. If the DB is over that size, we'll prune a few older visits. The number of visits is very small (6) to ensure that the operation only blocks the database for a short time. The intention is that run_maintenance() is called frequently compared to how often visits are added to the places DB.
--8<--
I'm not sure if that is actually interruptible though - @bendk, do you know? I think we should also add more docs to places.udl about how this works (probably just what's in that CHANGES entry). I think our hope is that it doesn't need to be interruptible due to the small amount of work it does - but the side-effect of that is that it probably does need to be called quite regularly.
We probably need a new task to actually integrate that change into Fenix, which I think we expect to help with.
(Oh, I meant to say that we copied the desktop maintenance strategy, both in terms of the small number of visits pruned, the target size specification, etc., primarily because it seems to work well for them, and us reinventing a different strategy would probably just mean we'd re-learn the mistakes they made along the way.)
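For reference, a minimal sketch of how that db_size_limit parameter might be consumed from the Kotlin side once it is plumbed through the bindings; the method signature, parameter type, and the size target below are assumptions, not the shipped API:

```kotlin
import mozilla.components.browser.storage.sync.PlacesHistoryStorage

// Assumed shape of the call once the new parameter is exposed through the Kotlin
// bindings; the parameter name/type and the size target are illustrative.
suspend fun maintainPlaces(storage: PlacesHistoryStorage) {
    val dbSizeLimit = 75u * 1024u * 1024u // e.g. a 75 MiB target, picked arbitrarily here
    // Prunes only a handful of old visits, and only if the DB is over the limit,
    // which is why the call is meant to be made regularly rather than as a one-off.
    storage.runMaintenance(dbSizeLimit)
}
```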
I didn't specifically make it interruptible, but you might be able to interrupt it via the SqlInterruptHandle on the write connection. That should interrupt any DB queries, which I think is most of the work involved. I'm happy to work with anyone who wants to improve this functionality too.
I imagine we want to have a hook that feeds or monitors the application lifecycle so that we can call said interrupt when the app is about to be killed (Application.onTerminate, or would it be more appropriate to stop it during Application.onLowMemory?).
@jonalmeida What I was hoping re: interrupt and cancellation is that we can use WorkManager with the setRequiresDeviceIdle(true) (and perhaps also setRequiresBatteryNotLow(true)) constraint(s) to kick off the run_maintenance call. However, when the constraint is no longer met and our task is cancelled, we'd want to interrupt the call to make sure it doesn't keep running.
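A minimal sketch of how that cancellation could translate into an interrupt, assuming a blocking Worker wired up to the places storage; the class name, the injected storage, and the use of cancelWrites (discussed further down in this thread) are assumptions, and in practice a WorkerFactory would be needed to construct it:

```kotlin
import android.content.Context
import androidx.work.Worker
import androidx.work.WorkerParameters
import kotlinx.coroutines.runBlocking
import mozilla.components.browser.storage.sync.PlacesHistoryStorage

// Illustrative sketch only: a blocking Worker whose onStopped() hook lets us
// interrupt the in-flight maintenance when WorkManager cancels the task
// (constraint no longer met, app shutting down, etc.).
class StorageMaintenanceWorker(
    context: Context,
    params: WorkerParameters,
    // In Fenix this would come from the existing components (e.g. the places
    // history storage held by Core); injected here purely for clarity.
    private val storage: PlacesHistoryStorage
) : Worker(context, params) {

    override fun doWork(): Result = try {
        // runMaintenance is a suspend call in the storage layer, so block this
        // background thread until it finishes or is interrupted.
        runBlocking { storage.runMaintenance() }
        Result.success()
    } catch (e: Exception) {
        // Interrupted or failed; try again on the next scheduled run.
        Result.retry()
    }

    override fun onStopped() {
        // Interrupt any active SQL on the write connection so the writer is
        // released promptly (assumes the cancelWrites call mentioned below).
        storage.cancelWrites()
    }
}
```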
We could land telemetry to record performance of the new call, and potentially decide that we don't need to interrupt. I would expect to see outliers though with very slow IO operations, and it would be nice to handle these cases by interrupting the operation if we can.
Looking through the code, I'm pretty sure that calling interrupt() on the WriteConnection should work 99% of the time, but there are some corner cases where it would fail. The interrupt() call currently only interrupts an active SQL query, so if the code is in-between queries or hasn't started running queries, then it might not properly interrupt it. Maybe this is okay though? When I combine this with Grisha's analysis, it seems like it might be a good enough solution.
We could lower the chance of failure by using an interrupt scope in the runMaintenance() call. That would make it so interrupt() would interrupt basically all calls, except for extreme edge cases where a Kotlin thread calls runMaintenance() then goes to sleep before the Rust code gets invoked, and another Kotlin thread calls interrupt() during that sleep. This wouldn't require any API changes.
If you wanted 100% reliability, we could also do that, but the API would need to be more complicated. I think it would look something like this: you call prepareMaintenance() when the task is scheduled and it returns an object with a runMaintenance(dbSize) method. Then interrupt() would always work if it was called after prepareMaintenance().
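A sketch of what that hypothetical API shape could look like from the Kotlin side; none of these names exist today, they just mirror the proposal above:

```kotlin
// Entirely hypothetical API shape mirroring the proposal above; nothing here exists yet.
interface MaintenanceTask {
    // Runs the actual maintenance work; interruptible at any point after prepareMaintenance().
    suspend fun runMaintenance(dbSize: ULong)

    // Safe to call from any thread, before or during runMaintenance().
    fun interrupt()
}

interface MaintenanceCapableStorage {
    // Called when the task is scheduled; sets up the interrupt scope up-front so that
    // interrupt() is reliable even before any SQL has started executing.
    fun prepareMaintenance(): MaintenanceTask
}
```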
OK, that's good to know, thanks! I think we can start with your proposal. We have recently exposed the interrupt call of the WriteConnection via the new cancelWrites method on the storage.
Thought about two more issues worth considering:
- I think we'd want to use a dedicated writer for this though, so we don't interrupt/cancel the wrong operation by mistake when using a shared writer.
- We should also make sure we don't run this every time the device is idle. Maybe just once a day?
Thought about two more issues worth considering:
* I think we'd want to use a dedicated writer for this though, so we don't interrupt/cancel the wrong operation by mistake when using a shared writer.
I don't think using a dedicated writer will be able to work without the risk of the "other" writer seeing "database is locked" errors. I do agree we need to work out how to not interrupt the wrong operation though (and I think Ben was suggesting a way to "scope" the interrupt, but I haven't thought it through)
* We should also make sure we don't run this every time the device is idle. Maybe just once a day?
The call only removes 6 visits each time it is called, so doing it once per day seems unlikely to effectively keep a cap on the number of visits.
The call only removes 6 visits each time it is called, so doing it once per day seems unlikely to effectively keep a cap on the number of visits.
OK, makes sense! But running it every time the device is idle might also not be ideal for battery consumption. If you all have a suggestion of how often we should ideally run this, please let us know. We could start with a compromise and introduce some upper limit of runs per day otherwise.
It's certainly true that we are never calling it today, so ramping up slowly should in theory have some small benefit. Another challenge is that a world where this has been in place and called regularly (ie, there will usually be just a few records to purge) is quite different from the world we have today (where there will be many as it catches up).
IOW, calling it daily is still better than what we do today, so let's do that! But it does open the question about telemetry etc - how will we ever know when we've found the sweet-spot?
I don't think using a dedicated writer will be able to work without the risk of the "other" writer seeing "database is locked" errors. I do agree we need to work out how to not interrupt the wrong operation though (and I think Ben was suggesting a way to "scope" the interrupt, but I haven't thought it through)
I didn't have a plan for how to stop this, so maybe adding another writer is the simplest way to move forward. My one concern is that the number of connections continues to grow as we add more interruptible operations. If that happens we should probably come up with a different solution.
OK, makes sense! But running it every time the device is idle might also not be ideal for battery consumption. If you all have a suggestion of how often we should ideally run this, please let us know. We could start with a compromise and introduce some upper limit of runs per day otherwise.
I like the idea of starting once per day and increasing based on telemetry. Maybe we count how many devices are still over the size target after running maintenance and try to get that down to some small-ish number.
As to how often: maybe we could return a boolean indicating whether the DB is still above the size target after running maintenance. If that's false, then just run it once a day (maybe even less if the app isn't used every day). If it's true, then run it more often, with some upper limit.
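A rough sketch of that adaptive frequency idea, assuming runMaintenance were changed to report whether the DB is still over the size target; the boolean return and the intervals below are illustrative:

```kotlin
import java.util.concurrent.TimeUnit

// Hypothetical: pick the next run interval from whether the last maintenance pass
// left the DB above the size target. The boolean input and intervals are illustrative.
fun nextMaintenanceDelayMs(stillOverSizeTarget: Boolean, runsToday: Int): Long {
    val maxRunsPerDay = 8 // upper bound so frequent idle windows don't drain the battery
    return when {
        !stillOverSizeTarget -> TimeUnit.DAYS.toMillis(1)        // healthy DB: once a day
        runsToday < maxRunsPerDay -> TimeUnit.HOURS.toMillis(3)  // catching up: run more often
        else -> TimeUnit.DAYS.toMillis(1)                        // hit the daily cap: back off
    }
}
```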
Saw this next in our backlog, but it seems like there are still discussions needed to settle on a plan, plus a dependency on AS? Not sure what the next steps are for Fenix. @csadilek @mhammond @bendk.