
Epic: Support time-based PITR

Open stepashka opened this issue 3 years ago • 8 comments

start with an RFC

  • [x] GC interval is configured in LSN (bytes), not time #1332 #1504
  • [x] GC interval is configured per pageserver, not per tenant #1332 #1504
  • [x] #1361
  • [x] make it possible to check the current config (this is needed for tests) #1332 #1504
  • [x] #1554
  • [ ] allow to specify upper bound on history size so the condition in gc is OR between time and size
  • [ ] create follow up tasks for console to configure PITR for projects (tenants)
  • [ ] warning on possible missing auth details (and users?)

stepashka avatar Feb 23 '22 13:02 stepashka

I wonder whether the time and LSN criteria for PITR should be ORed or ANDed. Right now my implementation uses AND: a record is removed only if it is outside both the specified time interval and the LSN range. That seems to be the safest option, but it makes it impossible to say "I want to keep changes for one month, but only if they do not exceed 100 GB". Should we let the user choose between OR and AND? That seems like overkill... Or always use OR?
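To make the two options concrete, here is a minimal sketch of the removal predicate. It is not the actual pageserver GC code: `Record`, `is_removable`, and `pitr_interval` are hypothetical names used only for illustration, and `gc_horizon` is assumed to be a distance in WAL bytes behind the latest LSN.

```python
from dataclasses import dataclass


@dataclass
class Record:
    lsn: int          # position in the WAL, in bytes (assumed representation)
    timestamp: float  # seconds since the epoch


def is_removable(rec, last_lsn, now, gc_horizon, pitr_interval, mode="and"):
    """Decide whether a record may be garbage collected.

    mode="and": remove only if the record is outside BOTH limits (keeps more).
    mode="or":  remove if the record is outside EITHER limit (bounds history size).
    """
    beyond_lsn = rec.lsn < last_lsn - gc_horizon
    beyond_time = rec.timestamp < now - pitr_interval
    if mode == "and":
        return beyond_lsn and beyond_time
    if mode == "or":
        return beyond_lsn or beyond_time
    raise ValueError(f"unknown mode: {mode!r}")
```

With AND, a record that is older than the time interval but still within the LSN horizon is kept; with OR it is removed, which is what caps the amount of retained history.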

knizhnik avatar Feb 28 '22 09:02 knizhnik

On 28/02/2022 11:59, Konstantin Knizhnik wrote:

I wonder whether the time and LSN criteria for PITR should be ORed or ANDed. Right now my implementation uses AND: a record is removed only if it is outside both the specified time interval and the LSN range. That seems to be the safest option, but it makes it impossible to say "I want to keep changes for one month, but only if they do not exceed 100 GB". Should we let the user choose between OR and AND? That seems like overkill... Or always use OR?

AND seems better to me, if we have to choose. We probably don't want to even expose the LSN option to users, to keep things simple.

- Heikki

hlinnaka avatar Feb 28 '22 10:02 hlinnaka

While the timestamp is the better choice for the customer, we may need an internal API that allows PITR to a given LSN. This will be needed for bug investigation or emergency data recovery; in such cases the LSN takes precedence.
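A rough sketch of how such precedence could look, assuming a hypothetical in-memory list of `(commit_timestamp, lsn)` pairs sorted by timestamp. The function name `resolve_restore_lsn` and the list itself are illustrative assumptions, not an existing API.

```python
import bisect


def resolve_restore_lsn(commits, timestamp=None, lsn=None):
    """Pick the LSN to restore to.

    commits: list of (commit_timestamp, lsn) tuples sorted by timestamp
             (a hypothetical lookup table, for illustration only).
    An explicit LSN always takes precedence over a timestamp.
    """
    if lsn is not None:
        return lsn
    if timestamp is None:
        raise ValueError("either timestamp or lsn is required")
    times = [t for t, _ in commits]
    # Index of the first commit strictly after the requested timestamp.
    idx = bisect.bisect_right(times, timestamp)
    if idx == 0:
        raise ValueError("timestamp precedes the retained history")
    return commits[idx - 1][1]
```

Customers would only ever pass `timestamp`; the `lsn` argument would be reserved for internal use such as bug investigation.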

antons-antons avatar Feb 28 '22 21:02 antons-antons

I have not changed the console code to make it possible to specify per-tenant parameters, but it is now possible through the CLI:

zenith tenant create -c "gc_period:10 s" -c "gc_horizon:100000"

knizhnik avatar Mar 01 '22 15:03 knizhnik

Before the PITR support discussion scheduled for tomorrow, I want to share my thoughts on it:

  1. Nobody will want to roll back too far into the past (e.g. one week), because it actually means losing updates for the whole week. Users may want to perform "time travel" and check the state of the database a week ago. But that is a slightly different story: it is not a "recovery".
  2. For time travel far into the past, we do not need to specify a precise time, e.g. 7 days, 1 hour, 13 minutes and 33.069 seconds ago. So it is usually enough to save just one snapshot per day. If the size of a free-tier database is limited to 10 GB, then storing 7 snapshots will require only 70 GB, which is smaller than 100 GB (the suggested limit for the PITR interval).
  3. Users know little or nothing about LSNs and WAL. So saying that the PITR interval is limited to 100 GB tells them nothing, except that despite requesting a one-week PITR interval, they still may not be able to see the state of the database one week ago.

So my suggestions are:

  1. Do not publish the LSN limit for the PITR interval. Users will specify only a time range for PITR, and it should be limited to, let's say, 10 days. In this case we can enforce that the size of a user's data will not exceed 100 GB.
  2. Assume that WAL (delta layers) is stored only for one day. This can also be made configurable if needed. Older delta layers are removed and we store only snapshots (image layers), one per day (this can also be made configurable).
  3. Image layers older than 10 days can be removed from S3.
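The three suggestions above can be sketched as a retention filter. This is a simplified illustration under stated assumptions: `Layer` and `layers_to_keep` are hypothetical names, layers are identified only by kind and creation time, and key ranges and LSNs are ignored entirely.

```python
from dataclasses import dataclass

DAY = 24 * 3600  # seconds


@dataclass(frozen=True)
class Layer:
    kind: str        # "delta" or "image"
    created_at: int  # seconds since the epoch


def layers_to_keep(layers, now, delta_retention=DAY, pitr_interval=10 * DAY):
    """Keep recent delta layers plus the newest image layer per day of age."""
    keep = []
    newest_image_per_day = {}
    for layer in layers:
        age = now - layer.created_at
        if layer.kind == "delta":
            # Suggestion 2: WAL (delta layers) is kept for one day only.
            if age <= delta_retention:
                keep.append(layer)
        elif age <= pitr_interval:
            # Suggestion 2: one snapshot (image layer) per day.
            day = age // DAY
            best = newest_image_per_day.get(day)
            if best is None or layer.created_at > best.created_at:
                newest_image_per_day[day] = layer
        # Suggestion 3: image layers older than pitr_interval are dropped.
    keep.extend(newest_image_per_day.values())
    return keep
```

Both retention periods are parameters, matching the note that they could be made configurable.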

knizhnik avatar Apr 07 '22 14:04 knizhnik

While I agree that a space limitation on PITR data is not customer friendly, I think it should be enforced (only) on the Free Tier. It would be a big trust buster if a customer wanted to restore a database only to discover they can no longer do that (keep in mind that the database transaction log size isn't directly under the customer's control).

Onto the assumptions:

  1. Customers may restore a database to a much older time than 1 week and they want to pick the restore time. In the cloud, database restore is not always a data loss as it's not in place. Not all PITR is to get production back up and running.
  2. Same applies to time travel, we should not limit customers' ability to restore to arbitrary time.
  3. LSNs mean nothing to the customer, XIDs as well.
  4. If a customer cares about known good database state, they can create a "snapshot" (at latest or at a timestamp in the window)
  5. Image Layers are taken at different LSNs

with that

  1. Image and delta layers are garbage collected as long as they're not required for restore within a PITR window (same rule for image or delta layers)
  2. We publish PITR history size and alarm a customer when size limitation is (to be?) applied vs time limit

antons-antons avatar Apr 08 '22 00:04 antons-antons

As far as I understand, we are not going to support branching from the very beginning. But do we want to support PITR? IMHO the latter (recovery) is much more important than branching (especially without merge capability). But to perform recovery without branching we need to implement some other recovery mechanism.

While I agree that space limitation on PITR data is not customer friendly I think it should be enforced (only) on the Free Tier.

Despite the fact that S3 storage is very cheap, I think it is still not acceptable to store all history. And IMHO it is very rarely needed for OLTP databases. For OLAP, scientific databases, ... yes, "reproducibility" is a very important property. But even for them storing data forever is overkill.

  1. Customers may restore a database to a much older time than 1 week and they want to pick the restore time.

Why? Do you have in mind any realistic scenario?

  2. Same applies to time travel, we should not limit customers' ability to restore to arbitrary time.

There are temporal databases, "as of" queries,... but they require something more than just getting some old snapshot of the database. Snapshots are needed mostly for 1) recovery and 2) investigating the source of some problem. Can you suggest some other use cases?

  3. LSNs mean nothing to the customer, XIDs as well.

LSN: agree. An XID can be obtained by an experienced DBA, either from logs or by select xmin from.... So it can be used to restore to a particular transaction.

  4. If a customer cares about known good database state, they can create a "snapshot" (at latest or at a timestamp in the window)

Not always. Assume that some table was accidentally dropped (a DBA mistake). I want to restore the latest state of the database in which the table still existed.

  5. Image Layers are taken at different LSNs

Is it a problem? We just remove layers that were created more than N days ago if a newer layer exists. Yes, it can happen that at some moment we have only a subset of a table's data at some LSN beyond the PITR interval. So if we roll back to that point, we get an inconsistent state. But we should reject attempts to create a branch at a point beyond the PITR interval.

Image and delta layers are garbage collected as long as they're not required for restore within a PITR window (same rule for image or delta layers)

I do not completely understand what you mean here (image AND delta = image OR delta?). But a delta layer can be removed only if a newer image layer exists.
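One way to read the rule under discussion, as a simplified sketch: a layer is removable only if an image layer at or before the PITR cutoff already covers everything it contains. `Layer`, `can_remove`, and `cutoff_lsn` are hypothetical names, and key ranges are ignored here, which the real layer map would have to take into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    kind: str     # "image" or "delta"
    end_lsn: int  # delta: exclusive upper bound of its WAL range;
                  # image: the LSN the image was taken at


def can_remove(layer, layers, cutoff_lsn):
    """True if a newer image layer at or below cutoff_lsn supersedes `layer`.

    cutoff_lsn is the oldest LSN that must stay restorable (start of the
    PITR window). Layers needed to reconstruct any LSN >= cutoff_lsn are kept.
    """
    for other in layers:
        if other is layer or other.kind != "image":
            continue
        if other.end_lsn > cutoff_lsn:
            # This image is inside the window; it cannot serve restores at the cutoff.
            continue
        if layer.kind == "delta" and other.end_lsn >= layer.end_lsn:
            return True
        if layer.kind == "image" and other.end_lsn > layer.end_lsn:
            return True
    return False
```

Under this reading the same rule applies to image and delta layers: neither is dropped unless a newer image below the cutoff makes it redundant.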

We publish PITR history size and alarm a customer when size limitation is (to be?) applied vs time limit

IMHO it is possible, but it is not user-friendly behavior.

knizhnik avatar Apr 08 '22 12:04 knizhnik

After the recent talks it looks like we need the OR behavior suggested by @knizhnik, because otherwise the amount of history we keep can be unpredictable and burn a lot of money on S3. I.e., with a 10 GB database and 7 days of history it can occupy more than 10 terabytes.
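A back-of-the-envelope sketch of why the history is unbounded with a time-only limit: retained WAL grows with the write rate, not with the database size. The function name and the 20 MB/s write rate below are assumptions chosen for illustration, not measurements.

```python
def retained_history_bytes(wal_bytes_per_sec, pitr_seconds, size_cap_bytes=None):
    """Estimate retained WAL for a sustained write rate.

    Without a cap, history is bounded by time only. With a cap (the OR
    behavior), whichever limit is reached first wins.
    """
    by_time = wal_bytes_per_sec * pitr_seconds
    if size_cap_bytes is None:
        return by_time
    return min(by_time, size_cap_bytes)


WEEK = 7 * 24 * 3600  # seconds

# Assumed workload: a small database that is updated in place at 20 MB/s of WAL.
print(retained_history_bytes(20_000_000, WEEK))                   # 12096000000000 (about 12 TB)
print(retained_history_bytes(20_000_000, WEEK, 100_000_000_000))  # 100000000000 (100 GB cap)
```

The database itself can stay at 10 GB while the history exceeds 10 TB, since repeated updates to the same rows generate WAL without growing the data.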

LizardWizzard avatar Aug 23 '22 12:08 LizardWizzard

It's a duplicate

vadim2404 avatar Nov 18 '22 12:11 vadim2404