Support to specify a disk quota for intermediate files
Feature Request
Is your feature request related to a problem? Please describe:
Lightning needs a volume to save intermediate files, and it's hard to predict how large this disk must be, so we have to prepare as large a disk as possible to hold these temporary files. For example, if I need to restore 2 TB of data, we have to prepare a 2 TB volume for Lightning. This is a bad experience when running on the cloud.
Describe the feature you'd like:
We want to specify the volume size, so that the intermediate files will not exceed this size during the Lightning process.
Describe alternatives you've considered:
none
Teachability, Documentation, Adoption, Optimization:
by "check point file" you mean those "SST files" in the local backend?
> by "check point file" you mean those "SST files" in the local backend?
Yes. I used "intermediate files" instead.
Seems we can use https://pkg.go.dev/github.com/cockroachdb/pebble#DB.EstimateDiskUsage to fetch the disk usage.
Abstract
Periodically, before a WriteRows, check every engine's total estimated disk usage. If the total disk usage exceeds the "(soft) disk quota", we block writes to the largest engines until the remaining total is less than the quota, and flush the blocked engines' content into TiKV. The engine UUIDs are reused.
This will cause subsequent imports to suffer from range overlapping, which we have to accept as trade-off.
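The selection rule above can be sketched as follows. This is a minimal illustration, not Lightning's actual code; `engineUsage` and `pickEnginesToFlush` are hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// engineUsage pairs a hypothetical engine ID with its estimated disk usage.
type engineUsage struct {
	id   string
	size uint64
}

// pickEnginesToFlush returns the largest engines whose flushing brings the
// remaining total below the soft quota, mirroring the "block the write to
// the largest engines until the remaining total is less than the quota" rule.
func pickEnginesToFlush(engines []engineUsage, quota uint64) []string {
	var total uint64
	for _, e := range engines {
		total += e.size
	}
	// Consider the biggest contributors first.
	sort.Slice(engines, func(i, j int) bool { return engines[i].size > engines[j].size })
	var toFlush []string
	for _, e := range engines {
		if total <= quota {
			break
		}
		toFlush = append(toFlush, e.id)
		total -= e.size
	}
	return toFlush
}

func main() {
	engines := []engineUsage{{"a", 700}, {"b", 300}, {"c", 200}}
	// total = 1200 > quota 600; flushing "a" leaves 500 ≤ 600.
	fmt.Println(pickEnginesToFlush(engines, 600)) // [a]
}
```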
Checkpoint validity
The flushing design must be compatible with checkpoints, that is, no data will be lost if we Ctrl+C → resume in the middle of the process. A checkpoint may be earlier than the actual progress, so some data (re)processing duplication should be acceptable and ignored.
Now let's consider the flush process:
1. ... parallel WriteRows ...
2. detected quota overflow, start emergency ingest to TiKV
3. CloseEngine()
    - Flush()
    - saveEngineMeta()
4. ImportEngine()
    - readAndSplitIntoRange()
    - loop:
        - SplitAndScatterRegionByRanges()
        - WriteAndIngestByRanges()
5. Reset engine to empty
6. ... parallel WriteRows ...
Let's consider what happens regarding the place of interruption (I) and actual saved checkpoint (C):
Case I=3, C<3
Currently, with Local backend, a checkpoint is flushed only when the entire engine is written because Flush() is expensive (https://github.com/pingcap/tidb-lightning/pull/326#issuecomment-638725946). So the end of step 3 is a good point to save the checkpoint.
If step 3's checkpoint is not recorded, we will restart from the beginning, while the engine contains some incomplete data. This makes us hit step 2 quicker, and some "future" data will be ingested. But this is still fine, since those duplicated KV pairs from the future are ignored.
Case I=4, C<4
If step 4 is actually completed, all data will have been copied to TiKV. So whether C=1 (restart from scratch) or C=3 (import again) should be fine in terms of data, just slower.
Case I=5, C<5
If step 5 is actually completed, the local data is cleaned up. Starting from C=1 should be fine. Starting from C=3 or C=4 will lead to importing an empty database, which is also fine because the data are already sent to TiKV.
Considering these, it should be fine to place a checkpoint immediately before flushing, importing and resetting the engine.
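The ordering argued for above can be sketched with stand-in callbacks. All names here (`engineOps`, `emergencyFlush`) are illustrative, not Lightning's API; the only point is that the checkpoint is saved immediately before the flush/import/reset sequence, so an interruption anywhere inside it resumes from a point no later than the actual progress:

```go
package main

import "fmt"

// engineOps bundles stand-ins for the real operations.
type engineOps struct {
	saveCheckpoint func()
	closeAndFlush  func()
	importEngine   func()
	resetEngine    func()
}

func emergencyFlush(ops engineOps) {
	ops.saveCheckpoint() // C recorded first: a later crash replays work, never loses it
	ops.closeAndFlush()  // step 3: CloseEngine + Flush + saveEngineMeta
	ops.importEngine()   // step 4: data fully copied to TiKV
	ops.resetEngine()    // step 5: local data cleaned up, engine UUID reused
}

func main() {
	var log []string
	rec := func(s string) func() { return func() { log = append(log, s) } }
	emergencyFlush(engineOps{rec("checkpoint"), rec("flush"), rec("import"), rec("reset")})
	fmt.Println(log) // [checkpoint flush import reset]
}
```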
Implementation
- Every engine provides a `StorageSize() uint64` method. The TiDB and Importer backends implement it by returning 0; the Local backend implements it by calculating the total occupied size.
- Periodically (how?), compute `StorageSize()` for every engine, and sort the results in ascending order. At the point where the total storage size exceeds the "quota", mark those engines for flushing.
    - The "period" depends on how expensive it is to compute `StorageSize()`.
- For every engine marked for flushing:
    - Acquire a write lock from the engine's "flush" RWMutex.
    - Do the flush + ingest + clean-up, writing the checkpoint in between.
    - If the engine is a data engine, perform a `Flush()` on the corresponding index engine too.
    - Release the write lock.
- For every deliveryLoop:
    - Before `WriteRows()`, try to acquire a read lock from the engine's "flush" RWMutex.
        - If the read lock is immediately acquired, do `WriteRows()` as usual, and continue.
        - Otherwise, do the actual (blocking) read lock acquisition.
    - After the read lock is acquired, immediately write the current file offset to the checkpoint.
    - Do `WriteRows()` as usual.
StorageSize() seems to be fast if we calculate the size of the full range.
Also, I suggest we maintain an approximate size, computed as the last calculated storage size + written bytes. "Written bytes" means the number of bytes we have written to the DB since the last storage size calculation. This way, we can avoid overshooting the quota accidentally.
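A minimal sketch of that bookkeeping, assuming the full recalculation runs from a single goroutine while writers only bump the counter (`approxSize` and its methods are hypothetical names; `atomic.Uint64` requires Go 1.19+):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// approxSize tracks "last calculated storage size + bytes written since",
// so the quota check can run cheaply between expensive StorageSize() scans.
type approxSize struct {
	base    uint64        // last full StorageSize() result (updated by one goroutine)
	written atomic.Uint64 // bytes written to the DB since that calculation
}

// Write records n more bytes written to the DB.
func (a *approxSize) Write(n uint64) { a.written.Add(n) }

// Estimate returns the cheap upper-bound estimate used for the quota check.
func (a *approxSize) Estimate() uint64 { return a.base + a.written.Load() }

// Recalibrate replaces the estimate after a full (expensive) calculation.
func (a *approxSize) Recalibrate(full uint64) {
	a.base = full
	a.written.Store(0)
}

func main() {
	var s approxSize
	s.Recalibrate(1 << 20) // last measured: 1 MiB on disk
	s.Write(4096)
	s.Write(8192)
	fmt.Println(s.Estimate()) // 1048576 + 12288 = 1060864
}
```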
> By this way, we can avoid overwriting accidentally.
could you elaborate how this works?