Support to specify a disk quota for intermediate files
Feature Request
Is your feature request related to a problem? Please describe:
Lightning needs a volume to save intermediate files, and it's hard to predict how large this disk must be, so we have to prepare as large a disk as possible to hold these temporary files. For example, if I need to restore 2 TB of data, we have to prepare a 2 TB volume for Lightning. This is a bad experience when running on the cloud.
Describe the feature you'd like:
We want to specify the volume size, so that the intermediate files will not exceed this size during the Lightning process.
Describe alternatives you've considered:
none
Teachability, Documentation, Adoption, Optimization:
by "check point file" you mean those "SST files" in the local backend?
> by "check point file" you mean those "SST files" in the local backend?
Yes. I used "intermediate files" instead.
Seems we can use https://pkg.go.dev/github.com/cockroachdb/pebble#DB.EstimateDiskUsage to fetch the disk usage.
Abstract
Periodically, before a WriteRows, check every engine's total estimated disk usage. If the total disk usage exceeds the "(soft) disk quota", we block writes to the largest engines until the remaining total is less than the quota, and flush the blocked engines' content into TiKV. The engine UUIDs are reused.
This will cause subsequent imports to suffer from range overlapping, which we have to accept as trade-off.
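The selection rule above can be sketched as follows. This is a minimal illustration, not Lightning's actual code; `engineUsage` and `pickEnginesToFlush` are hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// engineUsage pairs a hypothetical engine ID with its estimated disk usage.
type engineUsage struct {
	id   string
	size uint64
}

// pickEnginesToFlush returns the largest engines whose flushing brings the
// remaining total below the soft quota, mirroring the "block the write to
// the largest engines until the remaining total is less than the quota" rule.
func pickEnginesToFlush(engines []engineUsage, quota uint64) []string {
	var total uint64
	for _, e := range engines {
		total += e.size
	}
	// Consider the biggest contributors first.
	sort.Slice(engines, func(i, j int) bool { return engines[i].size > engines[j].size })
	var toFlush []string
	for _, e := range engines {
		if total <= quota {
			break
		}
		toFlush = append(toFlush, e.id)
		total -= e.size
	}
	return toFlush
}

func main() {
	engines := []engineUsage{{"a", 700}, {"b", 300}, {"c", 200}}
	// total = 1200 > quota 600; flushing "a" leaves 500 ≤ 600.
	fmt.Println(pickEnginesToFlush(engines, 600)) // [a]
}
```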
Checkpoint validity
The flushing design must be compatible with checkpoints, that is, no data will be lost if we Ctrl+C → resume in the middle of the process. A checkpoint may be earlier than the actual progress, so some data (re)processing duplication should be acceptable and ignored.
Now let's consider the flush process:
1. ... parallel WriteRows ...
2. detected quota overflow, start emergency ingest to TiKV
3. CloseEngine()
    - Flush()
    - saveEngineMeta()
4. ImportEngine()
    - readAndSplitIntoRange()
    - loop:
        - SplitAndScatterRegionByRanges()
        - WriteAndIngestByRanges()
5. Reset engine to empty
6. ... parallel WriteRows ...
Let's consider what happens regarding the place of interruption (I) and actual saved checkpoint (C):
Case I=3, C<3
Currently, with Local backend, a checkpoint is flushed only when the entire engine is written because Flush() is expensive (https://github.com/pingcap/tidb-lightning/pull/326#issuecomment-638725946). So the end of step 3 is a good point to save the checkpoint.
If step 3's checkpoint is not recorded, we will restart from the beginning, while the engine contains some incomplete data. This makes us hit step 2 quicker, and some "future" data will be ingested. But this is still fine, since those duplicated KV pairs from the future are ignored.
Case I=4, C<4
If step 4 is actually completed, all data will have been copied to TiKV. So whether C=1 (restart from scratch) or C=3 (import again) should be fine in terms of data, just slower.
Case I=5, C<5
If step 5 is actually completed, the local data is cleaned up. Starting from C=1 should be fine. Starting from C=3 or C=4 will lead to importing an empty database, which is also fine because the data are already sent to TiKV.
Considering these, it should be fine to place a checkpoint immediately before flushing, importing and resetting the engine.
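The ordering argued for above can be sketched with stand-in callbacks. All names here (`engineOps`, `emergencyFlush`) are illustrative, not Lightning's API; the only point is that the checkpoint is saved immediately before the flush/import/reset sequence, so an interruption anywhere inside it resumes from a point no later than the actual progress:

```go
package main

import "fmt"

// engineOps bundles stand-ins for the real operations.
type engineOps struct {
	saveCheckpoint func()
	closeAndFlush  func()
	importEngine   func()
	resetEngine    func()
}

func emergencyFlush(ops engineOps) {
	ops.saveCheckpoint() // C recorded first: a later crash replays work, never loses it
	ops.closeAndFlush()  // step 3: CloseEngine + Flush + saveEngineMeta
	ops.importEngine()   // step 4: data fully copied to TiKV
	ops.resetEngine()    // step 5: local data cleaned up, engine UUID reused
}

func main() {
	var log []string
	rec := func(s string) func() { return func() { log = append(log, s) } }
	emergencyFlush(engineOps{rec("checkpoint"), rec("flush"), rec("import"), rec("reset")})
	fmt.Println(log) // [checkpoint flush import reset]
}
```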
Implementation
- Every engine provides a `StorageSize() uint64` method. The TiDB and Importer backends implement it by returning 0; the Local backend implements it by calculating the total occupied size.
- Periodically (how?), compute `StorageSize()` for every engine, and sort the results in ascending order. At the point where the total storage size exceeds the "quota", mark those engines for flushing.
    - The "period" depends on how expensive it is to compute `StorageSize()`.
- For every engine marked for flushing:
    - Acquire a write lock from the engine's "flush" RWMutex.
    - Do the flush + ingest + clean-up, writing the checkpoint in between.
    - If the engine is a data engine, perform a `Flush()` on the corresponding index engine too.
    - Release the write lock.
- For every deliveryLoop:
    - Before `WriteRows()`, try to acquire a read lock from the engine's "flush" RWMutex.
        - If the read lock is immediately acquired, do `WriteRows()` as usual, and continue.
        - Otherwise, do the actual (blocking) read lock acquisition.
    - After the read lock is acquired, immediately write the current file offset to the checkpoint.
    - Do `WriteRows()` as usual.
StorageSize() seems to be fast if we calculate the size of the full range.
Also, I suggest we maintain an approximate size, computed as the last calculated storage size + written bytes. "Written bytes" means the number of bytes we have written to the DB since the last storage size calculation. This way, we can avoid overshooting the quota accidentally.
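A minimal sketch of that bookkeeping, assuming the full recalculation runs from a single goroutine while writers only bump the counter (`approxSize` and its methods are hypothetical names; `atomic.Uint64` requires Go 1.19+):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// approxSize tracks "last calculated storage size + bytes written since",
// so the quota check can run cheaply between expensive StorageSize() scans.
type approxSize struct {
	base    uint64        // last full StorageSize() result (updated by one goroutine)
	written atomic.Uint64 // bytes written to the DB since that calculation
}

// Write records n more bytes written to the DB.
func (a *approxSize) Write(n uint64) { a.written.Add(n) }

// Estimate returns the cheap upper-bound estimate used for the quota check.
func (a *approxSize) Estimate() uint64 { return a.base + a.written.Load() }

// Recalibrate replaces the estimate after a full (expensive) calculation.
func (a *approxSize) Recalibrate(full uint64) {
	a.base = full
	a.written.Store(0)
}

func main() {
	var s approxSize
	s.Recalibrate(1 << 20) // last measured: 1 MiB on disk
	s.Write(4096)
	s.Write(8192)
	fmt.Println(s.Estimate()) // 1048576 + 12288 = 1060864
}
```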
> By this way, we can avoid overwriting accidentally.
could you elaborate how this works?