rustic_core
The `copy` command may copy duplicate data
The `copy` command may copy duplicate blobs.
How to reproduce:
- Prepare a file that is large enough, for example `ffmpeg.exe`.
- Copy `ffmpeg.exe` to `ffmpeg2.exe` and `ffmpeg3.exe`.
- Create a config file like this:
[repository]
repository="source"
password="123456"
[[copy.targets]]
repository="target"
password="789012"
- Save the config file as `config.toml`.
- Create some snapshots:
rustic -P config backup ffmpeg.exe
rustic -P config backup ffmpeg2.exe
rustic -P config backup ffmpeg3.exe
- Copy source to target:
rustic -P config copy
- Compare the sizes of the two repositories; the target repository will be larger.
- Prune the target repository; it will report that there is unused data.
It seems that when copying data blobs in src/command/copy.rs, the index is not updated, so duplicate data blobs are copied.
I noticed this when copying a 300 GB repo onto a 512 GB hard disk and unexpectedly ran out of disk space.
Thanks a lot @9-2-1 for opening this issue!
I tried to reproduce using
rustic backup <LARGE_FILE> --init
rustic backup <LARGE_FILE> --init
rustic backup <LARGE_FILE> --init
rustic copy --init
and I am getting the following results:
- for the src repo:
rustic repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 3 | 1.5 kiB |
| Index | 1 | 1.1 kiB |
| Pack | 2 | 7.4 MiB |
| Total | 7 | 7.4 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree | 1 | 1.4 kiB | 877 B |
| Data | 17 | 20.0 MiB | 7.4 MiB |
| Total | 18 | 20.0 MiB | 7.4 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs | 1 | 954 B | 954 B |
| Data packs | 1 | 7.4 MiB | 7.4 MiB |
- for the target repo:
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 3 | 1.6 kiB |
| Index | 1 | 1.1 kiB |
| Pack | 2 | 7.4 MiB |
| Total | 7 | 7.4 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree | 3 | 4.3 kiB | 2.6 kiB |
| Data | 17 | 20.0 MiB | 7.4 MiB |
| Total | 20 | 20.0 MiB | 7.4 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs | 1 | 2.7 kiB | 2.7 kiB |
| Data packs | 1 | 7.4 MiB | 7.4 MiB |
So, tree blobs are duplicated (and would be removed by prune).
Can you confirm this (or just give the output of rustic repoinfo for your two repositories)?
Ok, I think I found the reason:
- Copying is parallelized.
- We check whether a blob already exists (in the repository, the already written packs, or the in-flight packs) before we start compressing and encrypting the blob data.
- So multiple identical blobs that are processed at the same time may all get the result "does not exist yet" and all get processed.
This is a race condition. It cannot lead to data loss, but unfortunately it can put too much data into the repository.
It is also non-deterministic, so in my case I saw duplicate tree blobs, while you may have seen duplicate data blobs...
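To make the pattern concrete, here is a minimal sketch of one way to close such a check-then-act gap. It is not rustic's packer code: `BlobId`, `pack_with_claim`, and the `u64` ids are hypothetical stand-ins for illustration. Each worker claims a blob id before doing the expensive work, so the existence check and the reservation happen in one step under a single lock.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical blob id; a plain u64 stands in for a real content hash.
type BlobId = u64;

/// Processes `blobs` on up to `workers` threads and returns how many blobs
/// were actually packed. This is an illustration, not rustic's implementation.
fn pack_with_claim(blobs: Vec<BlobId>, workers: usize) -> usize {
    let claimed: Arc<Mutex<HashSet<BlobId>>> = Arc::new(Mutex::new(HashSet::new()));
    let packed: Arc<Mutex<Vec<BlobId>>> = Arc::new(Mutex::new(Vec::new()));
    let workers = workers.max(1);
    let chunk_len = ((blobs.len() + workers - 1) / workers).max(1);

    let handles: Vec<thread::JoinHandle<()>> = blobs
        .chunks(chunk_len)
        .map(|part| {
            let part = part.to_vec();
            let claimed = Arc::clone(&claimed);
            let packed = Arc::clone(&packed);
            thread::spawn(move || {
                for id in part {
                    // The racy variant would be: `contains(&id)`, then compress,
                    // then `insert(id)`. Two threads can both pass the check.
                    // `insert` returns false if the id is already present, so
                    // checking and claiming are a single atomic step here.
                    let first_claim = claimed.lock().unwrap().insert(id);
                    if first_claim {
                        // Expensive compress + encrypt would run here, with the
                        // `claimed` lock already released.
                        packed.lock().unwrap().push(id);
                    }
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    // Bind the length first so the lock guard is dropped before returning.
    let count = packed.lock().unwrap().len();
    count
}
```

With this shape, calling `pack_with_claim` on a list that contains the same id many times packs that id once, no matter how the threads interleave. The trade-off is that a claimed blob whose processing later fails would need to be un-claimed, which this sketch does not handle.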
#148 does not fully solve the issue but makes it much more unlikely.
To fully solve it we need a big refactor of how blobs are packed, most likely combined with better, user-controllable parallelism in each step of the packing pipeline. This is something we can't do right now.
> So, tree blobs are duplicated (and would be removed by prune).
> Can you confirm this (or just give the output of rustic repoinfo for your two repositories)?
Data blobs are duplicated too. The trick is to back up the same file under different names (maybe with different inode IDs?), like this:
rustic backup "<LARGE_FILE1>" --init
rustic backup "<LARGE_FILE2>" --init
rustic backup "<LARGE_FILE3>" --init # these files are identical
rustic copy --init
Here is the output of my test. 35.8 MiB of data is duplicated, across both tree packs and data packs.
===== rustic -P test repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 12 | 5.9 kiB |
| Index | 3 | 5.6 kiB |
| Pack | 5 | 51.5 MiB |
| Total | 21 | 51.5 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree | 3 | 20.3 kiB | 10.9 kiB |
| Data | 100 | 125.6 MiB | 51.5 MiB |
| Total | 103 | 125.6 MiB | 51.5 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs | 3 | 3.7 kiB | 3.7 kiB |
| Data packs | 2 | 19.2 MiB | 32.3 MiB |
===== rustic -r test2 repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 12 | 6.0 kiB |
| Index | 1 | 6.0 kiB |
| Pack | 4 | 87.3 MiB |
| Total | 18 | 87.3 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree | 12 | 81.0 kiB | 43.6 kiB |
| Data | 176 | 208.1 MiB | 87.2 MiB |
| Total | 188 | 208.2 MiB | 87.3 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs | 1 | 44.2 kiB | 44.2 kiB |
| Data packs | 3 | 22.5 MiB | 32.5 MiB |
===== rustic -r test2 prune --keep-delete 30d --max-unused 0
to repack: 4 packs, 188 blobs, 87.3 MiB
this removes: 85 blobs, 35.8 MiB
to delete: 0 packs, 0 blobs, 0 B
unindexed: 0 packs, ?? blobs, 0 B
total prune: 85 blobs, 35.8 MiB
remaining: 103 blobs, 51.5 MiB
unused size after prune: 0 B (0.00% of remaining size)
packs marked for deletion: 0, 0 B
- complete deletion: 0, 0 B
- keep marked: 0, 0 B
- recover: 0, 0 B
===== rustic -r test2 repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 12 | 6.0 kiB |
| Index | 1 | 7.3 kiB |
| Pack | 7 | 138.8 MiB |
| Total | 21 | 138.8 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|----------------|-------|------------|---------------------|
| Tree | 3 | 20.3 kiB | 10.9 kiB |
| Data | 100 | 125.6 MiB | 51.5 MiB |
| Tree to delete | 12 | 81.0 kiB | 43.6 kiB |
| Data to delete | 176 | 208.1 MiB | 87.2 MiB |
| Total | 291 | 333.8 MiB | 138.8 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|----------------------|------------|--------------|--------------|
| Tree packs | 1 | 11.1 kiB | 11.1 kiB |
| Data packs | 2 | 19.0 MiB | 32.5 MiB |
| Tree packs to delete | 1 | 44.2 kiB | 44.2 kiB |
| Data packs to delete | 3 | 22.5 MiB | 32.5 MiB |
> To fully solve it we need a big refactor of how blobs are packed, most likely combined with better, user-controllable parallelism in each step of the packing pipeline. This is something we can't do right now.
I have an idea:
First, load the indexes from the target repository, then scan the snapshots and tree blobs in parallel to determine the list of blobs to be copied.
Then merge these lists (remove duplicate blob IDs and calculate the total size to give a better progress bar), and copy the blobs in the merged list in parallel. (Tree blobs will be read twice.)
I know this code is difficult to write and test; maybe I can try to create a PR if I have some time...
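The merge step of this idea could look roughly like the sketch below. It is only an illustration under assumptions: `BlobId`, `CopyPlan`, and `plan_copy` are hypothetical names, blob ids are plain `u64` values, and the per-snapshot lists are assumed to have been collected already. The parallel scan and the parallel copy themselves are not shown.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical blob id; a plain u64 stands in for a real content hash.
type BlobId = u64;

/// The deduplicated work list plus its total size (useful for a progress bar).
struct CopyPlan {
    blobs: Vec<(BlobId, u64)>,
    total_size: u64,
}

/// Merges the per-snapshot lists of `(id, size)` pairs into one plan,
/// dropping ids already present in the target index and ids seen before.
fn plan_copy(per_snapshot: &[Vec<(BlobId, u64)>], target_index: &HashSet<BlobId>) -> CopyPlan {
    // A BTreeMap keeps one entry per id and gives a stable, sorted order.
    let mut merged: BTreeMap<BlobId, u64> = BTreeMap::new();
    for list in per_snapshot {
        for &(id, size) in list {
            if !target_index.contains(&id) {
                merged.insert(id, size);
            }
        }
    }
    let total_size: u64 = merged.values().sum();
    CopyPlan {
        blobs: merged.into_iter().collect(),
        total_size,
    }
}
```

Because every id appears in the plan at most once, the later copy phase can hand out entries to workers without any of them needing to re-check whether a blob exists.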
@9-2-1 Can you check whether your problem still exists with the latest nightly builds?
Sorry for my late response. I tested with rustic v0.7.0 and the problem appears to be solved already. Thank you!
... although there is still some chance of copying a few duplicate MBs.
> rustic -r test2 repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 3 | 1.5 kiB |
| Index | 1 | 1.6 kiB |
| Pack | 3 | 51.9 MiB |
| Total | 8 | 51.9 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree | 3 | 5.1 kiB | 3.0 kiB |
| Data | 28 | 51.9 MiB | 51.9 MiB |
| Total | 31 | 51.9 MiB | 51.9 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs | 1 | 3.1 kiB | 3.1 kiB |
| Data packs | 2 | 14.9 MiB | 37.0 MiB |
> rustic -r test2 prune --keep-delete 30d --max-unused 0
to repack: 1 packs, 7 blobs, 14.9 MiB
this removes: 5 blobs, 14.2 MiB
to delete: 0 packs, 0 blobs, 0 B
unindexed: 0 packs, ?? blobs, 0 B
total prune: 5 blobs, 14.2 MiB
remaining: 26 blobs, 37.8 MiB
unused size after prune: 0 B (0.00% of remaining size)
packs marked for deletion: 0, 0 B
- complete deletion: 0, 0 B
- keep marked: 0, 0 B
- recover: 0, 0 B
> rustic -r test2 repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key | 1 | 363 B |
| Snapshot | 3 | 1.5 kiB |
| Index | 1 | 1.7 kiB |
| Pack | 4 | 52.7 MiB |
| Total | 9 | 52.7 MiB |
| Blob type | Count | Total Size | Total Size in Packs |
|----------------|-------|------------|---------------------|
| Tree | 3 | 5.1 kiB | 3.0 kiB |
| Data | 23 | 37.8 MiB | 37.8 MiB |
| Data to delete | 7 | 14.9 MiB | 14.9 MiB |
| Total | 33 | 52.7 MiB | 52.7 MiB |
| Blob type | Pack Count | Minimum Size | Maximum Size |
|----------------------|------------|--------------|--------------|
| Tree packs | 1 | 3.1 kiB | 3.1 kiB |
| Data packs | 2 | 791.6 kiB | 37.0 MiB |
| Data packs to delete | 1 | 14.9 MiB | 14.9 MiB |