
The `copy` command may copy duplicate data

Open 9-2-1 opened this issue 1 year ago • 6 comments

The copy command may copy duplicate blobs.

How to reproduce:

  1. Prepare a file that is large enough, for example, ffmpeg.exe.
  2. Copy ffmpeg.exe to ffmpeg2.exe and ffmpeg3.exe.
  3. Create a config file like this:
[repository]
repository="source"
password="123456"

[[copy.targets]]
repository="target"
password="789012"
  4. Save the config file as config.toml.
  5. Create some snapshots:
rustic -P config backup ffmpeg.exe
rustic -P config backup ffmpeg2.exe
rustic -P config backup ffmpeg3.exe
  6. Copy source to target: rustic -P config copy
  7. Compare the sizes of the two repositories; the target repository will be larger.
  8. Prune the target repository; it will report that there is unused data.

It seems that when copying data blobs in src/command/copy.rs, the index is not updated so duplicate data blobs are copied.

I noticed this when copying a 300GB repo onto a 512GB hard disk and unexpectedly running out of disk space.

9-2-1 avatar Feb 01 '24 05:02 9-2-1

Thanks a lot @9-2-1 for opening this issue!

I tried to reproduce using

rustic backup <LARGE_FILE> --init
rustic backup <LARGE_FILE> --init
rustic backup <LARGE_FILE> --init
rustic copy --init

and I am getting the following results:

  • for the src repo:
rustic repoinfo
| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |     3 |    1.5 kiB |
| Index     |     1 |    1.1 kiB |
| Pack      |     2 |    7.4 MiB |
| Total     |     7 |    7.4 MiB |


| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree      |     1 |    1.4 kiB |               877 B |
| Data      |    17 |   20.0 MiB |             7.4 MiB |
| Total     |    18 |   20.0 MiB |             7.4 MiB |

| Blob type  | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs |          1 |        954 B |        954 B |
| Data packs |          1 |      7.4 MiB |      7.4 MiB |
  • for the target repo:
| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |     3 |    1.6 kiB |
| Index     |     1 |    1.1 kiB |
| Pack      |     2 |    7.4 MiB |
| Total     |     7 |    7.4 MiB |


| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree      |     3 |    4.3 kiB |             2.6 kiB |
| Data      |    17 |   20.0 MiB |             7.4 MiB |
| Total     |    20 |   20.0 MiB |             7.4 MiB |

| Blob type  | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs |          1 |      2.7 kiB |      2.7 kiB |
| Data packs |          1 |      7.4 MiB |      7.4 MiB |

So, tree blobs are duplicated (and would be removed by prune).

Can you confirm this (or just give the output of rustic repoinfo for your two repositories)?

aawsome avatar Feb 01 '24 08:02 aawsome

Ok, I think I found the reason:

  • The copying is parallelized
  • we check whether a blob already exists (in the repository / the already written packs / the in-flight packs) before we start compressing and encrypting the blob data
  • so multiple identical blobs that are processed at the same time may all pass the check as "not yet existing" and all get processed.

This is actually a race condition. It cannot lead to data loss, but unfortunately it can lead to redundant data in the repository.

And it is non-deterministic, so in my case I encountered duplicate tree blobs, while you may have encountered duplicate data blobs...
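
The check-then-act pattern described above can be reproduced in a few lines. The following is only a minimal sketch, not the actual rustic_core code: `BlobId`, `copy_check_then_act`, and `copy_atomic_claim` are hypothetical names, and a `Barrier` is used purely to force the unlucky interleaving so that the demo is deterministic.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Hypothetical blob id for this sketch; real ids are content hashes.
type BlobId = u64;

/// Racy variant: the existence check and the registration are separate steps.
/// Returns how many workers ended up "packing" the same blob.
fn copy_check_then_act(workers: usize, id: BlobId) -> usize {
    let known = Arc::new(Mutex::new(HashSet::<BlobId>::new()));
    let barrier = Arc::new(Barrier::new(workers));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let known = Arc::clone(&known);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Step 1: existence check (the lock is released right after).
                let exists = known.lock().unwrap().contains(&id);
                // Demo only: make every worker finish its check before
                // any worker registers the blob.
                barrier.wait();
                if exists {
                    return false;
                }
                // Step 2: "compress, encrypt, write", then register the blob.
                known.lock().unwrap().insert(id);
                true
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&packed| packed)
        .count()
}

/// Variant where check and registration are one atomic step:
/// `HashSet::insert` returns `true` only for the first caller.
fn copy_atomic_claim(workers: usize, id: BlobId) -> usize {
    let known = Arc::new(Mutex::new(HashSet::<BlobId>::new()));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let known = Arc::clone(&known);
            thread::spawn(move || {
                let first = known.lock().unwrap().insert(id);
                first
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&packed| packed)
        .count()
}
```

With four workers, `copy_check_then_act(4, 42)` reports that the blob was packed four times, while `copy_atomic_claim(4, 42)` reports one. Claiming the id before doing the expensive work is one possible way to close the window; whether it fits the real packing pipeline is a separate question.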

aawsome avatar Feb 01 '24 08:02 aawsome

#148 does not fully solve the issue but makes it much less likely.

To fully solve it we need a big refactor of how blobs are packed, most likely combined with better, user-controllable parallelism in each step of the packing pipeline. This is something we can't do right now.

aawsome avatar Feb 01 '24 09:02 aawsome

> So, tree blobs are duplicated (and would be removed by prune).
>
> Can you confirm this (or just give the output of rustic repoinfo for your two repositories)?

Data blobs are duplicated too. The trick is to back up identical files under different names (perhaps the different inode IDs matter?), like this:

rustic backup "<LARGE_FILE1>" --init
rustic backup "<LARGE_FILE2>" --init
rustic backup "<LARGE_FILE3>" --init # these files are identical
rustic copy --init

Here is the output of my test. 35.8 MiB of the data is duplicated, across both tree packs and data packs.

===== rustic -P test repoinfo

| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |    12 |    5.9 kiB |
| Index     |     3 |    5.6 kiB |
| Pack      |     5 |   51.5 MiB |
| Total     |    21 |   51.5 MiB |


| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree      |     3 |   20.3 kiB |            10.9 kiB |
| Data      |   100 |  125.6 MiB |            51.5 MiB |
| Total     |   103 |  125.6 MiB |            51.5 MiB |

| Blob type  | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs |          3 |      3.7 kiB |      3.7 kiB |
| Data packs |          2 |     19.2 MiB |     32.3 MiB |

===== rustic -r test2 repoinfo

| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |    12 |    6.0 kiB |
| Index     |     1 |    6.0 kiB |
| Pack      |     4 |   87.3 MiB |
| Total     |    18 |   87.3 MiB |


| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree      |    12 |   81.0 kiB |            43.6 kiB |
| Data      |   176 |  208.1 MiB |            87.2 MiB |
| Total     |   188 |  208.2 MiB |            87.3 MiB |

| Blob type  | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs |          1 |     44.2 kiB |     44.2 kiB |
| Data packs |          3 |     22.5 MiB |     32.5 MiB |

===== rustic -r test2 prune --keep-delete 30d --max-unused 0

to repack:          4 packs,        188 blobs,   87.3 MiB
this removes:                        85 blobs,   35.8 MiB
to delete:          0 packs,          0 blobs,        0 B
unindexed:          0 packs,         ?? blobs,        0 B
total prune:                         85 blobs,   35.8 MiB
remaining:                          103 blobs,   51.5 MiB
unused size after prune:        0 B (0.00% of remaining size)

packs marked for deletion:          0,        0 B
 - complete deletion:               0,        0 B
 - keep marked:                     0,        0 B
 - recover:                         0,        0 B

===== rustic -r test2 repoinfo

| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |    12 |    6.0 kiB |
| Index     |     1 |    7.3 kiB |
| Pack      |     7 |  138.8 MiB |
| Total     |    21 |  138.8 MiB |


| Blob type      | Count | Total Size | Total Size in Packs |
|----------------|-------|------------|---------------------|
| Tree           |     3 |   20.3 kiB |            10.9 kiB |
| Data           |   100 |  125.6 MiB |            51.5 MiB |
| Tree to delete |    12 |   81.0 kiB |            43.6 kiB |
| Data to delete |   176 |  208.1 MiB |            87.2 MiB |
| Total          |   291 |  333.8 MiB |           138.8 MiB |

| Blob type            | Pack Count | Minimum Size | Maximum Size |
|----------------------|------------|--------------|--------------|
| Tree packs           |          1 |     11.1 kiB |     11.1 kiB |
| Data packs           |          2 |     19.0 MiB |     32.5 MiB |
| Tree packs to delete |          1 |     44.2 kiB |     44.2 kiB |
| Data packs to delete |          3 |     22.5 MiB |     32.5 MiB |
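
As a sanity check on the numbers above: subtracting what prune removes from the copied repository should give exactly the source repository's totals if everything removed was pure duplication. A small sketch of that arithmetic follows; the `Totals` type and helper names are made up for illustration, and sizes are kept in tenths of a MiB so the integer arithmetic is exact.

```rust
/// Blob count and size (in tenths of a MiB) as reported by repoinfo/prune.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Totals {
    blobs: u32,
    tenth_mib: u32,
}

/// What remains after removing `removed` from `total`.
fn minus(total: Totals, removed: Totals) -> Totals {
    Totals {
        blobs: total.blobs - removed.blobs,
        tenth_mib: total.tenth_mib - removed.tenth_mib,
    }
}

/// Extra space used by the copy relative to the source, in whole percent
/// (rounded down).
fn overhead_percent(copied: Totals, source: Totals) -> u32 {
    (copied.tenth_mib - source.tenth_mib) * 100 / source.tenth_mib
}
```

Plugging in the reported values (source: 103 blobs, 51.5 MiB; copy: 188 blobs, 87.3 MiB; removed by prune: 85 blobs, 35.8 MiB), `minus(copied, removed)` equals the source totals, and the copy was about 69% larger than it needed to be.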

9-2-1 avatar Feb 01 '24 10:02 9-2-1

> To fully solve it we need to do a big refactor of how packaging of blobs works - most likely in combination with a better and user-controllable control of the used parallelism in each step of the packing pipeline. This is something we can't do now..

I have an idea:
First, load the indexes from the target repository, then scan the snapshots and tree blobs in parallel to determine the list of blobs that need to be copied. Next, merge these lists (removing duplicate blob IDs, and calculating the total size to give a better progress bar). Finally, copy the blobs in the merged list in parallel. (Tree blobs will be read twice.)

I know this code is difficult to write and test; maybe I can try to create a PR if I have some time...
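
The planning phase of this idea could look roughly like the sketch below. It is sequential for brevity and uses made-up types: `BlobId`, `plan_copy`, and `planned_size` are hypothetical, and each snapshot is modelled as a plain list of `(blob id, size)` pairs rather than a real tree walk.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical blob id for this sketch; real ids are content hashes.
type BlobId = u64;

/// Collect every blob referenced by the snapshots that is missing from the
/// target index. Each id appears once in the plan, however many snapshots
/// reference it.
fn plan_copy(
    snapshots: &[Vec<(BlobId, u64)>],
    target_index: &HashSet<BlobId>,
) -> BTreeMap<BlobId, u64> {
    let mut plan = BTreeMap::new();
    for snapshot in snapshots {
        for &(id, size) in snapshot {
            if !target_index.contains(&id) {
                // Keep the first size seen; later duplicates are ignored.
                plan.entry(id).or_insert(size);
            }
        }
    }
    plan
}

/// Total number of bytes the copy will transfer, e.g. for a progress bar.
fn planned_size(plan: &BTreeMap<BlobId, u64>) -> u64 {
    plan.values().sum()
}
```

Because the plan is a map keyed by blob ID, the copy phase can hand each entry to a worker without any further existence checks, which is the point of separating planning from copying.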

9-2-1 avatar Feb 01 '24 10:02 9-2-1

@9-2-1 Can you check whether your problem still exists using the latest nightly builds?

aawsome avatar Feb 03 '24 20:02 aawsome

Sorry for my late response. I tested with rustic v0.7.0 and the problem appears to be mostly solved. Thank you!

9-2-1 avatar Aug 06 '24 11:08 9-2-1

... although there is still some chance of copying a few MB of duplicate data.

> rustic -r test2 repoinfo

| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |     3 |    1.5 kiB |
| Index     |     1 |    1.6 kiB |
| Pack      |     3 |   51.9 MiB |
| Total     |     8 |   51.9 MiB |


| Blob type | Count | Total Size | Total Size in Packs |
|-----------|-------|------------|---------------------|
| Tree      |     3 |    5.1 kiB |             3.0 kiB |
| Data      |    28 |   51.9 MiB |            51.9 MiB |
| Total     |    31 |   51.9 MiB |            51.9 MiB |

| Blob type  | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs |          1 |      3.1 kiB |      3.1 kiB |
| Data packs |          2 |     14.9 MiB |     37.0 MiB |

> rustic -r test2 prune --keep-delete 30d --max-unused 0

to repack:          1 packs,          7 blobs,   14.9 MiB
this removes:                         5 blobs,   14.2 MiB
to delete:          0 packs,          0 blobs,        0 B
unindexed:          0 packs,         ?? blobs,        0 B
total prune:                          5 blobs,   14.2 MiB
remaining:                           26 blobs,   37.8 MiB
unused size after prune:        0 B (0.00% of remaining size)

packs marked for deletion:          0,        0 B
 - complete deletion:               0,        0 B
 - keep marked:                     0,        0 B
 - recover:                         0,        0 B

> rustic -r test2 repoinfo

| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      363 B |
| Snapshot  |     3 |    1.5 kiB |
| Index     |     1 |    1.7 kiB |
| Pack      |     4 |   52.7 MiB |
| Total     |     9 |   52.7 MiB |


| Blob type      | Count | Total Size | Total Size in Packs |
|----------------|-------|------------|---------------------|
| Tree           |     3 |    5.1 kiB |             3.0 kiB |
| Data           |    23 |   37.8 MiB |            37.8 MiB |
| Data to delete |     7 |   14.9 MiB |            14.9 MiB |
| Total          |    33 |   52.7 MiB |            52.7 MiB |

| Blob type            | Pack Count | Minimum Size | Maximum Size |
|----------------------|------------|--------------|--------------|
| Tree packs           |          1 |      3.1 kiB |      3.1 kiB |
| Data packs           |          2 |    791.6 kiB |     37.0 MiB |
| Data packs to delete |          1 |     14.9 MiB |     14.9 MiB |

9-2-1 avatar Aug 06 '24 11:08 9-2-1