Enhancement: Save/reuse clone chunk dictionaries to speed up cloning & support transmission resume

Open max-sistemich-kisters opened this issue 4 months ago • 1 comments

Hi Olle, I have been testing bita on a project I'm working on, and it is a very promising tool. Unfortunately, two features/improvements that I need for my use case are missing:

1. The dictionary creation is by far the slowest part of the cloning process for me. That part alone takes almost 3 minutes, while the actual transmission is much faster (5-10 seconds). If we could avoid repeating this work on every clone, it would translate to multiple days of extra battery life on my project (assuming normal operation). While looking into this, I saw that you had already planned for a mechanism to tackle this at some point (see https://github.com/oll3/bita/issues/3). Could you tell me whether there was a specific reason to not implement this?
2. If the download fails due to a bad connection, I think it would be very helpful to be able to resume the transmission. My suggestion would be to save the dictionary of the downloaded chunks during cloning. So something like:
```
$ bita clone \
  --seed /dev/mmcblk0p1 \
  --seed-dictionary /path/to/archive/of/dev/mmcblk0p1 \
  http://file.to.clone \
  <output path or stream> \
  --save-dictionary /path/for/cloned/dictionary
```
Now if the download succeeds, we can use the saved dictionary as the seed dictionary for a future clone. However, if it fails, we could continue the transmission by either having a clone resume option or the option to clone with multiple seeds. So something like:
```
$ bita clone \
  --seed /dev/mmcblk0p1 \
  --seed-dictionary /path/to/archive/of/dev/mmcblk0p1 \
  --seed /path/to/failed/update/or/other/file/that/is/expected/to/share/data/with/server/file \
  --seed-dictionary /path/to/failed/dictionary \
  http://file.to.clone \
  <output path or stream> \
  --save-dictionary /path/for/cloned/dictionary
```
Of course, both of these features would require the user to make sure that the seeds are not modified after dictionary creation.

Aug 07 '25 10:08 max-sistemich-kisters

Hi @max-sistemich-kisters ,

> 1. ... Could you tell me whether there was a specific reason to not implement this?

My reasoning for closing #3 was probably that I realized a chunk index cache adds quite a bit of complexity to the tool if it is to be reliable.

There is a risk that the cache and the actual data mismatch, either because the user points to the wrong cache or because the seed has been modified. Hence bita would need to validate the checksums of the chunks it clones in a reliable way and have some fallback for what to do on a mismatch. Validating checksums also takes time; probably less than the chunk boundary scan, but still.
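To make the validation step concrete, here is a minimal sketch in Python rather than bita's Rust, under stated assumptions: `CachedChunk` is a hypothetical cache entry layout (not bita's dictionary format), and BLAKE2b stands in for whatever hash the archive actually uses. Each cached entry is re-read from the seed and re-hashed before it is trusted; `None` signals that the caller has to fall back.

```python
import hashlib
from typing import BinaryIO, NamedTuple, Optional


class CachedChunk(NamedTuple):
    """One entry of a hypothetical saved chunk index (not bita's format)."""
    offset: int    # byte offset of the chunk inside the seed
    size: int      # chunk length in bytes
    digest: bytes  # strong hash recorded when the index was built


def chunk_digest(data: bytes) -> bytes:
    # BLAKE2b is an assumption here; use the hash the archive records.
    return hashlib.blake2b(data, digest_size=32).digest()


def read_verified(seed: BinaryIO, entry: CachedChunk) -> Optional[bytes]:
    """Return the chunk bytes if the seed still matches the cache, else None."""
    seed.seek(entry.offset)
    data = seed.read(entry.size)
    if len(data) != entry.size or chunk_digest(data) != entry.digest:
        return None  # stale or wrong cache entry: caller must fall back
    return data
```

On `None`, a caller could fetch that chunk from the remote archive instead, or discard the cache and rescan the seed. The cost is one read and one hash per reused chunk, which matches the point above that validation is not free.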

I guess, if the user is willing to take the risk, chunk validation could be skipped via some CLI flag, with validation done after the clone instead. But I don't think I would like this to be the default behavior of such a feature.
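A rough sketch of what such post validation could look like, again in Python and with assumptions: `output_matches` is a hypothetical helper, `expected_digest` is assumed to be a full-output checksum known in advance (for example, one recorded when the archive was created), and BLAKE2b with its default 64-byte digest is a stand-in for the real hash.

```python
import hashlib
from typing import BinaryIO


def output_matches(output: BinaryIO, expected_digest: bytes,
                   block_size: int = 1 << 20) -> bool:
    """Hash the whole cloned output and compare it to the expected checksum."""
    hasher = hashlib.blake2b()  # assumption: default 64-byte BLAKE2b digest
    output.seek(0)
    while True:
        block = output.read(block_size)
        if not block:
            break
        hasher.update(block)
    return hasher.digest() == expected_digest
```

The trade-off is that a mismatch is only detected once the whole output has been written, and it does not say which chunk was wrong, so the fallback is likely a full re-clone without the cache.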

I'm not completely opposed to the cache idea, but I think I would rather see the library enable users to build functionality like this in their own tools/services.

> 2. If the download fails due to a bad connection ...

Yeah, I think this is almost the same answer as above... That caching stuff adds a bunch of complexity. Also there are already some parameters which can be passed to bita to do retries on connection failures.

In general, I think case-specific functionality like this is better kept outside of the CLI tool, also because my time spent on bita is limited and hence I'd like to keep it to its core functionality for maintainability.

At work we use bita to create archives of filesystem images, and then we have built a small custom service (using bitar) which runs on, and updates, the software of our devices. This service also does some extra work to retry and resume on failure.
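For readers wondering what resume logic in such an external service might look like, here is one possible sketch in Python. It is not the service described above and does not use the bitar API: `fetch_and_write` is a hypothetical callback that downloads one chunk and writes it to the output, and the state file layout is invented for illustration. It assumes chunks are fetched in order, so a single "next chunk index" is enough to resume.

```python
import os
from typing import Callable


def load_progress(state_path: str) -> int:
    """Return the index of the first chunk that still needs to be fetched."""
    try:
        with open(state_path) as state:
            return int(state.read())
    except (FileNotFoundError, ValueError):
        return 0  # no usable state: start from the beginning


def save_progress(state_path: str, next_index: int) -> None:
    # Write to a temp file and rename it, so a crash never leaves a torn state file.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as state:
        state.write(str(next_index))
        state.flush()
        os.fsync(state.fileno())
    os.replace(tmp_path, state_path)


def resume_clone(total_chunks: int,
                 fetch_and_write: Callable[[int], None],
                 state_path: str) -> int:
    """Fetch chunks in order, skipping those a previous attempt finished.

    Returns how many chunks this attempt fetched.
    """
    start = load_progress(state_path)
    for index in range(start, total_chunks):
        fetch_and_write(index)  # hypothetical; may raise on a connection failure
        save_progress(state_path, index + 1)
    return max(0, total_chunks - start)
```

If `fetch_and_write` raises, the state file still points at the failed chunk, so calling `resume_clone` again continues from there. For this to be safe, the callback has to make the chunk durable in the output before returning, and the output must not be modified between attempts.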

Aug 08 '25 09:08 oll3