
[Feature Request] Add zstd compression

Open tiziodcaio opened this issue 3 years ago • 48 comments

Is it possible to add zstd or, more generally, more compression algorithms?

tiziodcaio avatar Dec 19 '21 19:12 tiziodcaio

This should be possible, but the "why" question is what would interest me.

machiav3lli avatar Dec 20 '21 00:12 machiav3lli

Haha, I asked in case it's not too complicated. I'm not an expert, but gzip doesn't actually have the best compression ratio, and it could be cool to use others. If you're not interested, no problem; it was only a little idea 😅

tiziodcaio avatar Dec 20 '21 06:12 tiziodcaio

Is compression/decompression speed a bottleneck for the app? If so, will it benefit from using zstd?

opusforlife2 avatar Jan 01 '22 05:01 opusforlife2

The point is that it's not really a problem; it was a capricious feature request.

tiziodcaio avatar Jan 01 '22 11:01 tiziodcaio

If you think it's only a dumb idea, no problem saying so. I will close the issue without hesitation ;-)

tiziodcaio avatar Jan 03 '22 11:01 tiziodcaio

Most users will prefer fast compression (e.g. gzip level 1) over an optimal but slow compression ratio. Though, this could be interesting for speed, if the phone can compress faster than the reduced bytes need to crawl to a slow storage (e.g. if using an rclone mount or a slow sdcard, and also for syncing). For a rough feel: if the storage only writes 10 MB/s and compression halves the data, it's a net win as long as the compressor can consume input at more than 20 MB/s. However, the improved ratio is usually not that big...

I would say it depends on the available libraries; adding them might not be too complicated if they are similar in usage to the gzip lib.

hg42 avatar Feb 11 '22 15:02 hg42

Looking forward to this feature! zstd is really fast and has a better compression ratio, so it's totally worth it to replace gzip. Also, lz4 is very, very fast at decompression, which could be a nice alternative. lz4 can be about 7x faster than gzip for decompression, I think.

Leojc avatar Jun 01 '22 01:06 Leojc

I've found this app that implements lzo and zstd compression for backups: https://github.com/XayahSuSuSu/Android-DataBackup

tiziodcaio avatar Oct 07 '22 13:10 tiziodcaio

This comment from Nils is very interesting. It seems that adding support for more compression algorithms could be much more straightforward since it would use a dependency that's already in the app.

tsiflimagas avatar Feb 01 '23 09:02 tsiflimagas

I've found this app that implements lzo and zstd compression for backups: https://github.com/XayahSuSuSu/Android-DataBackup

I didn't look into the source, but DataBackup uses a script from someone else. I guess the compression is done via command line while NB uses a library.

Currently you can still switch between completely API/library-based archiving (including tar; we call it tarapi) and tarcmd, which uses the tar command.

The compression could be moved to the command line.

However, it would not make sense to do tarapi -> compression command -> encryption via API,

so switching between tarapi and tarcmd would be dropped, which also removes compatibility with older backups (unless the old routines keep using the API versions).

hg42 avatar Feb 01 '23 17:02 hg42

Yeah, not being able to restore an older backup because the compression changed would not be a good thing.

MrEngineerMind avatar Feb 01 '23 18:02 MrEngineerMind

Major releases have often broken compatibility. It wouldn't be anything new. To know if space savings are worth it, someone needs to test zstd.

opusforlife2 avatar Feb 02 '23 19:02 opusforlife2

Using "often" is not very accurate, and even if it were, this type of "compatibility" issue is rather unique to this type of app.

Imagine if Adobe Photoshop came out with a new version into which you could no longer load legacy JPEG/BMP image files because they are an old format.

MrEngineerMind avatar Feb 02 '23 19:02 MrEngineerMind

Apples and oranges. Apps get updated, so your old backups will eventually become irrelevant.

opusforlife2 avatar Feb 04 '23 05:02 opusforlife2

"Eventually...."

But what you are proposing is not a gradual thing... as soon as a new version of the NB app switches to a different compression, ALL previous backups become unusable, which is not a good thing.

MrEngineerMind avatar Feb 04 '23 13:02 MrEngineerMind

Sure, but if maintaining compatibility hinders improvements, then it isn't worth it. The deciding factor here will likely be how beneficial zstd actually is in practice.

opusforlife2 avatar Feb 04 '23 14:02 opusforlife2

The only practical solution is to allow the user to specify the compression method when doing a backup, and have NB auto-detect the compression method of a backup when restoring.

MrEngineerMind avatar Feb 04 '23 14:02 MrEngineerMind

Not that practical when you consider that the entire maintenance burden for the alternative code paths is on the developer(s). Totally up to them if they want to spend the time and effort for that.

opusforlife2 avatar Feb 04 '23 14:02 opusforlife2

Being a programmer, I know there would be a one-time effort to add support for a second compression method. But that effort would be a fraction of the effort required by your suggestion of switching to a totally different compression method.

And, once that work is completed, there will be little need to change it for any future versions of NB.

MrEngineerMind avatar Feb 04 '23 15:02 MrEngineerMind

Neither am I a developer, nor is this my feature request. I'm merely going by what I understand from https://github.com/NeoApplications/Neo-Backup/issues/444#issuecomment-1412473698.

opusforlife2 avatar Feb 04 '23 15:02 opusforlife2

And that #444 comment is saying that it would be a significant effort to modify NB to use that other compression method.

MrEngineerMind avatar Feb 04 '23 15:02 MrEngineerMind

The only practical solution is to allow the user to specify the compression method when doing a backup, and have NB auto-detect the compression method of a backup when restoring.

yes, that's the natural thing to do

Not that practical when you consider that the entire maintenance burden for the alternative code paths is on the developer(s). Totally up to them if they want to spend the time and effort for that.

right, it needs some restructuring; that's the biggest reason for now. And there are more important things to do; the advantage is too low compared to them.

The maintenance is not really a problem once the libs are mature (they just work) and the code is modularized. Up to now, compression is integrated via some conditionals, and that is not the way to go if we have multiple compression methods. E.g. the autodetection doesn't exist; it's "if compressed, use .gz and do gzip". This needs to use the stored compression method or the file extension instead.

And what to do if they don't match? A user could recompress a backup with a different algorithm, which is kind of reasonable from my POV. The usual (so-called professional) approach is to forbid users to manipulate the managed data, but I would like to support it (with obvious limitations); especially for backups, there should be robust strategies. As a conclusion, I would prefer the file extension. This has to be discussed between developers.
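
For illustration, extension-first detection with a magic-byte fallback could look roughly like this, assuming commons-compress; the function and its integration are hypothetical, not existing NB code:

import org.apache.commons.compress.compressors.CompressorStreamFactory
import java.io.BufferedInputStream
import java.io.InputStream

// Hypothetical sketch: pick the decompressor from the file extension,
// falling back to magic-byte sniffing when the extension is unknown.
fun detectCompression(fileName: String, input: InputStream): String? {
    when (fileName.substringAfterLast('.', "")) {
        "gz" -> return CompressorStreamFactory.GZIP
        "zst" -> return CompressorStreamFactory.ZSTANDARD
    }
    // detect() needs mark/reset support to peek at the magic number
    return try {
        CompressorStreamFactory.detect(BufferedInputStream(input))
    } catch (e: Exception) {
        null // plain tar, or an unknown method
    }
}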

hg42 avatar Feb 04 '23 19:02 hg42

Being a programmer

nice, so you could add it? :-)

I know there would be a one-time effort to add support for a second compression method

correct, it's basically making it modular

supporting more methods would be simple (given a library that supports the same interface, or at least one similar enough; in this case it needs to stream the data)
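
For example, a minimal sketch of such a streaming interface, assuming commons-compress (zstd would additionally need zstd-jni); all names here are made up:

import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream
import org.apache.commons.compress.compressors.zstandard.ZstdCompressorOutputStream
import java.io.OutputStream

interface CompressionMethod {
    val extension: String                     // drives the backup file name
    fun wrap(out: OutputStream): OutputStream // must stream, no temp files
}

object GzipMethod : CompressionMethod {
    override val extension = "gz"
    override fun wrap(out: OutputStream): OutputStream = GzipCompressorOutputStream(out)
}

object ZstdMethod : CompressionMethod {
    override val extension = "zst"
    override fun wrap(out: OutputStream): OutputStream = ZstdCompressorOutputStream(out)
}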

And, once that work is completed, there will be little need to change it for any future versions of NB

right, here the maturity kicks in. If the lib isn't mature and has bugs, like crashing for certain data, it would create maintenance work and, even worse, the backup could be unusable despite being compressed successfully.

This means I would not like to add any new algorithm until I have a feeling that it's really ready for important data. Note that many uses of compression are not mission-critical, but backup is (at least from my POV). (Note: I'm not the main developer.)

hg42 avatar Feb 04 '23 19:02 hg42

"This means, I would not like to add any new algorithm, until I have a feeling, that it's really ready for important data."

I agree.

MrEngineerMind avatar Feb 04 '23 19:02 MrEngineerMind

the #444 comment is about a command-line tools solution

I wanted to say that even if it's easy for DataBackup, it's a different story for NB.

some of my thoughts:

  • adding compression to the tar.sh script would indeed be easy
  • using a shell executable would mean either adding it to NB or using an external one
  • up to now we have rejected everything that needs binaries that are not part of toybox
  • using an external executable (e.g. from Termux) would be possible
  • preferences for the compress/decompress commands would be enough
  • autodetection would still be necessary
  • how to handle it when the user changes those commands, but there are old backups created with the previous commands?
  • handling all this would be much easier if we had plugins
  • because plugins are definitely in my plan (but a rather big effort), I tend to postpone everything that could be solved with plugins, at least if the work would otherwise be thrown away
  • modularization would also help the plugins, so this work would still be useful

hg42 avatar Feb 04 '23 19:02 hg42

anyone interested in this could also try some things to prove whether the advantage is as big as you hope (a rough timing sketch follows below):

  • the stream is root file system -> tar -> compression -> encryption -> SAF storage
  • disable compression and compare the speed and size
  • do this for different compression levels; less data also means more speed on SAF storage (or even remote)
  • note, any comparison on local file systems isn't helpful; the comparison needs to take SAF into account
  • that means faster compression doesn't help much if the stream is slow at the end

for wizards:

  • you can simulate part of the stream by using rclone to mount e.g. an sdcard directory on internal storage, then use tar to pack a directory and stream it through a compressor command into a file on that mount
  • report the commands, the execution times, the sizes, and the rclone mounts
  • you may also try remote directories (ssh, gdrive, etc.)
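
For the measurement itself, something like this rough Kotlin sketch could time one compressor at one level, assuming commons-compress plus zstd-jni (CountingOutputStream and timeZstd are hypothetical helpers); pass a target stream that really goes through SAF, otherwise the numbers say nothing about the full pipeline:

import org.apache.commons.compress.compressors.zstandard.ZstdCompressorOutputStream
import java.io.InputStream
import java.io.OutputStream

// counts the compressed bytes that actually reach the target
class CountingOutputStream(private val out: OutputStream) : OutputStream() {
    var bytes = 0L
        private set
    override fun write(b: Int) { out.write(b); bytes++ }
    override fun write(b: ByteArray, off: Int, len: Int) {
        out.write(b, off, len); bytes += len
    }
}

// returns (milliseconds, compressed size) for one run at one level
fun timeZstd(input: InputStream, target: OutputStream, level: Int): Pair<Long, Long> {
    val counter = CountingOutputStream(target)
    val start = System.nanoTime()
    ZstdCompressorOutputStream(counter, level).use { input.copyTo(it) }
    return (System.nanoTime() - start) / 1_000_000 to counter.bytes
}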

hg42 avatar Feb 04 '23 19:02 hg42

"This means, I would not like to add any new algorithm, until I have a feeling, that it's really ready for important data."

at least, this would only apply to a built-in compression method; it's more of a guideline. A configurable command line, or a plugin from an external repository etc., would be the user's own decision.

hg42 avatar Feb 04 '23 20:02 hg42

This should be possible, but the "why" question is what would interest me.

  1. It beats old gzip basically everywhere (compression/decompression speed/time, compression ratio, linear scaling of compression depending on the selected level).

https://morotti.github.io/lzbench-web -> section Evolution:

We're in the 3rd millennium, and there has been surprisingly little progress in general compression in the past decades. deflate, lzma and lzo are from the 90's; the origin of lz compression traces back to at least the 70's.

Actually, it's not true that nothing happened. Google and Facebook have people working on compression; they have a lot of data and a ton to gain by shaving off a few percent here and there.

Facebook in particular has hired the top compression research scientist and rolled out 2 compressors based on a novel compression approach that is doing wonders. That could very well be the biggest advance in computing in the last decade.

See zstd (medium) and lz4 (fast):

  • zstd blows deflate out of the water, achieving a better compression ratio than gzip while being multiple times faster to compress.
  • lz4 beats lzo and google snappy on all metrics, by a fair margin.

Better yet, they come with a wide range of compression levels that can adjust speed/ratio almost linearly. The slower end pushes against the other slow algorithms, while the fast end pushes against the other fast algorithms. It's incredibly friendly for a developer or a user. All it takes is a single algorithm to support (zstd) with a single tunable setting (1 to 20), and it's possible to accurately trade off speed for compression. It's unprecedented.

Of course one could say that gzip already offered tunable compression levels (1-9); however, it doesn't cover a remotely comparable range of speed/ratio. Not to mention that the upper half is hardly useful: it's already slow, and making it slower yields little benefit.


This is from a benchmark made 6 years ago (!), and things have surely improved since then.

  2. zstd is in the Linux kernel, and kernel devs don't merge random crap into it

    zstd can be used to compress the kernel itself and its modules, for zram/zswap, and for transparent filesystem compression (BTRFS)


This alone is the ultimate seal of approval; the Linux kernel is no joke

  3. It's basically everywhere now

    See https://en.wikipedia.org/wiki/Zstd#Usage

    A few examples:

    Repos of Linux packages are no joke either, and that scenario is pretty similar to the backup one:

    • fast compression/decompression -> job done faster, happy user, less battery used
    • smaller size -> less storage/bandwidth used

The fact that this issue has been open for 2 years is quite disappointing, imo.

murlakatamenka avatar Feb 24 '24 09:02 murlakatamenka

As for the implementation, I see that the compression library used, commons-compress, supports zstd, so I don't really get why it should be hard to use another compressor:

tar | gzip | encryption -> org.jdoe.app.tar.gz
tar | zstd | encryption -> org.jdoe.app.tar.zst

Add compression + level

enum Compression {
    Gzip(u8),
    Zstd(u8),
}

and expose those via the GUI, which at the moment exposes only the level because there is only gzip. Voilà? It's simplified, but isn't it right?
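
In Kotlin terms, a simplified sketch of that wiring with commons-compress (zstd additionally needs zstd-jni on the classpath); this is my guess, not NB's actual code:

import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream
import org.apache.commons.compress.compressors.gzip.GzipParameters
import org.apache.commons.compress.compressors.zstandard.ZstdCompressorOutputStream
import java.io.OutputStream

sealed class Compression {
    data class Gzip(val level: Int) : Compression() // 1..9
    data class Zstd(val level: Int) : Compression() // e.g. 1..19

    // wraps the downstream (encryption/storage) stream in the chosen compressor
    fun wrap(out: OutputStream): OutputStream = when (this) {
        is Gzip -> GzipCompressorOutputStream(out,
            GzipParameters().apply { compressionLevel = level })
        is Zstd -> ZstdCompressorOutputStream(out, level)
    }
}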


@hg42

autodetection would still be necessary

that's not a problem; a zstd-compressed stream has its own magic number, see https://datatracker.ietf.org/doc/html/rfc8878#section-3.1.1-3.2

Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528
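
A hand-rolled check is equally tiny; little-endian 0xFD2FB528 means the first four bytes on disk are 28 B5 2F FD (isZstd is a hypothetical helper):

// checks a file header for the zstd frame magic number
fun isZstd(header: ByteArray): Boolean =
    header.size >= 4 &&
        header[0] == 0x28.toByte() &&
        header[1] == 0xB5.toByte() &&
        header[2] == 0x2F.toByte() &&
        header[3] == 0xFD.toByte()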


how to handle it when the user changes those commands, but there are old backups created with the previous commands?

I'd say this is overthinking it. If a user is smart and advanced, they either won't do it at all, or won't recompress a tar.gz into a tar.xz and then expect NB to work as if nothing had happened.

murlakatamenka avatar Feb 24 '24 09:02 murlakatamenka

  1. It beats old gzip basically everywhere (compression/decompression speed/time, compression ratio, linear scaling of compression depending on the selected level).

the memory consumption would be an important part, too; speed often comes from using more memory, and:

  • there are several backups running in parallel
  • phones have less memory than workstations and servers

hg42 avatar Feb 25 '24 02:02 hg42