Signal-Android
introduce chunked backups
First time contributor checklist
- [x] I have read how to contribute to this project
- [x] I have signed the Contributor License Agreement
Contributor checklist
- [x] I am following the Code Style Guidelines
- [x] I have tested my contribution on these devices:
- Oneplus 3, Android 9
- Virtual device, Android 9
- Virtual device, Android 11
- [x] My contribution is fully baked and ready to be merged as is
- [x] I ensure that all the open issues my contribution fixes are mentioned in the commit message of my first commit using the `Fixes #1234` syntax
Description
I work for bevuta IT GmbH, a company based in Germany. Our boss is a Signal user on Android. His problem is that he can't back up his Signal app data to the SD card, and despite 128 GB of internal memory he has no free internal storage left, mainly because of the Signal backups. The most reasonable explanation was that his Android version doesn't support file systems other than FAT32 on SD cards. FAT32 has some limitations, such as a maximum file size of ~4 GB, and his Signal installation uses more than 4 GB. To us, using an SD card for backups also seems advantageous, as it is easier to get at the data if the device breaks down.
We discussed several solutions. The best one we came up with is to chunk the Signal backup into parts smaller than 4 GB, if desired. I was tasked with implementing it. To use this code in a real Signal installation, it has to be in the official release. So here is my MR! :-)
@Roghetti This sounds amazing, one feature I am really looking forward to!
I do have a question: could you explain how it is being chunked? I had the idea some time ago that it would be amazing to create "yearly" backups to achieve the same. The reason is that it would allow a partial restore of a time range (without losing the rest, if the key does not change).
Also, do the chunks change between runs? E.g. does only the last chunk change because it is the one containing the new data?
> @Roghetti This sounds amazing, one feature I am really looking forward to!
Thanks! :-)
> I do have a question: could you explain how it is being chunked?
It's just like a normal backup, but split into multiple files if a single file would exceed a certain size. E.g., if your normal backup is 1.5 GB, the chunked backup would create two files, one of 1 GB and one of 0.5 GB, as 1 GB is the threshold I chose.
If you combine these files, e.g. with `cat` on the command line, you would get the same file as the 'normal' backup mechanism would give you.
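To make the split-and-recombine claim concrete, here is a minimal sketch in Python (not the PR's code, which is Java). The function names are made up for illustration, and the sizes are scaled down from GB to bytes so it runs instantly.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_chunks(chunks: list[bytes]) -> bytes:
    """Concatenate chunks in order, the in-memory equivalent of `cat`."""
    return b"".join(chunks)


# Stand-in for a 1.5 GB backup with a 1 GB threshold, scaled down to bytes.
backup = bytes(range(256)) * 6            # 1536 bytes of sample data
chunks = split_into_chunks(backup, 1024)  # threshold of 1024 bytes

print([len(c) for c in chunks])           # sizes of the resulting chunks
print(join_chunks(chunks) == backup)      # recombining restores the original
```

Only the last chunk can be smaller than the threshold; every earlier chunk is filled completely, which is why plain concatenation in order is enough to restore the single-file backup.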
> I had the idea some time ago that it would be amazing to create "yearly" backups to achieve the same. The reason is that it would allow a partial restore of a time range (without losing the rest, if the key does not change). Also, do the chunks change between runs? E.g. does only the last chunk change because it is the one containing the new data?
Potentially all chunks are affected when data changes. Since this chunking works like the normal backup, just with multiple files, it depends on the order in which the data is exported. The current code exports each table of the SQLite database one by one, e.g. first MMS, then SMS, then Reactions and so on. If you receive a new MMS, all chunks from the point of the new MMS onward will change. I like your idea of incremental backups, but it would have to be addressed in a separate pull request, and it would affect both the 'normal' and the chunked backup mechanism.
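The effect of the export order can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the fixed-width row format, the two-table export, and the tiny chunk size are stand-ins, not Signal's real backup format. One extra row in the first table shifts every byte after it, so every chunk from that point on gets a different digest.

```python
import hashlib

CHUNK_SIZE = 1000  # stand-in for the 1 GB threshold


def table_bytes(name: str, rows: int) -> bytes:
    """Serialize a hypothetical table as distinct, fixed-width rows."""
    return b"".join(f"{name}:{i:06d};".encode() for i in range(rows))


def export(mms_rows: int, sms_rows: int) -> bytes:
    """Export tables one after another: first MMS, then SMS."""
    return table_bytes("mms", mms_rows) + table_bytes("sms", sms_rows)


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return one SHA-256 digest per fixed-size chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


before = chunk_digests(export(mms_rows=150, sms_rows=150))
after = chunk_digests(export(mms_rows=151, sms_rows=150))  # one new MMS

# Indices of the chunks whose content differs between the two runs.
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(changed)
```

The chunk that lies entirely before the new row is unchanged; the chunk containing the insertion point and all later chunks differ, because fixed-size chunk boundaries do not move with the data.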
Ahhh, that's a shame. I had hoped that this could reduce the data transferred each night and reduce wear on my flash storage. Thanks anyway!
Hey there, appreciate the PR. I'm not sure if we'll pull this in or not. I'll bring it up with the team.
@cody-signal There is also a discussion in the forum regarding this. I think we need to address this somehow, because backups can easily grow beyond 4 GB, which is problematic for the different reasons mentioned in the discussion.
The following feature request is related: Backup to FAT card
Will the backups also be incremental? The current implementation, which forces a full backup every day, is really dumb :slightly_frowning_face:
@HyperCriSiS Sadly no, this will only 'split' the backup file into parts. I hope that they will switch to incremental backups eventually. There is a linked discussion.
However, changing the backup format should not be done very often, so we should do it properly. I think this PR should not be merged, since it fixes the symptom (backups being over 4 GB) and not the underlying issue of ballooning backup files (missing segmentation/incremental backups).
> I think this PR should not be merged, since it fixes the symptom (backups being over 4 GB) and not the issue (missing segmentation/incremental backups) of ballooning backup files.
I would not agree with that. Chunking would help people whose devices have small internal storage but an SD card slot, once the Signal container grows over 4 GB and, most probably, the backup file with it. Without chunking, it needs 3 times >4 GB (2 existing backups plus a temporary additional copy until the new backup has finished). The backup is encrypted, so it is no problem to put it on the card. That is also the reason this issue was described initially.
Looking forward to it getting merged.
@Smojo I don't disagree with you that the growing file size is a problem (it is, and it will get even worse with time).
However, the underlying issue is not that some filesystem has limitations that chunking needs to work around, but that Signal backup files grow in size indefinitely.
Chunking DOES fix that, but only the symptom, not the problem.
I would like to see a solution where a single backup is split by year, and then only made 'on demand' (do we ALWAYS have to store chats again for 2016?)
This would solve the described issue AND the problem (as long as someone does not add more than 4 GB per year, but I deem that unlikely).
Yeah, this might be something that is needed and helps as well.
Right now (I may be wrong) the only possibility is to shorten chats by message count (which also results in a smaller container and therefore smaller backup file[s]), but nobody can really tell what this means in terms of "message retention" for a single chat. Some chats will not even go back a year if they are very talkative.
So I would agree that chunking by time, which results in a chunked container AND backup files (which might also have a file size limit of 4 GB in addition to the split by time), will also solve the user's problem described here (without further thinking about symptom and problem ^^).
Generally I would ask: is there already "... a solution where a single backup is split by year ..."? If not, why not merge this PR and at least mitigate the problem for some users now, even if it is not the 100% solution and does not address the "root" problem?
> I would like to see a solution where a single backup is split by year, and then only made 'on demand' (do we ALWAYS have to store chats again for 2016?)
I think what counts as the problem is in the eye of the beholder. Incremental backups carry the risk that an increment breaks unnoticed. If, like me, you want to keep all your messages, the solution implemented here is generally reasonable and complete in itself. If you don't want to keep old messages, you could delete them in the application, and then they would no longer be part of a full backup. Different people, different needs. Either way, this solution should be better than none. If you implement something better, that would be all the cooler, of course. But that is no reason to wait, right?
I would be very happy if this PR would be merged. Then I wouldn't have to delete backups by hand every day 😃
@lieblingsnerd Sure, different users have different needs. But there is an extremely good reason to not merge this/wait for now.
As far as I can see, the Signal devs are quite conservative in terms of new features, and this will certainly be a breaking change (for backups). If the backup system gets changed, they will not do so lightly, and not often. That means taking the time to really think the issue through is a very good reason to wait.
Also: I do not consider 'delete some old messages' to be a valid solution. Backups are there because I DO want to keep old messages; deleting them cannot be the answer.
And afaik you still need to delete them; this PR only splits the backup into multiple files that are smaller than the filesystem limit.
I don't see it as a breaking change. The current state of the backup system is really bad because it is incomplete. For some days now, Signal has even been losing the permission to the folder. I have to re-enable it every few days.
@HyperCriSiS Any major change to the way backups are created is possibly a breaking change. This PR splits the single backup into multiple files. That requires long-term support and, if not done correctly, may render old backups unusable. Therefore it could be breaking.
However, your problem does not seem related at all. Have you checked existing issues, and opened a proper ticket for your problem if none exists?
@lieblingsnerd
> Different people, different needs.
Agree!
> Then I wouldn't have to delete backups by hand every day 😃
Agree ... doing the same, as I'm out of storage and cannot use the SD card because of the 4 GB limit.
@HyperCriSiS
> Signal even loses the permission to the folder. I have to re-enable it every few days.
Also had this issue, but this PR will probably not solve that one ^^
> I don't see it as a breaking change.
Agree. Why is file chunking a breaking change? (Just writing that while realizing that I have no clue about coding ... still, it will not break the current backup logic / way of working ... all the UI stuff can stay, imho.)
Still, @newhinton might be correct with regard to backward compatibility etc. If so, I'm sure @Roghetti can comment on that.
> I would be very happy if this PR would be merged.
+1 Again: why not merge this PR and at least mitigate the problem for some users now, even if it is not the 100% solution and does not address the "root" problem?
@Smojo I am not against merging this PR. I am just advocating doing it properly the first time, and making it work properly for everyone without stopgap measures. We should not change such important parts of an application more often than we need to.
In my opinion, chunking alone only fixes a minor part of the issue, because big files (chunked or not) still cannot properly be moved to cloud storage or remote storage in general.
In the end the devs need to decide how they would like to address this issue, and I can only hope that it will be a solution that satisfies my use case as well as yours. If it only helps people with a 4 GB limit, I will still be happy for you ;)
Edit: Regarding the chunking:
Disclaimer: I have not tested this PR. But as I said, it MAY be breaking. When you make such a change, you have to think about the long-term consequences: if we introduce chunking, how does Signal deal with "old-school" backups? Will they be incompatible? Do we need to support both? If not, how do we deal with people who want to restore older ones? If we do, for how long? Those are all questions that the Signal devs need to answer, because they will have to support all of this after the original author may have left the Signal project. Every developer has to answer such questions at some point, and they are not trivial. This is also why seemingly easy fixes are sometimes not introduced immediately, but take time while the implications are considered.
And also, as someone else here already stated, what exactly is the 'root' cause? Is it FAT with its limitation, or is it the files that are too big? This is not necessarily a technical question but a user experience one, and that depends on your view ;)
There is also this comment:
https://github.com/signalapp/Signal-Android/issues/11509#issuecomment-1097142650
> big files (chunked or not) still cannot properly be moved to cloud-storage or remote storage in general.
Isn't chunking every 1 GB (@Roghetti explained that already) at least one way to deal with that? It is, imho, an okayish size limit for the chunks and/or for "bigger" files on smartphones, and also for copying to wherever you want. In other words: I would not say that the chunks are big files ;-) but the current backups are.
> what exactly is the 'root' cause? Is it FAT with its limitation, or is it the files that are too big? This is not necessarily a technical question but a user experience one, and that depends on your view
I can just repeat myself (and close with the same sentence). There is certainly other stuff to be solved, but it feels like this PR should not be merged (and/or should be discussed forever) just because it does not address all the problems and/or potential root cause(s) out there. FAT is Android's default for portable storage. We don't need to discuss its limit for files bigger than 4 GB, or push the problem onto it.
Also, I don't see where the discussion leads about whether it is bad that the Signal message container is big and/or the backup is even bigger (for whatever reason, e.g. missing dedup for media in the backup, as explained in the comment), and/or why people want to keep messages, and/or in which ways messages can be deleted, and/or how long older backups should be supported.
Just the opinion of one user who is struggling with this (alongside @Roghetti, @HyperCriSiS, and @lieblingsnerd).
To clarify a bit: backup chunking is an additional backup option introduced with this pull request. So you can make traditional single-file backups or chunked backups, as you wish. The backup mechanism is otherwise the same and uses the same code, i.e. it isn't much of a burden to maintain. Technically, we use a custom OutputStream class that does the chunking in place of the normal FileOutputStream. This also gives compatibility in both directions: (1) You can take a normal backup, split it up on the command line, and import it in Signal (once this PR is merged). (2) You can take a chunked backup, cat it on the command line into a single file, and import it on a regular Signal installation.
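The idea of a stream that does the chunking behind the writer's back can be sketched as follows. This is a Python illustration of the technique, not the PR's Java OutputStream class; the class name `ChunkedWriter`, the `.000`/`.001` suffix scheme, and the file name `signal.backup` are all hypothetical.

```python
import os
import tempfile


class ChunkedWriter:
    """File-like writer that rolls over to a new file once a chunk is full.

    Hypothetical naming scheme: base_path + ".000", ".001", and so on.
    """

    def __init__(self, base_path: str, max_chunk_bytes: int):
        if max_chunk_bytes <= 0:
            raise ValueError("max_chunk_bytes must be positive")
        self.base_path = base_path
        self.max_chunk_bytes = max_chunk_bytes
        self.paths = []      # chunk files created so far, in order
        self._file = None    # currently open chunk file
        self._written = 0    # bytes written to the current chunk

    def _open_next(self):
        if self._file is not None:
            self._file.close()
        path = f"{self.base_path}.{len(self.paths):03d}"
        self._file = open(path, "wb")
        self.paths.append(path)
        self._written = 0

    def write(self, data: bytes) -> int:
        view = memoryview(data)
        total = len(view)
        while len(view) > 0:
            if self._file is None or self._written == self.max_chunk_bytes:
                self._open_next()
            room = self.max_chunk_bytes - self._written
            piece = view[:room]  # never write past the chunk limit
            self._file.write(piece)
            self._written += len(piece)
            view = view[len(piece):]
        return total

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


def concatenate(paths, out_path):
    """Join chunk files in order, like `cat chunk.000 chunk.001 > out`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as chunk:
                out.write(chunk.read())


# Demo with a 2500-byte payload and a 1000-byte chunk limit.
payload = b"".join(i.to_bytes(2, "big") for i in range(1250))
workdir = tempfile.mkdtemp()
with ChunkedWriter(os.path.join(workdir, "signal.backup"), 1000) as writer:
    writer.write(payload[:700])   # the caller writes as to a single file
    writer.write(payload[700:])

chunk_sizes = [os.path.getsize(p) for p in writer.paths]
joined_path = os.path.join(workdir, "joined.backup")
concatenate(writer.paths, joined_path)
with open(joined_path, "rb") as f:
    restored = f.read()
print(chunk_sizes, restored == payload)
```

Because the code producing the backup only ever calls `write`, it does not need to know whether it is writing one file or several, which matches the point that the same backup code serves both modes.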
I don't see any option for this right now. Is this coming in a future release?
That pull request would make my life so much easier, as I can currently only use FAT32 for Signal backups on my external SD card on Android, and the backup exceeds a 4 GiB file size. It would be great if the devs could consider merging this one 👍 .
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
There is still a need for this functionality.
Please, someone merge this! <3
We won't be merging this as-is, but we do have some plans for trying to do some of this work in the coming weeks. I'll leave this open until we get around to it. Thanks!
Good news, @greyson-signal <3 Please keep us informed of the progress and the way you want to do it!
If there is any plan in this process to allow even basic incremental backups (eg. chunking + rotating selection from tables based on time periods), then I would be happy to help on any aspect of the Android/common coding or testing side of things.
I have not looked at the data side of things, but a naive approach, it seems to me, would be to:
- find all records <= year1, where year1 = min(year)
- back them up via chunking
- finish the backup and start a 'new' backup with a new base name
- find all records in year2, where year2 = year1 + 1
- back them up via chunking
- ditto all the way to yearN and year(N+1)
- on subsequent backups, read the old chunks and write new temporary chunks, discarding a temp chunk if the old chunk matches, and discarding the old chunk if a discrepancy is found
This should result in a single current and consistent backup, and only result in 'new' files for the recent year (unless the record formats have changed or data has been added for prior years). One could also chunk by month or some other user-defined interval.
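The per-year approach above can be sketched roughly like this in Python. It is a toy model under stated assumptions: records are hypothetical `(iso_timestamp, body)` tuples, the serialization is invented, the "store" is a plain dict instead of files, and the comparison is done on plaintext. If the real backup output differs between runs for identical input (for example due to encryption with a fresh IV), comparing written chunks byte for byte would not work as shown.

```python
import hashlib
from collections import defaultdict


def serialize_year(records):
    """Hypothetical serialization: one 'timestamp|body' line per record."""
    return "\n".join(f"{ts}|{body}" for ts, body in sorted(records)).encode()


def backup_by_year(records, store):
    """Update store (year -> bytes) in place; return the years rewritten.

    For each year a fresh temporary chunk is built and compared with the
    old one by digest. A matching temp chunk is discarded; otherwise the
    old chunk is replaced.
    """
    by_year = defaultdict(list)
    for ts, body in records:
        by_year[int(ts[:4])].append((ts, body))  # year = first 4 chars

    rewritten = []
    for year in sorted(by_year):
        fresh = serialize_year(by_year[year])
        old = store.get(year)
        if old is not None and (
            hashlib.sha256(old).digest() == hashlib.sha256(fresh).digest()
        ):
            continue  # old chunk matches: keep it, drop the temp chunk
        store[year] = fresh
        rewritten.append(year)
    return rewritten


store = {}
messages = [
    ("2016-03-01T10:00", "hi"),
    ("2016-07-09T08:30", "photo"),
    ("2017-01-02T12:00", "happy new year"),
]
first_run = backup_by_year(messages, store)

messages.append(("2017-05-05T09:00", "new message"))
second_run = backup_by_year(messages, store)
print(first_run, second_run)
```

The sketch leaves out cases a real design would need to handle, such as years whose records have all been deleted, and per-year chunks that themselves exceed the file size limit.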