signal_for_android_decryption
Attachments not extracted
Hello,
I was testing your tool, but I think there is a problem with the extraction of attachments. I have a backup larger than 800 MB; I do not know how many attachments it contains, but many :-). The extracted folder is only 37 MB, with only one image (the last one sent) in the attachments folder.
Do you have any idea what I can check? I am using the latest version of the tool (git-cloned today) on Ubuntu 23.10, with Python in a conda environment. Something like:
conda create -n signal_backup python
conda activate signal_backup
pip install -r requirements.txt
The command I use:
python decrypt_backup.py --passphrase "[...]" ../signal-2024-02-01-23-27-30.backup ../ClearText-20240201
Thanks in advance.
Hello!
Have you had a look in the other directories (e.g. stickers), or do you end up with a huge SQLite database? Does the tool produce any output about not being able to parse something? It's possible the backup format has changed again recently, which would cause the decryption process to halt.
Hello!
and thanks for your answer.
Yes, I have checked it:
-) No warnings/errors in the output, just the percent progression.
-) The whole extracted directory is 37 MB.
-) The database is 14 MB. I have no way to check whether all messages are there, but I think it is reasonable to assume that all messages were extracted correctly.
I have checked again with today's backup and I get the same result, whereas I can correctly extract all attachments from a backup from February 2022.
Hmm. Well that's not very encouraging! I know the backup format changed slightly ~7 months ago (to encrypt part of the stream which previously wasn't) but nobody noticed the actual content/organisation changing. The code in Signal has changed enough (and it's been long enough) that I can't tell at a glance if anything has happened.
I'm afraid it's time to start debugging :/
A couple of quick sanity checks you could try are:
Stick a print(args.backup_file.tell()) at the end of the main() function (bottom of the file) and see how many bytes it read from the backup file. It should be the whole file but if it isn't, that is a good place to start.
Another thing to do would be to whack a print into the code which unpacks attachments and double check that it isn't being called dozens of times with the same filename or some nonsense like that.
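The first check can be made a little more direct by comparing the read position with the file size. This is only a sketch: unread_bytes is a hypothetical helper, not part of the script, and it assumes args.backup_file is an ordinary seekable binary file object.

```python
import io
import os
from typing import BinaryIO


def unread_bytes(backup_file: BinaryIO) -> int:
    """Return how many bytes lie beyond the file's current read position.

    Hypothetical helper: assumes the file object is seekable.
    """
    position = backup_file.tell()
    # seek() returns the new absolute position, i.e. the total size here.
    total = backup_file.seek(0, os.SEEK_END)
    # Restore the original position so the caller is not disturbed.
    backup_file.seek(position)
    return total - position


if __name__ == "__main__":
    # Stand-in for the backup file: 6 bytes, of which 4 have been read.
    demo = io.BytesIO(b"abcdef")
    demo.read(4)
    print("bytes left unread:", unread_bytes(demo))
```

Calling print(unread_bytes(args.backup_file)) at the end of main() should give 0 if the whole backup was consumed.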
In any case, I'm kind-of surprised that we're in this situation: the script is written relatively carefully to make it break loudly rather than silently skipping things in the event of the backup format changing in any material way :/.
Let me know how you get on!
Best of luck!
As a separate tip, the sqlite-browser GUI tool is very handy for browsing through the database and might be a good way to see if the messages you expect are in there. The format may have changed a bit but there are a few hints on how to navigate it as it was a few years back here: http://jhnet.co.uk/articles/signal_backups#using-a-decrypted-backup
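If a GUI isn't handy, a rough overview of the database can also be had from Python's built-in sqlite3 module. The sketch below lists every table with its row count; table_row_counts is a made-up name, and the path to the decrypted database is whatever the script wrote into your output directory.

```python
import sqlite3
from contextlib import closing
from typing import Dict, Optional


def table_row_counts(db_path: str) -> Dict[str, Optional[int]]:
    """Map each table name in the database to its row count.

    Tables that cannot be read (for example virtual tables whose module
    is unavailable) are reported with a count of None.
    """
    counts: Dict[str, Optional[int]] = {}
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        for name in names:
            # Quote the identifier, since it comes from the database itself.
            quoted = '"' + name.replace('"', '""') + '"'
            try:
                counts[name] = conn.execute(
                    f"SELECT COUNT(*) FROM {quoted}"
                ).fetchone()[0]
            except sqlite3.Error:
                counts[name] = None
    return counts
```

For example, printing table_row_counts("path/to/decrypted/database") gives a quick feel for whether the message and attachment tables are populated.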
I can confirm a similar issue. My backup is around 4GB, mostly media I'm assuming (I hope!), where there's only one file in attachments, which is a picture I shared in a group yesterday.
Looking through the database, I can see references to the media with around 5,456 rows.
If I can provide any help in troubleshooting this, let me know.
Hmm! How bizarre! I'm afraid I've not got time to look into this at the moment so you'll have to do some digging of your own.
It's worth noting that the script will check that it has processed the whole backup file and, bugs notwithstanding, I don't believe it throws anything away... The question is where things are ending up... Might be worth throwing in some prints to see what's being loaded. (Are all the files being extracted with the same name and overwriting each other? That kind of thing!)
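One way to test the overwriting theory without reading pages of prints is to collect the output filenames as they are generated and count repeats. A minimal sketch, with find_overwritten as a hypothetical helper that is not part of the script:

```python
from collections import Counter
from typing import Dict, Iterable


def find_overwritten(filenames: Iterable[str]) -> Dict[str, int]:
    """Return the filenames that occur more than once, with their counts."""
    counts = Counter(filenames)
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up filenames standing in for those collected during extraction.
    seen = ["0.bin", "0.bin", "1.bin", "0.bin"]
    print(find_overwritten(seen))
```

An empty result means every attachment got its own file; anything else names the files that were written more than once.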
Thanks! Yeah, it looks like one file is being created in the attachments directory, 0.bin, overwritten with each image looped through. I had the filename and length print off to confirm this. The filename remains the same, but the length value does indicate different files.
if backup_frame.HasField("attachment"):
    filename = (
        attachments_directory
        / f"{backup_frame.attachment.attachmentId}.bin"
    )
    length = backup_frame.attachment.length
    print(filename, length)
~/signal/attachments/0.bin 23374
~/signal/attachments/0.bin 81639
~/signal/attachments/0.bin 47866
~/signal/attachments/0.bin 42884
~/signal/attachments/0.bin 16727
~/signal/attachments/0.bin 23855
~/signal/attachments/0.bin 31127
~/signal/attachments/0.bin 24665
~/signal/attachments/0.bin 102735
Edit: Ah right, I've changed attachmentId above to rowId, which is doing the trick. I'm not super familiar with Python or working with these libraries, and I'm not sure if this ties in with how things were before or not, but it has given me individual images.
Was the filename previously tied in with the message ID?
Good sleuthing!
They must have changed the thing they used to identify the files (or maybe they made it implicit)... I look forward to seeing what you find! If it helps, according to my notes, a few years back:
(...) the original mime type of attachments can be found in the part table in the ct column. Attachment IDs may be found in the unique_id column. The caption column contains caption text associated with the attachment. The mid column is a foreign key pointing to entries in the mms table containing the message this attachment was sent in.
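For anyone who wants to poke at this from Python, those notes translate into a query along the following lines. It is a sketch only: the table and column names (part, unique_id, ct, caption, mid) are taken from the notes above and may no longer match current backups, and list_attachments is a made-up name.

```python
import sqlite3
from typing import List, Optional, Tuple

# (unique_id, mime type, caption, message id), as described in the notes.
AttachmentRow = Tuple[int, Optional[str], Optional[str], int]


def list_attachments(conn: sqlite3.Connection) -> List[AttachmentRow]:
    """List attachment metadata from the part table, ordered by message.

    Assumes the old schema from the notes; the query will raise
    sqlite3.OperationalError if the table or columns have been renamed.
    """
    return conn.execute(
        "SELECT unique_id, ct, caption, mid FROM part ORDER BY mid"
    ).fetchall()
```

If the query fails with an OperationalError, that is itself a useful hint that the schema has moved on since those notes were written.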
Good luck, and in the worst case you've at least got a way of getting the images out in the meantime, if not their metadata!
Thanks for sharing!
On Mon, 26 Feb 2024, at 9:53 AM, Nodeswitch wrote:
Thanks! Yeah, it looks like one file is being created in the attachments directory, 0.bin, overwritten with each image looped through. I had the filename and length print off to confirm this. The filename remains the same, but the length value does indicate different files.
if backup_frame.HasField("attachment"):
    filename = (
        attachments_directory
        / f"{backup_frame.attachment.attachmentId}.bin"
    )
    length = backup_frame.attachment.length
    print(filename, length)
~/signal/attachments/0.bin 23374
~/signal/attachments/0.bin 81639
~/signal/attachments/0.bin 47866
~/signal/attachments/0.bin 42884
~/signal/attachments/0.bin 16727
~/signal/attachments/0.bin 23855
~/signal/attachments/0.bin 31127
~/signal/attachments/0.bin 24665
~/signal/attachments/0.bin 102735
I'll keep digging around.
Here's my solution. In decrypt_backup.py, I used
if backup_frame.HasField("attachment"):
    filename = attachments_directory / f"{backup_frame.attachment.rowId}.bin"
instead of
if backup_frame.HasField("attachment"):
    filename = (
        attachments_directory
        / f"{backup_frame.attachment.attachmentId}.bin"
    )
where .attachment.rowId is the actual identifier in the database, and as such works well later if you want to remap the binary files.
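As an illustration of that remapping, here is a rough sketch that renames each <rowId>.bin file to carry an extension guessed from its mime type. It assumes you have already built a mapping from row ID to mime type out of the database (how to do that depends on the current schema and is not shown); add_extensions and mime_by_row_id are hypothetical names.

```python
import mimetypes
from pathlib import Path
from typing import Dict, List


def add_extensions(
    attachments_directory: Path, mime_by_row_id: Dict[int, str]
) -> List[Path]:
    """Rename <rowId>.bin files using an extension guessed from the mime type.

    Files that are missing, or whose mime type has no known extension,
    are left untouched. Returns the new paths of the renamed files.
    """
    renamed = []
    for row_id, mime_type in mime_by_row_id.items():
        source = attachments_directory / f"{row_id}.bin"
        extension = mimetypes.guess_extension(mime_type or "")
        if extension is None or not source.exists():
            continue
        target = source.with_suffix(extension)
        source.rename(target)
        renamed.append(target)
    return renamed
```

Skipping unknown types rather than guessing keeps the original .bin files intact, so the rename can be re-run once a better mapping is available.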
Edit: Right, that was your solution as well @Nodeswitch, I just read your Edit 😃
Here's a gist to cleanly extract messages and attachments after decrypting a Signal backup using this repo: https://gist.github.com/shoufanzaid/8869fd133d2e11e3b495995a88f9f1e3.
Sorry it has taken me so long to get around to this but I've updated the script to use rowIds to name attachments.
Thanks everyone for the debugging!