signal_for_android_decryption

Attachments not extracted

Open 0rC0 opened this issue 1 year ago • 10 comments

Hello,

I was testing your tool, but I think there is a problem with the extraction of attachments. I have a backup larger than 800MB; I do not know how many attachments it contains, but many :-). The extracted folder is only 37MB, with only one image (the last one sent) in the attachments folder.

Do you have any idea what I can check? I am using the latest version of the tool (git-cloned today) on Ubuntu 23.10, with Python in a conda environment. Something like:

conda create -n signal_backup python
conda activate signal_backup
pip install -r requirements.txt

The command I use:

python decrypt_backup.py --passphrase "[...]" ../signal-2024-02-01-23-27-30.backup ../ClearText-20240201

Thanks in advance.

0rC0 avatar Feb 01 '24 23:02 0rC0

Hello!

Have you had a look in the other directories (e.g. stickers), or do you end up with a huge SQLite database? Does the tool produce any output about not being able to parse something? It's possible the backup format has changed again recently, which would cause the decryption process to halt.

mossblaser avatar Feb 02 '24 14:02 mossblaser

Hello!

and thanks for your answer.

Yes, I have checked it:

-) No warnings/errors in the output, just the percent progression.
-) The whole extracted directory is 37MB.
-) The database is 14MB.

I have no way to check whether all messages are there, but I think it is reasonable to assume that all messages were extracted correctly.

I have checked again with today's backup and I get the same result, while I can correctly extract all attachments from a backup from February 2022.

0rC0 avatar Feb 02 '24 17:02 0rC0

Hmm. Well that's not very encouraging! I know the backup format changed slightly ~7 months ago (to encrypt part of the stream which previously wasn't) but nobody noticed the actual content/organisation changing. The code in Signal has changed enough (and it's been long enough) that I can't tell at a glance whether anything has happened.

I'm afraid it's time to start debugging :/

A couple of quick sanity checks you could try are:

Stick a print(args.backup_file.tell()) at the end of the main() function (bottom of the file) and see how many bytes it read from the backup file. It should be the whole file but if it isn't, that is a good place to start.

Another thing to do would be to whack a print into the code which unpacks backups and double check that it isn't being called dozens of times with the same filename or some nonsense like that.
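The first check above can be sketched in isolation like this. It is only an illustration on a stand-in temporary file: `bytes_unread` is a hypothetical helper, not part of decrypt_backup.py, and it assumes the backup is opened as an ordinary binary file object.

```python
import os
import tempfile


def bytes_unread(backup_file):
    """Return how many bytes lie between the current read position and the end of the file."""
    position = backup_file.tell()
    size = os.fstat(backup_file.fileno()).st_size
    return size - position


# Stand-in for a real backup: 1000 bytes, of which only 400 are consumed at first.
with tempfile.TemporaryFile() as f:
    f.write(b"x" * 1000)
    f.flush()
    f.seek(0)
    f.read(400)
    print(bytes_unread(f))  # 600: the reader stopped early
    f.read()
    print(bytes_unread(f))  # 0: the whole file was consumed
```

A non-zero result at the end of main() would suggest the script stopped before reaching the end of the backup.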

In any case, I'm kind-of surprised that we're in this situation: the script is written relatively carefully to make it break loudly rather than silently skipping things in the event of the backup format changing in any material way :/.

Let me know how you get on!

Best of luck!

mossblaser avatar Feb 02 '24 19:02 mossblaser

As a separate tip, the sqlite-browser GUI tool is very handy for browsing through the database and might be a good way to look to see if the messages you expect are in there. The format may have changed a bit but there are a few hints on how to navigate it as it was a few years back here: http://jhnet.co.uk/articles/signal_backups#using-a-decrypted-backup

mossblaser avatar Feb 02 '24 19:02 mossblaser

I can confirm a similar issue. My backup is around 4GB, mostly media I'm assuming (I hope!), but there's only one file in attachments, which is a picture I shared in a group yesterday.

Looking through the database, I can see references to the media with around 5,456 rows.

If I can provide any help in troubleshooting this, let me know.

Nodeswitch avatar Feb 26 '24 06:02 Nodeswitch

Hmm! How bizarre! I'm afraid I've not got time to look into this at the moment so you'll have to do some digging of your own.

It's worth noting that the script will check that it has processed the whole backup file and, bugs notwithstanding, I don't believe it throws anything away... The question is where things are ending up... Might be worth throwing in some prints to see what's being loaded. (Are all the files being extracted with the same name and overwriting each other? That kind of thing!)
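One way to make the "same name" check less tedious than reading prints by eye is to count how often each output path is written. This is a minimal sketch, not code from the script: `find_overwritten` and the example paths are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def find_overwritten(paths):
    """Return {path: times_written} for every path that was written more than once."""
    counts = Counter(Path(p) for p in paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical record of the filenames the extraction loop wrote to.
written = [
    Path("attachments/0.bin"),
    Path("attachments/0.bin"),
    Path("attachments/1.bin"),
]
print(find_overwritten(written))
```

Any entry in the result means later attachments silently replaced earlier ones on disk.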

mossblaser avatar Feb 26 '24 07:02 mossblaser

Thanks! Yeah, it looks like one file is being created in the attachments directory, 0.bin, and it is overwritten with each image looped through. I printed the filename and length to confirm this. The filename remains the same, but the length value does indicate different files.

            if backup_frame.HasField("attachment"):
                filename = (
                    attachments_directory
                    / f"{backup_frame.attachment.attachmentId}.bin"
                )
                length = backup_frame.attachment.length
                print(filename, length)

~/signal/attachments/0.bin 23374
~/signal/attachments/0.bin 81639
~/signal/attachments/0.bin 47866
~/signal/attachments/0.bin 42884
~/signal/attachments/0.bin 16727
~/signal/attachments/0.bin 23855
~/signal/attachments/0.bin 31127
~/signal/attachments/0.bin 24665
~/signal/attachments/0.bin 102735

Edit: Ah right, I've changed attachmentId above to rowId, which is doing the trick. I'm not super familiar with Python or working with these libraries, and I'm not sure if this ties in with how things were before or not, but it has given me individual images.

Was the filename previously tied in with the message ID?

Nodeswitch avatar Feb 26 '24 09:02 Nodeswitch

Good sleuthing!

They must have changed the thing they used to identify the files (or maybe they made it implicit...). I look forward to seeing what you find! If it helps, according to my notes, a few years back:

(...) the original mime type of attachments can be found in the part table in the ct column. Attachment IDs may be found in the unique_id column. The caption column contains caption text associated with the attachment. The mid column is a foreign key pointing to entries in the mms table containing the message this attachment was sent in.
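As a rough sketch of how those notes translate into a query, the example below builds a tiny in-memory stand-in database and joins part to mms. Only the part columns ct, unique_id, caption and mid come from the notes; the `_id` and `body` columns, the sample rows and the function names are assumptions for illustration, and the current schema may well differ.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database shaped like the old schema described in the notes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mms (_id INTEGER PRIMARY KEY, body TEXT);  -- columns assumed
        CREATE TABLE part (
            _id INTEGER PRIMARY KEY,  -- assumed
            mid INTEGER,              -- foreign key into mms
            ct TEXT,                  -- original mime type
            unique_id INTEGER,        -- attachment ID
            caption TEXT
        );
        INSERT INTO mms VALUES (1, 'Holiday photo'), (2, 'Receipt');
        INSERT INTO part VALUES
            (10, 1, 'image/jpeg', 1700000000001, 'Beach'),
            (11, 2, 'application/pdf', 1700000000002, NULL);
        """
    )
    return conn


def attachments_with_messages(conn):
    """List (attachment ID, mime type, caption, message body) for every attachment."""
    query = """
        SELECT part.unique_id, part.ct, part.caption, mms.body
        FROM part JOIN mms ON part.mid = mms._id
        ORDER BY part.unique_id
    """
    return conn.execute(query).fetchall()


for row in attachments_with_messages(build_demo_database()):
    print(row)
```

Against a real decrypted database, the same SELECT could be pasted into sqlite-browser once the table and column names have been checked.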

Good luck, and in the worst case you've at least got a way of getting the images out in the meantime, if not their metadata!

Thanks for sharing!

mossblaser avatar Feb 26 '24 11:02 mossblaser

Here's my solution. In decrypt_backup.py, I used

            if backup_frame.HasField("attachment"):
                filename = attachments_directory / f"{backup_frame.attachment.rowId}.bin"

instead of

            if backup_frame.HasField("attachment"):
                filename = (
                    attachments_directory
                    / f"{backup_frame.attachment.attachmentId}.bin"
                )

where .attachment.rowId is the actual identifier in the database, and as such works well later if you want to remap the binary files.
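To illustrate the remapping idea, here is a minimal sketch that plans renames from `<rowId>.bin` to a name with a sensible extension. It assumes you have already read a row ID to mime type mapping out of the database yourself; `planned_renames`, the `EXTENSIONS` table and the sample values are hypothetical and not part of this repo.

```python
from pathlib import Path

# Small hand-written mapping; extend it for the mime types in your own backup.
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png", "video/mp4": ".mp4"}


def planned_renames(attachments_directory, mime_types_by_row_id):
    """Return (source, target) path pairs without touching the filesystem."""
    renames = []
    for row_id, mime_type in sorted(mime_types_by_row_id.items()):
        source = Path(attachments_directory) / f"{row_id}.bin"
        # Unknown mime types keep the .bin suffix.
        extension = EXTENSIONS.get(mime_type, ".bin")
        renames.append((source, source.with_suffix(extension)))
    return renames


for source, target in planned_renames("attachments", {1: "image/jpeg", 2: "image/png"}):
    print(source, "->", target)
```

Returning a plan rather than renaming directly makes it easy to inspect the result before calling `source.rename(target)` on each pair.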

Edit: Right, that was your solution as well @Nodeswitch , I just read your Edit 😃 .

shoufanzaid avatar Apr 29 '24 02:04 shoufanzaid

Here's a gist to cleanly extract messages and attachments after decrypting a Signal backup using this repo: https://gist.github.com/shoufanzaid/8869fd133d2e11e3b495995a88f9f1e3.

shoufanzaid avatar Apr 29 '24 02:04 shoufanzaid

Sorry it has taken me so long to get around to this but I've updated the script to use rowIds to name attachments.

Thanks everyone for the debugging!

mossblaser avatar May 03 '24 20:05 mossblaser