signalbackup-tools
signalbackup-tools copied to clipboard
Allow merging with conversstion from text file
I like the feature of merging backups
It will be great if backup could be enhanced with history from other communicators SMS, WhatsApp, Telegram etc
My idea is simple just read CSV file with chatname,timestamp,number,text and merge it with matching conversation.
What do you think? I can help with implementation of I get some hints.
After careful reading the possible options. It looks like running insert SQL query will do the job. So the only problem is conver CSV into set of SQL queries
@janisz Have you looked at the .import command? Seems to do what you're looking for: https://www.sqlite.org/cli.html#importing_csv_files
Yes, sure, I like this idea and it seems simple enough, at least for non-group text messages (without any attachments).
Note that the sms table in the database has a ton of fields, most of them will probabaly remain empty, or get default values. Testing will have to show wether this causes any problems. Just from the top of my head, minimally required fields would be number, timestamp, direction, text.
You mention Whatsapp and Telegram, do they actually export csv files? Is there documentation on their format?
Thanks!
From WhatsApp one can export chat (one chat at a time) to ascii files in the following format and can be converted to CSV (with some manipulation): m/d/yy, h:mm AM/PM - {source phone}: message
the only problem would be identify which tables need to be populated using sqllite into Signal db.
Ok, thanks. I've started a little work on this (nothing to test yet though). There are a few difficulties to overcome, but I'll get back to it as soon as I have some spare time.
I would like to at the very least support the csv files that this program outputs and something that can easily be made from that WhatsApp output @kkj9 mentions. I'm not sure what to do about that one though. It's unfortunate that the message direction (incoming/outgoing) is not in that file as the backup has no concept of which phone number is 'self' and which is the contact. And it would get even more difficult when a chat has only outgoing (or incoming) messages, a rarity but certainly not impossible. Not sure how to get around that yet.
@kkj9: Thanks for the info. Can anyone else (from another region) confirm this format? I'm especially curious whether the time format is dependent on regional settings. If it is not too complicated I would like to be able to automatically convert that time to the ms timestamp Signal expects.
The format is right but I know (source) that datetime may have different format due to system settings and this file could not be extracted in some countries (e.g. Germany)
I played with DB with following query
select g.title, r.system_display_name, s.body, * from sms as s
join recipient as r on r._id == s.address
join thread as t on t._id == s.thread_id
join groups as g on g.recipient_id == t.recipient_ids
It basically gives all data we need. So assuming we have export with:
- group name
- timestamp
- display name
- message
We should be able to find thread_id with following query
select t._id as thread_id from groups
join thread as t on recipient_id == t.recipient_ids
where title == <group_name>
We can get address with
select _id as address from recipient
where system_display_name == <display_name>
| Column | Value |
|---|---|
| _id | autoincrement |
| thread_id | select t._id as thread_id ... where title == <group_name> |
| address | select _id as address ... where system_display_name == <display_name> |
| date | <timestamp> |
| date_sent | <timestamp> |
| date_server | <timestamp> |
| read | 1 |
| type | ❓ |
| body | <message> |
other fields have their default value. So the only question is: "What the type is?
Yes, that all looks right. The type is a bitmasked number that has some information on the message type, most importantly incoming/outgoing and wether it was a secure (Signal) or unsecure (sms) message. See: https://github.com/signalapp/Signal-Android/blob/master/app/src/main/java/org/thoughtcrime/securesms/database/MmsSmsColumns.java#L31
I think it would be wise to just have a separate importWhatsappChat option to deal with this specific format. But I still have questions about it.
21.12.2019, 14:00 - John Doe: OK 😃
You mention getting the thread_id from the group.title, but where in the WhatsApp export is the group title mentioned? Also, what happens if there are newlines in the message body? What happens if there are semicolons in the author's name? Does the file format use any escape codes for these situations?
Android: I have Polish 🇵🇱 locale on my phone and export file is named in this pattern
Czat WhatsApp z <chat_name>.txt in English it's: `WhatsApp Chat - <chat_name>.txt.
iOS: exported chat is in zip named WhatsApp Chat - <chat_name>.zip with a single file _chat.txt
Where chat_name could be group name or contact display name so we need to handle that.
And 2 more things:
- Messages could be multi line so if line had no timestamp, user prefix then it's part of message from previous line.
- There could be some system messages without sender name
25.02.2019, 14:11 - Wiadomości i połączenia są w pełni zaszyfrowane. Nikt spoza czatu, nawet WhatsApp, nie może ich odczytać ani odsłuchać. Dotknij, aby dowiedzieć się więcej.
14.10.2019, 12:14 - Zmieniono ikonę grupy
[07/07/2018, 12:42:03] <diplay_name>: Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them.
- Mentions are in format of
@<phone_number>we need to check how to handle that or at least change to display name
@bepaald what I noticed is that in the Signal db the time is stored in universal time zone (UTZ)/GMT and to the user it is displayed in the local time
Thanks for the info, both of you.
@kkj9 Yeah, that's right. I'm not too worried about the format in the Signal db, I'm pretty familiar with most of it. More about the format of the whatsapp export.
To add to janisz' earlier investigation on the Signal database, I should note that texts with mentions in them do not go into the 'sms' table, but to the 'mms' table. Also, in group chats, all outgoing messages go in the 'mms' table (even if they have no media attached). The 'mms' table is otherwise pretty much the same as the 'sms' table (for our purposes), except that the 'type' column is named 'msg_box'. Also, more columns should be inspected when trying to match the name: system_display_name could be empty, then signal_profile_name should be checked.
Messages could be multi line so if line had no timestamp, user prefix then it's part of message from previous line.
So... what happens if one sends the following message:
This is one message consisting of two lines, with a datestring after the newline
25.02.2019, 14:11 - <display_name> : More text
Is it possible to determine in any way that this is actually one single message?
Also, I've seen examples online showing status messages like this:
25/09/16, 21:50 - Nick Fury created group "Avengers"
18/06/17, 22:45 - Nick Fury added you
18/06/17, 22:45 - Nick Fury added Hulk
18/06/17, 22:45 - Nick Fury added Thor
18/06/17, 22:45 - Nick Fury added Tony Stark
Now, what if Nick Fury's name was not Nick Fury, but Nick Fury:, or if there is a colon the groupname? The program will probably parse these as messages. Similarly, in group chats (where the author's name can't be inferred from the filename), if the name or the message contains a colon, there is no way to know where (at which colon) the name ends and the message begins?
I've been working on trying to parse lines like the examples I have so far, but it is a lot of work trying to determine automatically what format the date has. It would probably be simplest to just require the user to provide the proper date format as an argument. And also require the filename to be the name of the conversation in the Signal db the messages should be imported into. Then the only hard bits are dealing with weird multiline messages like I mentioned above, names/messages with colons in them and messages with mentions. The mentions shouldn't be too difficult (though it should also handle the possibility that the mentioned phone number is not actually in the Signal db), but I have not yet investigated mentions properly. They are a somewhat recent addition to Signal and all my dev-phones are tied up at the moment.
(Note, I'm really just leaving reminders for myself here, while working on this)
Since the time format in whatsapp exported chats seem to be in minutes, often messages might happen at 'exactly the same' time, Signal will have no way to sort them, and they might appear in random order after importing. Should be easy to deal with, since Signal uses time in milliseconds, so just adding a millisecond to successive messages will allow 60 * 1000 messages in the same minute to appear, simply using the order of appearance in the WA-chat file.
Just a quick update. I have something working implemented now. It's not pretty, but it seems to do the job (as best as I can with the limited txt-format). When I have time I'll start on scanning for 'mentions', then it needs proper testing (with real chats instead of the dummy ones I'm using).
If anyone wants to help bug hunting early, let me know and I can write up some instructions on using this function as it is now.
ok, good. can you send instructions? Thank You
So, get the latest version from here, then run with the following options:
--importwachat [FILENAME], where FILENAME is the exact title of the chat (a contact or group name) as it appears in your Signal conversation list, followed by '.txt'
--setwatimefmt [time format], where time format is a description of the datetime as it appears in the chat, up to and including the last character before the author's name. The time format specifiers are the same as found here: https://en.cppreference.com/w/cpp/io/manip/get_time
--setselfid [OWNID], where OWNID is the exact name by which you yourself appear in the Signal chat. Without this, it is not possible to know which of the messages in a group chat are incoming and which are outgoing.
Example:
$ cat devgroup.txt
18.06.2017, 22:45 - Devphone Blue created group "something"
21.12.2019, 14:00 - Devphone Black: OK 😃
21.12.2019, 14:01 - Devphone Black: OK 2
21.12.2019, 14:02 - Devphone Blue: A long message spanning two
lines.
21.12.2019, 14:03 - Devphone Black: OK 3 😃
21.12.2019, 14:04 - Devphone Blue: OK 4 😃
21.12.2019, 14:05 - Devphone Black: OK 5 😃
18.06.2020, 22:45 - Devphone Blue added Tony Stark
$ ./signalbackup-tools ~/PHONE/DEVsignal-2021-01-21-17-28-15.backup 280239433569600148660196116525 --importwachat devgroup.txt --setwatimefmt "%d.%m.%Y, %H:%M - " --setselfid "Devphone Black"
signalbackup-tools (./signalbackup-tools) source version 20210128.203714 (OpenSSL)
IV: (hex:) 3d db 7f 92 bd 99 d8 77 52 bc b8 8e eb f7 79 4a (size: 16)
SALT: (hex:) 5c a9 cf fb b4 35 88 75 55 f4 7a 86 48 87 b4 50 4f fe d2 ac ac 95 2d aa b6 ee 7c 3a 8f d1 f6 a0 (size: 32)
BACKUPKEY: (hex:) 79 ce 0d 59 ec aa 67 2e a7 fb 96 c2 68 9c b4 37 87 5d 29 27 31 57 8b 8d f4 9c 63 bb eb c1 0c fa (size: 32)
CIPHERKEY: (hex:) d3 ea ab 2b 06 6d b7 c6 02 b7 f7 40 70 49 9e 59 56 b7 fc a6 20 fa 24 8b 98 14 79 6c 57 14 7a d5 (size: 32)
MACKEY: (hex:) 22 d9 e0 d4 24 2d 14 22 d4 7c a8 2b 2d 98 c2 4f a3 c7 2d 2a a4 e6 e5 f7 8a e0 ca fe ee bb d9 f4 (size: 32)
COUNTER: 1037795218
Reading backup file...
FRAME 237 (100.0%)... Read entire backup file...
done!
Looking for conversation: 'devgroup'
Importing messages into thread: 2
No colon => some type of status message? SKIPPING LINE : 'Devphone Blue created group "something"'
Invalid timefmt (%d.%m.%Y, %H:%M - ) => continue previous message
No colon => some type of status message? SKIPPING LINE : 'Devphone Blue added Tony Stark'
Imported 6 messages from file 'devgroup.txt'
Exporting backup to 'signal-2021-01-28-00-00-00.backup'
Writing HeaderFrame...
Writing DatabaseVersionFrame...
Writing SqlStatementFrame(s)...
Dealing with table 'sms'... 11/11 entries...done
Dealing with table 'mms'... 9/9 entries...done
Dealing with table 'part'... 0/0 entries...
Dealing with table 'thread'... 2/2 entries...done
Dealing with table 'identities'... 4/4 entries...done
Dealing with table 'drafts'... 0/0 entries...
Dealing with table 'push'... 0/0 entries...
Dealing with table 'groups'... 2/2 entries...done
Dealing with table 'recipient'... 7/7 entries...done
Dealing with table 'group_receipts'... 8/8 entries...done
Dealing with table 'sticker'... 80/80 entries...done
Dealing with table 'storage_key'... 0/0 entries...
Dealing with table 'mention'... 2/2 entries...done
Dealing with table 'remapped_recipients'... 0/0 entries...
Dealing with table 'remapped_threads'... 0/0 entries...
Writing SharedPrefFrame(s)...
Writing Avatars...
Writing EndFrame...
Done!
Before and after (note the imported messages date is before the actual group creation):

Requirements and limitation (that I can think of):
- no media (yet? I would like to do this. are they exported in the chat?)
- no mentions yet, but I've already investigated and it shouldn't be hard to implement, just need a bit of free time.
- filename must exactly and uniquely identify an existing conversation (group or contact name) in the Signal backup
- messages have format '<timestamp><author>: <message>' where timestamp is passed as 'fmt', author exactly and uniquely identifies an existing contact in the Signal backup.
- due to all the possible languages the exported chat can appear in, the format of the time must be provided, and status messages are skipped.
- Since the exported chat does not seem to use any escape sequences for special characters, these could lead to unexpected results, for example a colon (:) in the name of an author.
- To mark messages as received (or read, with a future option) all group members must be known. For this, the chat is scanned for all authors. If a group member has never sent a message, the member will be missing from the delivery receipts in the message details.
- Similarly, since status messages are not (can not be) parsed, if a member is added to the group at the end of the chat, the member will be marked as a recipient even before the member was added. Workaround: manually split the chatfile at the point of adding or removing members.
If anyone can think of any more requirements or limitations, let me know. More importantly, if anyone can think of ways to relax or remove some of the mentioned requirements and limitations, please let me know.
After almost two years of no feedback I think it's time to close this issue. Inserting messages has come a long way since these first attempts. If anything like this is still desired, the first solution would be to convert your TXT into a json format compatible with what's used in #153. If that's not possible I think a new issue needs to be opened at this point.