whatsapp-database-merger icon indicating copy to clipboard operation
whatsapp-database-merger copied to clipboard

Tool not working since mid july due to changes in database schema

Open AlbertCalvet opened this issue 2 years ago • 22 comments

Seems like in version 2.22.7.74 the schema changed a lot, most likely to this feature of being able to switch between iOS and Android more seamlessly.

This cool author blogpost has some info to get started with the changes, the TLDR is that the messages (plural) tables have been dropped and now they are called message and message_* and there are a few more.

https://thebinaryhick.blog/2022/06/09/new-msgstore-who-dis-a-look-at-an-updated-whatsapp-on-android/

Do you think you could find some time to update the tool? Ideally it would detect the version and would be able to merge both and migrate to the new schema for the output database.

I'm very interested in this so if you say that you won't be able to look at it anytime soon I might try to give it a chance but I'm new to Kotlin, only being able to run the project in IntelliJ took me an hour LOL.

Thanks in any case for this contribution!

AlbertCalvet avatar Aug 06 '22 09:08 AlbertCalvet

Hi! I am not able to spend time on this but if you plan to work on it, I can give you some pointers.

natario1 avatar Aug 06 '22 09:08 natario1

Hi! I am not able to spend time on this but if you plan to work on it, I can give you some pointers.

Hi I can spend some time on this but I am really out of my depth in knowing any kotlin or anything of that sort. I would be happy to get any pointers from you.

I will start off with what I think I can do then some questions, I recognise the names in the schema.kt and I believe this list all the cross-talk ids you check for and the tables they reside in. I think I will be able to update this to the October Schema since I recognise them.

val messages by table( hasId = true, refs = listOf(Table.Ref("quoted_row_id", messages_quotes)), timestamp = "timestamp" Am I right in that where it looks for ids related to other tables and where to sync the number is in the refs section where the values in brackets are the column (quoted_row_id) it looks in followed by the table (messages_quotes)?

I am not sure if the need would arise but I feel like it is possible, can I add more id/table pairs in this section and it will sync those up as well? I think you already have all the various ids that may be required at least once somewhere in there.

Do you also update and offset the jid in addition to the id in messages? I cannot identify in my lack of knowledge where it does the offsets and figures out what ids need to be synchronised. My best guess is that the schema.kt files defines the relationships and the code goes off this to then do the right offset and syncing with each id/table pair listed there.

A more general question is: does the tool assume the schema/format of the database to be the same for both db files and that the order of which file is newer or older does not matter?

MrBarbie avatar Oct 13 '22 08:10 MrBarbie

Hi! Schema class has an object called object March2022. You should create another, let's say October 2022, so that the file looks like this:

Schema { object March2022 { ... } object October2022 { ... } }

You can start by making october identical to march, then make your modifications. All tables that you add (val tableName by table()...) will be processed. The library does (automatically) the best offset management I have seen online, but we need to know a few things about each table, which you will pass to the table() function:

  • hasId : whether this table has an identifier or not. I think the ID is assumed to be a long number at the first column. This worked until know, hope it doesn't break.
  • refs : as you said. Imagine two tables messages and users. If messages have a sender column that refers to the user id, the definition of messages will have to add a reference Table.Ref("sender_id", users).
  • selfRefs: some tables can reference themselves. for example, messages can be a reply to another message. We need to know about this.
  • excludes: very problematic columns that you weren't able to make them work. If you add them here they will be dropped.
  • uniques: SQL tables can have unique constraints, take this as an example : https://github.com/natario1/whatsapp-database-merger/blob/main/src/main/kotlin/dev/natario/Schema.kt#L100-L104 . The comment also shows you how to find these. It's important to avoid conflicts as we create new data.

All of these are important. Everything that's either an ID or a reference will be updated, the rest just copied.

does the tool assume the schema/format of the database to be the same for both db files Yes. We can't make it work with very different formats, otherwise we also need to introduce some schema conversion utility. But the best schema conversion utility is currently whatsapp itself, it can do this migration to newer formats.

and that the order of which file is newer or older does not matter? Currently order does not matter AFAIR.

As you test remember to update Main.kt to use Schema.October2022 instead of March2022. Ideally we want a command line option here, use the other options as a reference

natario1 avatar Oct 13 '22 10:10 natario1

Gotcha thank you very much for the explanations. They are very helpful.

The library does (automatically) the best offset management I have seen online, but we need to know a few things about each table, which you will pass to the table() function:

Yeap yours makes it the easiest to tell it which ids refer to where etc and let it do it's job. I will probably work on this since my database is a bit messy and this would be less hacky.

  • hasId : So each table will only have the first column as it's id to be refered to by other tables correct? I dont think I have come across any that breaks this as far as I have been able to poke around for the moment but there are so many table I cannot gaurantee it. I know the article highlights many of them and from my brief run through it will be fine.

  • selfRefs: So the use case for this is if another column in the table needs to increment the same amount as it's id correct?

Yes. We can't make it work with very different formats, otherwise we also need to introduce some schema conversion utility. But the best schema conversion utility is currently whatsapp itself, it can do this migration to newer formats.

Yeah I thought this would be the case. A conversion utility would definitely be way beyond me (and I imagine a nightmare to support with needing options and even a way to find what version it is) haha I was minimally hoping maybe your may be close to have the ability to select the versions.

As you test remember to update Main.kt to use Schema.October2022 instead of March2022. Ideally we want a command line option here, use the other options as a reference

Righto, I most likely wont be able to do this unless you have some code in there i can reverse engineer. Even then probably need a breather to look at it another day

Again thanks for your support, should be able to get it working again. Once I update it and submit a pull, would it auto build a release? Or will I need to figure that part out as well?

MrBarbie avatar Oct 13 '22 11:10 MrBarbie

Yeah, when you do val messages by table(refs = Table.Ref("sender_id", users), it means that messages.sender_id refers to the "id" column of users, which we assume to be the first.

selfRefs: So the use case for this is if another column in the table needs to increment the same amount as it's id correct?

When you merge dbs all IDs might change. Imagine I have a message with id 200 with a selfRef at column repliesTo pointing to message 100 (e.g. message200 is a reply to message100). When we process message200, we will update message200.id to something else to avoid conflicts with the other databases. But we must also update message200.repliesTo : here we should put the new ID of message100. Library keeps track of this.

Once I update it and submit a pull, would it auto build a release? Or will I need to figure that part out as well?

You will make the changes locally and test them on your machine. Once you know it worked, you can open the PR and we can check how to release for other people later.

natario1 avatar Oct 13 '22 12:10 natario1

Hi again, I have encountered a problem while trying to build with the new schema. I am getting the error Variable 'message' must be initialized. I assume this is because the tables to be refered to later needs to be at the top hence the specific order you have laid out the tables. However this is a problem with the new schema because they all basically cross-reference each other. jid can be loaded first but there is no order for chat and message after jid that works because chat and message reference each other. So the one that goes first will throw the error for the second one. This is followed by Type checking has run into a recursive problem. Easiest workaround: specify types of your declarations explicitly at the line where it refers to the table with that first error.

I tried to use the below to just load the ids for message so that chat can load followed by the full message table with all the references val message by table( hasId = true, ) But I get the following error Conflicting declarations: public final val message: Table, public final val message: Table Overload resolution ambiguity: public final val message: Table defined in dev.natario.Schema.October2022 public final val message: Table defined in dev.natario.Schema.October2022

Any suggestions for this? I dont have the slightest idea where I can potential change.

Also 2 questions: Is the timestamp entry there to format any column that has timestamps? Meaning I have to add each column with timestamps there or does it serve another purpose? What do you mean by Seen these columns to be inconsistent, is it just that they sometimes have NULL values or some other checks as well?

MrBarbie avatar Oct 16 '22 05:10 MrBarbie

@MrBarbie I updated code to enable circular references in tables. You can now do, for example:

val xxx: Table by table(hasId = true, refs = listOf(Table.Ref("zzz", { yyy })))
val yyy: Table by table(hasId = true, refs = listOf(Table.Ref("zzz", { xxx })))

The timestamp column has no purpose for now, but it would be nice if you add it. I wanted to support a feature like "delete all database entries between March 3rd 2021 and April 20th 2022". Knowing which column represents time, will allow this.

Seen these columns to be inconsistent

I don't remember. They gave some errors on my own databases, so I disabled consistency checks for them.

natario1 avatar Oct 18 '22 10:10 natario1

@natario1 Great thanks for updating it. Still getting complied failures though (6, 14): Redeclaration: Schema Which referes to this part sealed class Schema which is new (53, 53): Variable 'message' must be initialized' which refers to the first reference to the message table which as of this line is not declared yet (76, 42): Type checking has run into a recursive problem. Easiest workaround: specify types of your declarations explicitly And this is the first mention of the chat table which threw the error above it

(65, 23): Overload resolution ambiguity: public constructor Ref(name: String, tableProvider: () -> Table, ignoreConsistencyChecks: Boolean = ...) defined in dev.natario.Table.Ref public constructor Ref(name: String, table: Table, ignoreConsistencyChecks: Boolean = ...) defined in dev.natario.Table.Ref Not sure why this is appearing on a later line within the chat table definition refering to the message table rather than the very first mention

I have submitted a pull request incase it might help you troubleshoot. I have updated the schema to have at least all current tables correct. May add more at a later date if it runs again. But at the moment this error is deep into kotlin territory of which I am no help.

MrBarbie avatar Oct 18 '22 11:10 MrBarbie

Ok! So the code in your PR is the one that does not compile? Or did you remove the problematic lines?

natario1 avatar Oct 18 '22 11:10 natario1

It is the one that does not compile

MrBarbie avatar Oct 18 '22 11:10 MrBarbie

Ok, I fixed it for you

natario1 avatar Oct 18 '22 11:10 natario1

Oh thanks! Damn it I thought it might have been that specific formatting but I applied to both and still got the same error. Shows my complete inexperience with kotlin XD I also didnt know you could edit directly without pull requests

MrBarbie avatar Oct 18 '22 11:10 MrBarbie

@natario1 Ok managed to run it and found a whole bunch of inconsistencies which I think is what you found last time. Most of these are because of a weird line at the top of all my message table that refer to an chat id of -1 which is impossible, OR is a last something msg which may be deleted cause it refers to an older msg.

I think it was near the end as it already got to [*] processing table chat - input\msgstore_1.db: 80 rows, columns: [*list of columns*] - output\msgstore.db: 122 rows, columns: [*list of columns*] [*] chat: editing column jid_row_id (80 entries), because it references table jid which was previously edited (offset=1356, dupes=458) then it failed with the following msg Exception in thread "main" java.lang.IllegalArgumentException: Table chat depends on message but mapping was not computed. at dev.natario.MainKt$merge$1$6.invoke(Main.kt:122) at dev.natario.MainKt$merge$1$6.invoke(Main.kt:117) at dev.natario.MainKt.processReferences(Main.kt:271) at dev.natario.MainKt.merge(Main.kt:117) at dev.natario.MainKt.main(Main.kt:67)

Could this be because you havent implemented mapping for selfRefs? Cause message does have a selfRef Edit: I think this is the circular reference coming to bite us again.... if I swap the order of the message and chat table it will fail with Table message depends on chat but mapping was not computed. instead. And looking closer at the log it is the second table it works on both times. Also added the preceeding line for the above chat error that suggests it is only starting to process the erroring table because it depends on a unprocessed table.

MrBarbie avatar Oct 18 '22 12:10 MrBarbie

Do you think something like below could work in Main.kt (around line 117)? If I put if ( table.name == "jid" ) { computeIdMapping( sourceColumns = sourceColumns, sourceData = sourceData, destColumns = destColumns, destData = destData, table = schema.message\\unsure how to define message table ).also { mappings[table] = it println("Mapped message") } } between if (type == MergeType.Combine) { and processReferences( columns = sourceColumns, entries = sourceData, table = table, selfReferences = false, // can't process self references before IdMapping has been computed. Will do later getMapping = { requireNotNull(mappings[it]) { "Table $table depends on $it but mapping was not computed." } }, )

MrBarbie avatar Oct 19 '22 04:10 MrBarbie

It looks like this part of the code is not able to deal with circular references (e.g. chat table references messages, which references chat).

It can deal with self references, which is the shortest circular possible reference. But not more complex ones like those you have introduced. I'm not sure how to fix it. There should probably be two passes, e.g. transform what we have now:

schema.forEach {
   ...
   processReferences(selfReferences = false)
   computeIdMapping(...)
   applyIdMapping(...)
   processReferences(selfReferences = true)
}

Into something like:

schema.forEach {
   processReferencesForWhichWeHaveAMapping()
   computeIdMapping(...)
   applyIdMapping(...)
}
schema.forEach {
   processReferencesThatWeCouldNotProcessBeforeNowThatAllMappingsHaveBeenComputed()
}

Basically, the whole selfReferences = false/true flag was a trick to handle the simple self referentiality. But now we have new kinds of circular deps, so we need to change this part

natario1 avatar Oct 21 '22 09:10 natario1

Gotcha on the selfReferences.

I am not sure what the difference between the non-self mapping with the selfRef mapping is but the non-self one I introduced here causes it to just outright stop when it fails so we wont be able to get into the second round.

The problem with this

Into something like:

schema.forEach {
   processReferencesForWhichWeHaveAMapping()
   computeIdMapping(...)
   applyIdMapping(...)
}
schema.forEach {
   processReferencesThatWeCouldNotProcessBeforeNowThatAllMappingsHaveBeenComputed()
}

Is that for the first part we basically only have jid atm which completes no problem but once it gets to message or chat the way it is written now as I understand it, fails just checking to see if the mapping is ready ie on processReferencesForWhichWeHaveAMapping()

Can we only computeIdMapping for the message or chat so that it has the maps it wants are there. Then process it through the normal process. What I am suggesting is something like so:

schema.forEach {\\for just `jid`
    processReferencesFor`jid`WhichWeHaveAMapping()
    computeIdMapping(...) \\ do it for `chat` and `message` during `jid`
    applyIdMapping(...)
 }
 schema.forEach { \\for everything else
    processReferencesThatWeCouldNotProcessBeforeNowThatAllMappingsHaveBeenComputed()
    computeIdMapping(...)
    applyIdMapping(...)
 }

Does it not work because you have to process it first to be able to compute the mapping?

If so, is the best way to restrict the schema.forEach to certain subsets of tables through using another selfReferences = false/true flag but for these more complex references?

MrBarbie avatar Oct 21 '22 10:10 MrBarbie

Ok I hacked together something that compiles and runs and claims it has suceeded. But still veryfing it at the moment.

MrBarbie avatar Oct 22 '22 03:10 MrBarbie

Alright, I believe I have got a version working now.

It may not be a pretty solution (no special tags or anything and may require/impossible manual fixing if schema changes too much again) but that is the limit of what I can do. Anything that goes any layer deeper I cannot understand it enough to edit (probably some specific limitations I am unware of and the full options of the syntax to know what each line is doing exactly).

What I did in the end is

schema.forEach {
   processReferencesForJID_WhichIsIndepent()
   computeIdMapping(...)
   applyIdMapping(...)
}
schema.forEach { ForMessageOnlySoThatWeCanProcessChatAfter
   computeIdMapping(...)
   applyIdMapping(...)
}
schema.forEach {
   processReferencesThatWeCouldNotProcessBeforeNowThatAllMappingsHaveBeenComputed()  { Basically anything not jid}
   computeIdMapping(...)
   applyIdMapping(...)
}

#7

MrBarbie avatar Oct 23 '22 04:10 MrBarbie

so how to build/compile this October update to be runnable application?

ermech avatar Nov 10 '22 06:11 ermech

so how to build/compile this October update to be runnable application?

Do you know how to build kotlin? I followed guides for building kotlin with gradle.

I have uploaded an intial version in the fork as well

MrBarbie avatar Nov 12 '22 02:11 MrBarbie

Hi thanks for this. Gonna try it.

I have uploaded an intial version in the fork as well

By this did u mean the zip in your fork? In this link.

Azfarrr avatar Nov 14 '22 10:11 Azfarrr

By this did u mean the zip in your fork? In this link.

Correct. The tar files is just another format for it. That v0.2.0a is only up to commit 9eb0906 but functionally it is not different to the latest one as of now

MrBarbie avatar Nov 16 '22 12:11 MrBarbie