pg_chameleon
Data lost when rows exceed batch size.
Describe the bug: Data is lost when the number of rows exceeds the batch size.
To Reproduce
- set `replica_batch_size: 10000` in the configuration file.
- insert 18000 rows into the source database (a load script is sketched below).
- chameleon replicates only 10000 rows; the other 8000 are lost.
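A minimal load sketch, assuming pymysql is available; the table name, schema, and connection settings below are placeholders, and the only requirement is inserting more rows than `replica_batch_size`:

```python
# Hypothetical load script: inserts 18000 rows so the row count exceeds
# replica_batch_size (10000). Table and credentials are placeholders.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="usr_replica",
                       password="replica_pwd", database="my_schema")
try:
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS t_batch_test "
                    "(id INT PRIMARY KEY, payload VARCHAR(64))")
        rows = [(i, "row-%d" % i) for i in range(18000)]
        cur.executemany("INSERT INTO t_batch_test (id, payload) "
                        "VALUES (%s, %s)", rows)
    conn.commit()
finally:
    conn.close()
```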
After debugging the code (v2.0.12 or master), I found this snippet in mysql_lib.py:

```python
# mysql_lib.py, lines 1194-1196
if close_batch:
    my_stream.close()
    return [master_data, close_batch]
```

According to this code, the rows still held in `group_insert` will not be stored.
I changed it to:

```python
# mysql_lib.py, lines 1194-1195
if close_batch:
    break
```

Then the result was correct.
I wasn't able to reproduce the bug; the rows you mentioned seem to be stored correctly. I've just merged some changes into master/ver2.0. Can you try again please?
In the current version (tag v2.0.14), lines 1266 and 1267 should be a single statement: `break`. That lets the code block beginning at line 1375 run, which writes the batch remaining in the variable `group_insert`. With the current lines 1266 and 1267, `group_insert` is never written, and when the function `__read_replica_stream` is next called, `group_insert` is assigned an empty array at line 1144, so the pending rows are lost.
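To illustrate the control flow being described, here is a toy model with invented names (not the actual mysql_lib.py source): an early `return` on `close_batch` skips the flush that runs after the loop, while `break` falls through to it.

```python
# Toy model of the reader loop: rows are written in chunks inside the loop,
# and whatever is left in group_insert must be flushed after the loop.
def drain_events(events, flush_every=3000, batch_quota=10000, buggy=False):
    group_insert, stored, seen = [], [], 0
    for row in events:
        group_insert.append(row)
        seen += 1
        if len(group_insert) >= flush_every:  # periodic write inside the loop
            stored.extend(group_insert)
            group_insert = []
        if seen >= batch_quota:               # close_batch becomes True
            if buggy:
                return stored                 # like lines 1266-1267: remainder dropped
            break                             # the fix: fall through to the final flush
    stored.extend(group_insert)               # like the block at line 1375
    return stored

# Rows past the quota would be picked up by the next batch in the real replica.
print(len(drain_events(range(18000), buggy=True)))   # 9000 -> 1000 buffered rows lost
print(len(drain_events(range(18000), buggy=False)))  # 10000 -> remainder flushed
```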
I'm trying to avoid saying "me too", because I'm struggling to understand the scope of the problem we're witnessing. It appears that "some" rows in one table "sometimes" aren't updated when the source rows are. The batch sizes in question are far, far lower than our max batch size (10000 in config; the actual size tends to be less than 200). We haven't been able to identify the criteria for failure as yet, and are not entirely sure it's limited to one table (that wouldn't make sense, but the table does hold a large amount of text, so that might be an issue somehow). I am actively looking into the problem as we speak and hope to have more concrete information later today.
OK... I think I might have found the problem, and agree with @wycwyx about the fix, but not the cause. My understanding is that this flow may occur:
- Some rows are added to `group_insert` on line 1398, but not yet written at 1405-1412.
- We loop around to the next `binlogevent` at 1219.
- The next event is a `QueryEvent` at 1250 and is also something that is `true` at 1257.
- The function is exited at 1308 without ever writing the contents of `group_insert` (see the toy sketch below).
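A self-contained toy of that flow (invented structure, not the real reader) shows how an early exit on a query event silently drops the buffered rows:

```python
# Toy reader (hypothetical names): row events buffer into group_insert and
# are written once the buffer fills; a qualifying query event exits the
# function early, so anything still buffered is never written.
def read_stream(events, flush_every=2):
    group_insert, written = [], []
    for kind, payload in events:
        if kind == "row":
            group_insert.append(payload)          # like line 1398
            if len(group_insert) >= flush_every:  # like the write at 1405-1412
                written.extend(group_insert)
                group_insert = []
        elif kind == "query":                     # like the QueryEvent at 1250/1257
            return written                        # like the exit at 1308: buffer dropped
    written.extend(group_insert)                  # final flush after the loop
    return written

events = [("row", 1), ("query", "CREATE TABLE t (id int)"), ("row", 2)]
print(read_stream(events))  # [] -> row 1 was buffered but never written
```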
I've managed to get a copy of the binary log for an example of this occurring in our production environment, and whilst I think my previous point holds true, the log doesn't support it being the cause. I'd like to share the log, but it's several hundred megabytes and contains personal information which I cannot disclose.
What I see is very strange though: there are two updates to the table in question, the first wasn't applied but the second was... then some updates to a schema not replicated to PostgreSQL... and then another update to the same table, not applied... then an insert to a different table which was.
What I don't see in the log is anything in between the successful ones and the failing ones that would be classed as a `QueryEvent` and trigger the above flow. On that note, the part about being `true` on line 1257 is only required if `close_batch` was previously false. If `close_batch` was already `true`, a `QueryEvent` would drop the existing contents of `group_insert` regardless.
Is it possible that `mysqlbinlog` is lying to me and hiding `QueryEvent`s? I didn't think these would be logged at all, but maybe it includes things I'm not thinking of and it's hiding them by default.
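One way to rule out `mysqlbinlog` filtering would be to read the binlog directly with python-mysql-replication, the library pg_chameleon itself builds on, and print every `QueryEvent`. A sketch, with placeholder connection settings:

```python
# List QueryEvents straight from the binlog stream, bypassing mysqlbinlog.
# Host, credentials and server_id below are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.event import QueryEvent

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "usr_replica", "passwd": "replica_pwd"},
    server_id=999,              # any id not used by another replica
    blocking=False,             # stop at the end of the current log
    only_events=[QueryEvent],   # skip row events entirely
)
for event in stream:
    print(event.timestamp, event.query)
stream.close()
```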
I dunno... I'm really confused by this, so apologies for just brain-dumping on here; I'm just sharing what I'm finding. We have a steady trickle of the problem in our production environment, apparently only in one table, and I can't for the life of me recreate it outside of that... and all the tools, logs, statuses etc. suggest it's not happening.
We've now witnessed evidence of records missing (so an insert being dropped) from other tables, which suggests this isn't a limited-scope problem. That makes sense, as we could never explain why it would be. Unfortunately, we're also no closer to explaining why this is happening at all.
@rascalDan did you end up finding a solution? e.g. adding some configs such as `replica_batch_size = 1000000`?
@FreCap I'm afraid not... we never got to the bottom of the problem and currently have a second process comparing/repairing data that's been missed... which I'd love to get rid of! (but which we're currently extending to cover other tables)
Could you please share a fully runnable SQL and pg_chameleon example to reproduce this, so we can check on our side as well?
I saw your previous description, but to get everybody on the same page and allow the issue to be reproduced, I think a full example would be instrumental in finding a way to remove the compare/repair process (which scares me quite a bit).
Honestly, not really. Most of our code is just `insert into foo(...) values(?, ?, ?)` or `update foo set bar = ? where bar = ?`... and sometimes they don't get replicated. I appreciate that our problem might not be directly related to the batch-size problem.