
Data lost when exceeding batch size.

wycwyx opened this issue 4 years ago · 10 comments

Describe the bug: Data is lost when the number of inserted rows exceeds the batch size.

To Reproduce

  1. set "replica_batch_size :10000" in configure file.
  2. insert 18000 rows in the source database.
  3. chameleon will replica only 10000, and 8000 rows lost.
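
For reference, replica_batch_size sits under the source definition in pg_chameleon's YAML configuration. The excerpt below is only a sketch with most keys omitted; the exact layout may differ between versions.

```yaml
# Sketch of a pg_chameleon source configuration (most keys omitted).
sources:
  mysql:
    db_conn:
      host: "localhost"
      port: "3306"
      user: "usr_replica"
      password: "never_commit_passwords"
    # maximum number of rows grouped into a single replica batch
    replica_batch_size: 10000
    type: mysql
```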

After debugging the code (v2.0.12 or master), I found this snippet in mysql_lib.py:

```python
# mysql_lib.py, lines 1194-1196
if close_batch:
    my_stream.close()
    return [master_data, close_batch]
```

According to this code, the rows still buffered in group_insert are never stored.

I changed those lines to:

```python
# mysql_lib.py, lines 1194-1195 after the change
if close_batch:
    break
```

Then the result was correct.
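
To make the effect concrete, here is a minimal, self-contained sketch of the pattern being described. It is not pg_chameleon code and all names in it are made up for illustration; it only shows why a return inside the event loop drops the partial batch, while a break lets the trailing flush run.

```python
# Minimal illustration only; this is NOT pg_chameleon code. The string
# "QUERY_CLOSE_BATCH" stands in for whatever event sets close_batch, and
# write_batch() stands in for storing a batch of rows.
def process_events(events, batch_size, use_break):
    written = []          # rows that actually get stored
    group_insert = []     # rows buffered since the last flush

    def write_batch(rows):
        written.extend(rows)

    for event in events:
        if event == "QUERY_CLOSE_BATCH":
            if use_break:
                break         # proposed fix: fall through to the trailing flush
            return written    # current behaviour: group_insert is silently dropped
        group_insert.append(event)
        if len(group_insert) >= batch_size:
            write_batch(group_insert)   # a full batch is flushed inside the loop
            group_insert = []

    if group_insert:                    # trailing flush of the partial batch
        write_batch(group_insert)
    return written

events = [f"row {i}" for i in range(18000)] + ["QUERY_CLOSE_BATCH"]
print(len(process_events(events, 10000, use_break=False)))  # 10000: 8000 rows lost
print(len(process_events(events, 10000, use_break=True)))   # 18000: nothing lost
```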

wycwyx · May 21 '20 09:05

I wasn't able to reproduce the bug, as the rows you mentioned seem already OK. I've just merged some changes into master/ver2.0. Can you try again please?

the4thdoctor · Jul 25 '20 06:07

In the current version (tag v2.0.14), lines 1266 and 1267 should be a single statement: break. That lets the code block beginning at line 1375 run, which writes the batch remaining in group_insert. Otherwise, with the current lines 1266 and 1267, group_insert is never written, and the next time __read_replica_stream is called, group_insert is reassigned to an empty list at line 1144.

wycwyx · Aug 14 '20 03:08

I'm trying to avoid saying "me too", because I'm struggling to understand the scope of the problem we're witnessing. It appears that "some" rows in one table "sometimes" aren't updated when the source rows are. The batch sizes in question are far lower than our maximum batch size (10000 in the config; in practice they tend to be under 200). We haven't been able to identify the criteria for failure yet, and we're not entirely sure it's limited to one table (that wouldn't make sense, though the table does contain a large amount of text, so that might be an issue somehow). I'm actively looking into the problem as we speak and hope to have more concrete information later today.

rascalDan · Mar 24 '21 10:03

OK... I think I might have found the problem, and I agree with @wycwyx about the fix, but not the cause. My understanding is that this flow may occur (a simplified sketch follows the list):

  • Some rows are added to group_insert at line 1398, but are not yet written at lines 1405-1412.
  • We loop around to the next binlog event at line 1219.
  • The next event is a QueryEvent at line 1250 that also satisfies the condition at line 1257.
  • The function returns at line 1308 without ever writing the contents of group_insert.
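
If that reading is correct, the loss wouldn't require the batch size to be exceeded at all: any rows still buffered in group_insert when such a QueryEvent arrives would be dropped. Below is a small, self-contained sketch of that scenario; the names and event encoding are hypothetical, not taken from pg_chameleon.

```python
# Hypothetical sketch of the flow listed above, not pg_chameleon code:
# rows are buffered, then a query event that closes the batch triggers an
# early return before the buffered rows are ever written.
def replica_stream(events, batch_size=10000):
    written, group_insert = [], []
    for kind, payload in events:
        if kind == "query" and payload == "closes_batch":   # steps 3 and 4 above
            return written                                   # group_insert is lost here
        if kind == "row":                                    # step 1 above
            group_insert.append(payload)
            if len(group_insert) >= batch_size:              # never reached in this scenario
                written.extend(group_insert)
                group_insert = []
    if group_insert:                                         # trailing flush
        written.extend(group_insert)
    return written

# ~150 buffered rows, well under the batch size, followed by one query event:
events = [("row", i) for i in range(150)] + [("query", "closes_batch")]
print(len(replica_stream(events)))   # prints 0: every buffered row was dropped
```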

rascalDan · Mar 24 '21 11:03

I've managed to get a copy of the binary log for an example of this occurring in our production environment, and whilst I think my previous point holds true, the log doesn't support it being the cause. I'd like to share the log, but it's several hundred megabytes and contains personal information which I cannot disclose.

What I see is very strange, though: there are two updates to the table in question, of which the first wasn't applied but the second was... then some updates to a schema that isn't replicated to PostgreSQL... then another update to the same table, not applied... and then an insert into a different table, which was applied.

What I don't see in the log is anything between the successful statements and the failing ones that would be classed as a QueryEvent and trigger the above flow. On that note, the condition on line 1257 only matters if close_batch was previously false; if close_batch was already true, a QueryEvent would drop the existing contents of group_insert regardless.

Is it possible that mysqlbinlog is lying to me and hiding QueryEvents? I didn't think these would be logged at all, but maybe the log includes things I'm not thinking of and mysqlbinlog hides them by default.

I dunno... I'm really confused by this, so apologies for brain dumping on here; I'm just sharing what I'm finding. We have a steady trickle of the problem in our production environment, apparently affecting only one table, and I can't for the life of me recreate it outside of that... and all the tools, logs, statuses, etc. suggest it's not happening.

rascalDan · Mar 24 '21 17:03

We've now witnessed evidence of records missing (so an insert being dropped) from other tables, which suggests this isn't a limited-scope problem. That makes sense, as we could never explain why it would be. Unfortunately, we're also no closer to explaining why this is happening at all.

rascalDan · Apr 19 '21 14:04

@rascalDan did you end up finding a solution? e.g. adding some configs such as replica_batch_size = 1000000

FreCap · Apr 27 '22 13:04

@FreCap I'm afraid not... we've never got to the bottom of the problem and currently have a second process comparing/repairing data that's been missed... which I'd love to get rid of! (but which we're currently extending to cover other tables)

rascalDan · Apr 29 '22 16:04

Could you please share a fully runnable SQL and pg_chameleon example to reproduce this, so I can check on my side as well?

I saw your previous description, but to get everybody on the same page and allow reproducing it, I think a runnable example would be instrumental in finding a way to remove the compare/repair process (which scares me quite a bit).

FreCap · Apr 29 '22 17:04

Honestly, not really. Most of our code is just `insert into foo(...) values (?, ?, ?)` or `update foo set bar = ? where bar = ?`... and sometimes those statements don't get replicated. I appreciate that our problem might not be directly related to the batch-size problem.

rascalDan · May 03 '22 10:05