pg_chameleon icon indicating copy to clipboard operation
pg_chameleon copied to clipboard

Crash in read_replica due to decoding byte 0x9d

Open pyrite357 opened this issue 4 years ago • 9 comments

Describe the bug Fresh install of 2.0.16, started replica, and 10 hours later, randomly crashed decoding something in the source

To Reproduce I'm not sure what table/row/col it had trouble decoding, is there a way for me to tell?

Expected behavior This same datasource wasn't having this trouble in 2.0.12 of pg_chameleon, but it could be new data entered, i'm not sure

I also expected to see something in the error log, but nothing shows up when running chameleon show_errors

Screenshots If applicable, add screenshots to help explain your problem.

(venv) [user@host ~]$ chameleon start_replica --config default --source rpmv2 --debug
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:08 MainProcess INFO global_lib.py (613): Checking if the replica for source rpmv2 is stopped 
2020-09-28 12:39:08 MainProcess INFO global_lib.py (623): Cleaning not processed batches for source rpmv2
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (3238): Cleaning table t_log_replica_rpmv2_1
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (3238): Cleaning table t_log_replica_rpmv2_2
2020-09-28 12:39:08 MainProcess INFO global_lib.py (554): Starting the replica daemons for source rpmv2 
2020-09-28 12:39:08 MainProcess DEBUG global_lib.py (563): Replica process for source rpmv2 is running
2020-09-28 12:39:08 read_replica DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:08 MainProcess DEBUG pg_lib.py (1276): Cleaning replayed batches for source rpmv2 older than 1 day
2020-09-28 12:39:08 read_replica DEBUG pg_lib.py (679): There is already a database connection active.
2020-09-28 12:39:08 read_replica DEBUG pg_lib.py (3110): Collecting schema mappings for source rpmv2
2020-09-28 12:39:08 replay_replica DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:08 read_replica DEBUG mysql_lib.py (1452): Batch data [(387, 'rpmv2-bin-logs.000001', 36208279, 't_log_replica_rpmv2_1', None)] 
2020-09-28 12:39:08 read_replica DEBUG mysql_lib.py (1000): collecting table type map
2020-09-28 12:39:08 read_replica DEBUG mysql_lib.py (1217): GTID DISABLED - log_file rpmv2-bin-logs.000001, log_position 36208279. id_batch: 387 
2020-09-28 12:39:08 read_replica DEBUG mysql_lib.py (1231): ROTATE EVENT - binlogfile rpmv2-bin-logs.000001, position 36208279. 
2020-09-28 12:39:09 MainProcess ERROR global_lib.py (567): Read process alive: False - Replay process alive: True
2020-09-28 12:39:09 MainProcess ERROR global_lib.py (568): Stack trace: Traceback (most recent call last):
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pg_chameleon/lib/global_lib.py", line 496, in read_replica
    self.mysql_source.read_replica()
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pg_chameleon/lib/mysql_lib.py", line 1453, in read_replica
    replica_data=self.__read_replica_stream(batch_data)
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pg_chameleon/lib/mysql_lib.py", line 1311, in __read_replica_stream
    for row in binlogevent.rows:
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pymysqlreplication/row_event.py", line 433, in rows
    self._fetch_rows()
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pymysqlreplication/row_event.py", line 428, in _fetch_rows
    self.__rows.append(self._fetch_one_row())
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pymysqlreplication/row_event.py", line 519, in _fetch_one_row
    row["after_values"] = self._read_column_data(self.columns_present_bitmap2)
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pymysqlreplication/row_event.py", line 136, in _read_column_data
    values[name] = self.__read_string(column.length_size, column)
  File "/opt/data/rpmfsadm/venv/lib64/python3.6/site-packages/pymysqlreplication/row_event.py", line 224, in __read_string
    string = string.decode(encoding)
  File "/usr/lib64/python3.6/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 60: character maps to <undefined>

2020-09-28 12:39:09 MainProcess ERROR global_lib.py (574): Read daemon crashed. Terminating the replay daemon.
2020-09-28 12:39:09 MainProcess DEBUG pg_lib.py (659): Changing the autocommit flag to True
2020-09-28 12:39:09 MainProcess INFO global_lib.py (599): Replica process for source rpmv2 ended

Environment(please complete the following information):

  • OS: CentOS 8.2
  • MySQL Version: MariaDB Server version: 10.1.19-MariaDB - mariadb.org binary distribution
  • PostgreSQL Version: 12.4
  • Python Version: 3.6.8
  • Cloud hosted database: local install

Additional context Add any other context about the problem here.

pyrite357 avatar Sep 28 '20 19:09 pyrite357

More info:

(venv) [user@host ~]$ pip list
Package           Version
----------------- ----------
certifi           2019.11.28
chardet           3.0.4
daemonize         2.5.0
idna              2.9
mysql-replication 0.22
pg-chameleon      2.0.16
pip               20.2.3
psycopg2-binary   2.8.4
PyMySQL           0.10.1
PyYAML            5.3
requests          2.23.0
rollbar           0.14.7
setuptools        39.2.0
six               1.14.0
tabulate          0.8.6
urllib3           1.25.8
wheel             0.35.1

pyrite357 avatar Sep 28 '20 19:09 pyrite357

the offending character is 0x9d which should be the right double quotation mark ” what's the database/table character set? it could be a setting ignored by the older pymysql and now enforced by the 0.10.

the4thdoctor avatar Oct 01 '20 06:10 the4thdoctor

I seem to have fixed my problem by re-initing the source, then enable, then start. But I'm waiting to see if it crashes again in 10 hours time or not. Please leave this issue open until I confirm :)


I kind of figured out how to view the binlog in phpMyAdmin, although MySQL comes with a whole utility (mysqlbinlog) to do this too, which I didn't try, but the SQL is:

SHOW BINLOG EVENTS FROM x

Where x corresponds to the position in the debug line below (x = 36208279)

2020-09-28 12:39:08 read_replica DEBUG mysql_lib.py (1452): Batch data [(387, 'rpmv2-bin-logs.000001', 36208279, 't_log_replica_rpmv2_1', None)]

It would be nice if the mysql_lib told me what table it was that had the offending data, but since it does not, this is how you can go find it. Leaving this info here for anyone who may have the need to investigate this in the future.

pyrite357 avatar Oct 05 '20 23:10 pyrite357

cool, thanks for the feedback. I'm still convinced it's an encoding issue added in the recent pymysql update. I'll try to reproduce the problem and eventually come with a way to trap the error.

the4thdoctor avatar Oct 06 '20 06:10 the4thdoctor

I'm not sure if it's related, but my MySQL source has 2 databases it is replicating, and in one of them I found a table with a tinyint(1) column that had a value of 2, and my config has type_override on. I had seen some errors in the PostgreSQL logs regarding this, which I think was throwing errors when I did init_replica or start_replica. I have now changed that column to a normal integer. This might have just been a completely separate problem, I'm not sure. But it fixed that particular problem.

You had asked about the charset though. In the default.yml, it's set to utf8 for charset under db_conn. The source has 2 databases though (schema_mappings), and each one has its own collation. database1 has latin1_swedish_ci, but also has some tables with utf8_general_ci collations also.

database1 is latin1/latin1_swedish_ci for the database itself. database2 has 25 tables, 24 of them are utf8_general_ci and 1 of them is latin1_swedish_ci. The charset of the database itself is utf8 though.

One possible area of trouble is only being able to define a MySQL source as one particular charset, but then having that source have multiple databases each with a different charset than the one defined in the default.yml config for the source db_conn. So if you are right about it being an encoding issue, could that be why perhaps?

pyrite357 avatar Nov 06 '20 15:11 pyrite357

Thanks for your feedback. The issue with the override to boolean is normal. boolean ca store only true/false and postgresql is very strict about that. The override was implemented to have the tinyint translated if the field is used as a boolean (if i correctly remember MySQL 5.6 doesn't have boolean type). If you use tinyint to store integers just remove the override.

For the encoding, as the error is caused by the driver PyMySQL during the replica process. I can check if there is something new implemented in PyMySQL that caused the issue with the newer library. Thanks for now.

the4thdoctor avatar Nov 07 '20 10:11 the4thdoctor

Still having this issue.

image

Have you done any investigative work since Nov 7 @the4thdoctor ? Happy New Year btw!

pyrite357 avatar Jan 06 '21 03:01 pyrite357

Good news, I can reproduce it now, it is a fancy quote causing the issue. The charset of the table in question is latin1_swedish_ci, however the majority of that source database are tables using utf8_general_ci. In pg_chameleon, the source charset is set to utf8. I'm going to work on seeing what I can do to fix the charsets in my source database.

I don't mind taking responsibility for my source database and the poor normalization it had, however it would be an improvement to pg_chameleon to fail gracefully when/if this happens (and not just crash), maybe even write something to the error log, and skip over a transaction, but keep running, that way way production isn't completely down/broken.

pyrite357 avatar Jan 06 '21 21:01 pyrite357

@the4thdoctor Even though it was a fancy quote in plain view, it was in fact the ✔️ character in full. The source MySQL database had a mix of latini1_swedish_ci and utf8 collations, and some ascii_general_ci ones too. But the pg_chameleon default.yml file can only define one characterset for the source database. I fixed it by updating all the tables/columns in the source MySQL database to utf8_general_ci, and now that character no longer causes pg_chameleon to stop/break.

pyrite357 avatar May 03 '22 23:05 pyrite357