python-mysql-replication icon indicating copy to clipboard operation
python-mysql-replication copied to clipboard

program stuck in __memcpy_ssse3_back , cpu 100%

Open nkr-404-NotFound opened this issue 2 years ago • 16 comments

Hi, I m using the project to do mysql binlog to redis thing.

but in some cases, my cpu goes up to 100% and the sync process seems to hang.

i did some digging and found the stack always stop at

Program received signal SIGINT, Interrupt. 0x00007fe57a06e698 in __memcpy_ssse3_back () from /lib64/libc.so.6 (gdb) py-bt Traceback (most recent call first): File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 148, in read self.__data_buffer = self.__data_buffer[size:] File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 266, in read_variable_length_string byte = byte2int(self.read(1)) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 361, in read_binary_json_type return self.read_variable_length_string() File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 463, in _read return self.read_binary_json_type(x[0], length) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 465, in return [_read(x) for x in values_type_offset_inline] File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 465, in read_binary_json_array return [_read(x) for x in values_type_offset_inline] File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 359, in read_binary_json_type return self.read_binary_json_array(length - 1, large) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py", line 352, in read_binary_json return self.read_binary_json_type(t, length) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/row_event.py", line 181, in _read_column_data values[name] = self.packet.read_binary_json(column.length_size) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/row_event.py", line 447, in _fetch_one_row row["values"] = self._read_column_data(self.columns_present_bitmap) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/row_event.py", line 422, in _fetch_rows self.__rows.append(self._fetch_one_row()) File "/usr/local/lib/python3.7/site-packages/pymysqlreplication/row_event.py", line 427, in rows self._fetch_rows() File "/data/release/nebula_cmdb/server/exec/synccache.py", line 395, in del_cache for row in binlogevent.rows:

does anyone have the same issue ? any help is appreciated ~

nkr-404-NotFound avatar Jun 27 '23 02:06 nkr-404-NotFound

version : 0.21

nkr-404-NotFound avatar Jun 27 '23 03:06 nkr-404-NotFound

in database there are lots of json field and field-data is very long, maybe 500-1000 bytes ?

nkr-404-NotFound avatar Jun 27 '23 03:06 nkr-404-NotFound

(gdb) 149 def read(self, size): 150 size = int(size) 151 self.read_bytes += size 152 if len(self.__data_buffer) > 0: 153 data = self.__data_buffer[:size]

154 self.__data_buffer = self.__data_buffer[size:] 155 if len(data) == size: 156 return data 157 else: 158 return data + self.packet.read(size - len(data)) 159 return self.packet.read(size) (gdb) c Continuing.

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

image

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

in this case, the DelRowEvent buffer len is very long , nearly 1700 KB

image

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

i'v tried 0.42 version, same issue.

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

after a long time of hang, the current process will be aborted and binlogStream will be reset back to 30 min before, and consume them all over again ( until the BIG json event and stuck again )

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

as here

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

2023-06-27 14:29:18,854 [INFO] server_num_total : 49650 2023-06-27 14:29:18,854 [INFO] current time : 2023-06-27T14:29:18.854472binlog time : 2023-06-27T14:17:48 2023-06-27 14:34:10,927 [INFO] server_num_total : 49660 2023-06-27 14:34:10,927 [INFO] current time : 2023-06-27T14:34:10.927750binlog time : 2023-06-27T14:17:54 2023-06-27 14:34:11,477 [INFO] server_num_total : 49670 2023-06-27 14:34:11,477 [INFO] current time : 2023-06-27T14:34:11.477635binlog time : 2023-06-27T13:45:35 2023-06-27 14:34:11,550 [INFO] server_num_total : 49680 2023-06-27 14:34:11,551 [INFO] current time : 2023-06-27T14:34:11.551199binlog time : 2023-06-27T13:45:35

super BIG json at 14:17 , after hung for 5min, the next binlog time jumped back to 13:45...

very strange.

nkr-404-NotFound avatar Jun 27 '23 06:06 nkr-404-NotFound

currently, the only way i can restore from this situation is restarting the whole thing

afterwards the binlog offset will be move next to current timestamp and hope not encounter another BIG json event again ...

nkr-404-NotFound avatar Jun 27 '23 08:06 nkr-404-NotFound

if data cache is very long ,

byte = byte2int(self.read(1))

will move the data buffer only 1 byte, cause lots of perf loss here

nkr-404-NotFound avatar Jun 28 '23 02:06 nkr-404-NotFound

found a workaround .. i'v shrink the upstream sql op, batch values from 5000 down to 50

thus can reduce the binlog size ( with more binlog records )

seems work fine ( for a few hours now )

nkr-404-NotFound avatar Jun 28 '23 03:06 nkr-404-NotFound

maybe in some case if data_buffer is very long, should empty it after current op.

otherwise , next read op will consume extreme long time and no way to recover.

nkr-404-NotFound avatar Jun 28 '23 07:06 nkr-404-NotFound

maybe there is some logic error in READ func and the loop always here . around

/usr/local/lib/python3.7/site-packages/pymysqlreplication/packet.py

nkr-404-NotFound avatar Jun 28 '23 09:06 nkr-404-NotFound

image

nkr-404-NotFound avatar Jun 28 '23 10:06 nkr-404-NotFound

@nankingrider hello? can you run code on this PR

sean-k1 avatar Sep 22 '23 07:09 sean-k1