[Bug]: 疑似多列联合主键导致偶发数据库连接失败 || [Bug]: Suspected multi-column joint primary key causes occasional database connection failure
Describe the bug æä»¬æ2ä¸ªæ±æ»ç»è®¡è¡¨ï¼è¡¨ç»æå¤§è´å¦ä¸ï¼
CREATE TABLE `order_stat_hourly` (
`merchant_id` int unsigned NOT NULL,
`period` timestamp NOT NULL COMMENT 'æ¯ä¸ªå¨æçå¼å§æ¶é´ç¹ï¼ä¹å°±æ¯æ¯ä¸ªå°æ¶çæ´ç¹',
`user_id` int unsigned NOT NULL,
`pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
`vendor` varchar(4) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`product_type` tinyint unsigned NOT NULL,
`product_id` smallint unsigned NOT NULL,
`status` tinyint NOT NULL DEFAULT '1',
UNIQUE KEY `uk_period_merchantId_userId_vendor_productId` (`period`,`merchant_id`,`user_id`,`vendor`,`product_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (UNIX_TIMESTAMP( `period` ))
(
PARTITION P20240326 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-27'))
, PARTITION P20240327 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-28'))
, PARTITION P20240328 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-29'))
);
CREATE TABLE `order_stat_daily` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`merchant_id` int unsigned NOT NULL,
`period` timestamp NOT NULL COMMENT 'æ¯ä¸ªå¨æçå¼å§æ¶é´ç¹ï¼ä¹å°±æ¯æ¯å¤©ç0ç¹',
`user_id` int unsigned NOT NULL,
`pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
`status` tinyint NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `uk_period_merchantId_userId` (`period`,`merchant_id`,`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
order_stat_hourly 表 ç¨äºæå°æ¶æ±æ»è®¢åæ°æ®ï¼è¯¥è¡¨ä½¿ç¨ period, merchant_id, user_id, vendor, product_id ä½ä¸ºèå主é®ï¼å¹¶æå¤©ååºã
SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
å½æä»¬ä½¿ç¨è¯¥è¡¨è¿è¡å¦ä¸æ¥è¯¢æ¶ï¼åºç¨ç¨åºå¶åä¼è§¦å以ä¸å¼å¸¸ï¼å¶åçé¢çå¤§æ¦æ¯æ¯åé 1~2 次ã
ç¶èï¼å¨å
¶ä»æææ¡ä»¶é½ä¸åçæ
åµä¸ï¼æä»¬å°ä¸è¿°SQLä¸çæ°æ®è¡¨æ¹ä¸º order_stat_daily æ¶ï¼è¯¥å¼å¸¸å°±ä¸å触åï¼ä¹å°±æ¯æ¢å¤æ£å¸¸äºã
æ ¹æ®ä¸¤ä¸ªè¡¨çç»æå·®å¼ï¼æä»¬æç大æ¦çæ¯å 为 order_stat_hourly 表æ¯ç¨ 5ä¸ªåæ®µ ä½ä¸ºèåä¸»é® çç¼æ
ã
Environment
- OS Version: CentOS 8.2 x86_64
- OB Version:
OceanBase_CE_V4.2.1.7 - JDBC 驱卿¯
com.mysql:mysql-connector-j:8.4.0ã
Expected behavior 叿ä¸åæ£å¸¸ï¼ä¸åå¶åæ¥éã
Actual Behavior
org.springframework.dao.RecoverableDataAccessException:
### Error querying database. Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
### The error occurred while setting parameters
### SQL: SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
### Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.; nested exception is com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:100)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:165)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:55)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:58)
at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:72)
Additional context
åºç¨ç¨åºä½¿ç¨çæ¯ HikariCP çè¿æ¥æ± ï¼3个 OBServer ææ 1-1-1 çé群ï¼3个OBProxyåå«é¨ç½²å¨ä¸3个OBServerç¸åçæºå¨ä¸ï¼å¹¶ä½¿ç¨ ELB å®ç° è´è½½åè¡¡ã
BTWï¼ ä½¿ç¨ order_stat_hourly 表è¿è¡ä¸è¿°æ¥è¯¢ï¼æ¯ä½¿ç¨ order_stat_daily 伿¶èæ´å¤ç æ°æ®åº CPU èµæºï¼ åªæ¥è¯¢ 1-3天å
çæ°æ®ï¼ ä½å¹¶ä¸ç®å¤ç order_stat_hourly 表æ¥è¯¢ å°±ä¼è®© CPU 满载 ï¼ã
æ¤å¤ï¼æä»¬è¿åç°ï¼å¨åºç¨ç¨åºç SQL 䏿·»å å¼±ä¸è´æ§è¯» ç SQL Hintï¼ä¾å¦ï¼SELECT /*+ READ_CONSISTENCY(WEAK) */ FROM table_name ï¼ ä»»ä½è¡¨é½å¯è½ä¼è§¦å ï¼ï¼ä¹å¤§æ¦çä¼è§¦åç¸åçé误ã
Describe the bug We have 2 summary statistics tables, the table structure is roughly as follows:
CREATE TABLE `order_stat_hourly` (
`merchant_id` int unsigned NOT NULL,
`period` timestamp NOT NULL COMMENT 'The starting time of each period, that is, the hour on the hour',
`user_id` int unsigned NOT NULL,
`pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
`vendor` varchar(4) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`product_type` tinyint unsigned NOT NULL,
`product_id` smallint unsigned NOT NULL,
`status` tinyint NOT NULL DEFAULT '1',
UNIQUE KEY `uk_period_merchantId_userId_vendor_productId` (`period`,`merchant_id`,`user_id`,`vendor`,`product_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (UNIX_TIMESTAMP( `period` ))
(
PARTITION P20240326 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-27'))
, PARTITION P20240327 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-28'))
, PARTITION P20240328 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-29'))
);
CREATE TABLE `order_stat_daily` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`merchant_id` int unsigned NOT NULL,
`period` timestamp NOT NULL COMMENT 'The starting time of each period, which is 0 o'clock every day',
`user_id` int unsigned NOT NULL,
`pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
`status` tinyint NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `uk_period_merchantId_userId` (`period`,`merchant_id`,`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The order_stat_hourly table is used to summarize order data by hour. The table uses period, merchant_id, user_id, vendor, product_id as the joint primary key and is partitioned by day.
SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
When we use this table to perform the above query, the application will trigger the following exception occasionally, and the frequency of the occurrence is about 1~2 times per minute.
However, when all other conditions remain unchanged, when we change the data table in the above SQL to order_stat_daily, the exception will no longer be triggered, that is, it will return to normal.
Based on the structural differences between the two tables, we suspect that it is most likely because the order_stat_hourly table uses 5 fields as the joint primary key.
Environment
- OS Version: CentOS 8.2 x86_64
- OB Version:
OceanBase_CE_V4.2.1.7 - The JDBC driver is
com.mysql:mysql-connector-j:8.4.0.
Expected behavior I hope everything goes fine and no more occasional errors occur.
Actual Behavior
org.springframework.dao.RecoverableDataAccessException:
### Error querying database. Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
### The error occurred while setting parameters
### SQL: SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
### Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.; nested exception is com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:100)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:165)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:55)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:58)
at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:72)
Additional context
The application uses HikariCP's connection pool. Three OBServers form a 1-1-1 cluster. Three OBProxy are deployed on the same machine as the three OBServers, and ELB is used to achieve load balancing.
BTW, using the order_stat_hourly table to perform the above query will consume more database CPU resources than using order_stat_daily (only querying data within 1-3 days, but not many order_stat_hourly table queries will fully load the CPU ).
In addition, we also found that adding weakly consistent read SQL Hint in the application's SQL, for example: SELECT /*+ READ_CONSISTENCY(WEAK) */ FROM table_name (any table may trigger), There is also a high probability that the same error will be triggered.
补充一点,order_stat_hourly 这个分区表加入了一个表组(会和表组内的其他表的 主副本 都聚集在同一个Zone,表组 SHARDING = 'NONE' )。
order_stat_daily 表没有加入任何表组。
不知道是否有这个也有关系,我尽量提供更多的差异信息。
In addition, the partition table order_stat_hourly is added to a table group (it will be gathered in the same Zone as the primary copies of other tables in the table group, table group SHARDING = 'NONE').
The order_stat_daily table is not joined to any table group.
Not sure if this is relevant, I'll try to provide more information on the differences.
order_stat_hourly 和 order_stat_daily 两表的数据量比例约为 4 : 1 。
30分钟内 查询 小时表 1000 多次,CPU占比就达到了 90+% 30分钟内 查询 天表 3000 多次,CPU占比才 2+%