oceanbase icon indicating copy to clipboard operation
oceanbase copied to clipboard

[Bug]: 疑似多列联合主键导致偶发数据库连接失败 || [Bug]: Suspected multi-column joint primary key causes occasional database connection failure

Open CodePlayer opened this issue 1 year ago • 2 comments

Describe the bug 我们有2个汇总统计表,表结构大致如下:

CREATE TABLE `order_stat_hourly` (
    `merchant_id` int unsigned NOT NULL,
    `period` timestamp NOT NULL COMMENT '每个周期的开始时间点,也就是每个小时的整点',
    `user_id` int unsigned NOT NULL,
    `pay_amount_sum` int unsigned NOT NULL DEFAULT '0',    
    `vendor` varchar(4) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
    `product_type` tinyint unsigned NOT NULL,
    `product_id` smallint unsigned NOT NULL,
    `status` tinyint NOT NULL DEFAULT '1',
    UNIQUE KEY `uk_period_merchantId_userId_vendor_productId` (`period`,`merchant_id`,`user_id`,`vendor`,`product_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    PARTITION BY RANGE (UNIX_TIMESTAMP( `period` ))
        (
        PARTITION P20240326 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-27'))
        , PARTITION P20240327 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-28'))
        , PARTITION P20240328 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-29'))
    );


CREATE TABLE `order_stat_daily` (
    `id` int unsigned NOT NULL AUTO_INCREMENT,
    `merchant_id` int unsigned NOT NULL,
    `period` timestamp NOT NULL COMMENT '每个周期的开始时间点,也就是每天的0点',
    `user_id` int unsigned NOT NULL,
    `pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
    `status` tinyint NOT NULL DEFAULT '1',
    PRIMARY KEY (`id`),
    UNIQUE KEY `uk_period_merchantId_userId` (`period`,`merchant_id`,`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

order_stat_hourly 表 用于按小时汇总订单数据,该表使用 period, merchant_id, user_id, vendor, product_id 作为联合主键,并按天分区。

SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?

当我们使用该表进行如上查询时,应用程序偶发会触发以下异常,偶发的频率大概是每分钟 1~2 次。 然而,在其他所有条件都不变的情况下,我们将上述SQL中的数据表改为 order_stat_daily 时,该异常就不再触发,也就是恢复正常了。

根据两个表的结构差异,我们怀疑大概率是因为 order_stat_hourly 表是用 5个字段 作为联合主键 的缘故。

Environment

  • OS Version: CentOS 8.2 x86_64
  • OB Version: OceanBase_CE_V4.2.1.7
  • JDBC 驱动是 com.mysql:mysql-connector-j:8.4.0 。

Expected behavior 希望一切正常,不再偶发报错。

Actual Behavior

org.springframework.dao.RecoverableDataAccessException: 
### Error querying database.  Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
### The error occurred while setting parameters
### SQL: SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
### Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.; nested exception is com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
	at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:100)
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)	
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
	at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:165)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:55)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
	at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:58)
	at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:72)

Additional context 应用程序使用的是 HikariCP 的连接池,3个 OBServer 构成 1-1-1 的集群,3个OBProxy分别部署在与3个OBServer相同的机器上,并使用 ELB 实现 负载均衡。

BTW, 使用 order_stat_hourly 表进行上述查询,比使用 order_stat_daily 会消耗更多的 数据库 CPU 资源( 只查询 1-3天内的数据, 但并不算多的 order_stat_hourly 表查询 就会让 CPU 满载 )。

此外,我们还发现,在应用程序的 SQL 中添加 弱一致性读 的 SQL Hint,例如:SELECT /*+ READ_CONSISTENCY(WEAK) */ FROM table_name ( 任何表都可能会触发 ),也大概率会触发相同的错误。


Describe the bug We have 2 summary statistics tables, the table structure is roughly as follows:

CREATE TABLE `order_stat_hourly` (
    `merchant_id` int unsigned NOT NULL,
    `period` timestamp NOT NULL COMMENT 'The starting time of each period, that is, the hour on the hour',
    `user_id` int unsigned NOT NULL,
    `pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
    `vendor` varchar(4) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
    `product_type` tinyint unsigned NOT NULL,
    `product_id` smallint unsigned NOT NULL,
    `status` tinyint NOT NULL DEFAULT '1',
    UNIQUE KEY `uk_period_merchantId_userId_vendor_productId` (`period`,`merchant_id`,`user_id`,`vendor`,`product_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    PARTITION BY RANGE (UNIX_TIMESTAMP( `period` ))
        (
        PARTITION P20240326 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-27'))
        , PARTITION P20240327 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-28'))
        , PARTITION P20240328 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-29'))
    );


CREATE TABLE `order_stat_daily` (
    `id` int unsigned NOT NULL AUTO_INCREMENT,
    `merchant_id` int unsigned NOT NULL,
    `period` timestamp NOT NULL COMMENT 'The starting time of each period, which is 0 o'clock every day',
    `user_id` int unsigned NOT NULL,
    `pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
    `status` tinyint NOT NULL DEFAULT '1',
    PRIMARY KEY (`id`),
    UNIQUE KEY `uk_period_merchantId_userId` (`period`,`merchant_id`,`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The order_stat_hourly table is used to summarize order data by hour. The table uses period, merchant_id, user_id, vendor, product_id as the joint primary key and is partitioned by day.

SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?

When we use this table to perform the above query, the application will trigger the following exception occasionally, and the frequency of the occurrence is about 1~2 times per minute. However, when all other conditions remain unchanged, when we change the data table in the above SQL to order_stat_daily, the exception will no longer be triggered, that is, it will return to normal.

Based on the structural differences between the two tables, we suspect that it is most likely because the order_stat_hourly table uses 5 fields as the joint primary key.

Environment

  • OS Version: CentOS 8.2 x86_64
  • OB Version: OceanBase_CE_V4.2.1.7
  • The JDBC driver is com.mysql:mysql-connector-j:8.4.0.

Expected behavior I hope everything goes fine and no more occasional errors occur.

Actual Behavior

org.springframework.dao.RecoverableDataAccessException:
### Error querying database. Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
### The error occurred while setting parameters
### SQL: SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
### Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.; nested exception is com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:100)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:165)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:55)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:58)
at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:72)

Additional context The application uses HikariCP's connection pool. Three OBServers form a 1-1-1 cluster. Three OBProxy are deployed on the same machine as the three OBServers, and ELB is used to achieve load balancing.

BTW, using the order_stat_hourly table to perform the above query will consume more database CPU resources than using order_stat_daily (only querying data within 1-3 days, but not many order_stat_hourly table queries will fully load the CPU ).

In addition, we also found that adding weakly consistent read SQL Hint in the application's SQL, for example: SELECT /*+ READ_CONSISTENCY(WEAK) */ FROM table_name (any table may trigger), There is also a high probability that the same error will be triggered.

CodePlayer avatar Jul 24 '24 13:07 CodePlayer

补充一点,order_stat_hourly 这个分区表加入了一个表组(会和表组内的其他表的 主副本 都聚集在同一个Zone,表组 SHARDING = 'NONE' )。

order_stat_daily 表没有加入任何表组。

不知道是否有这个也有关系,我尽量提供更多的差异信息。


In addition, the partition table order_stat_hourly is added to a table group (it will be gathered in the same Zone as the primary copies of other tables in the table group, table group SHARDING = 'NONE').

The order_stat_daily table is not joined to any table group.

Not sure if this is relevant, I'll try to provide more information on the differences.

CodePlayer avatar Jul 25 '24 02:07 CodePlayer

image order_stat_hourly order_stat_daily 两表的数据量比例约为 4 : 1 。

30分钟内 查询 小时表 1000 多次,CPU占比就达到了 90+% 30分钟内 查询 天表 3000 多次,CPU占比才 2+%

CodePlayer avatar Jul 25 '24 08:07 CodePlayer