client-java icon indicating copy to clipboard operation
client-java copied to clipboard

[bug] RegionCDCClient grpc stream don't receive ChangeDataEvent from tikv store after several hours,because tikv region transfer leader

Open xieyi888 opened this issue 3 years ago • 9 comments

Bug Report

1. Describe the bug

RegionCDCClient grpc stream don't receive ChangeDataEvent from tikv server after several hours,because tikv region transfer leader

2. Minimal reproduce step (Required)

1.prepare a job: use CDCClient to get Tidb table's changeDateEvent from tikv server

table : REGION_ID:1364581 LEADER_STORE_ID: 3544963

mysql> show table test_table regions \G;
*************************** 1. row ***************************
           REGION_ID: 1364581
           START_KEY: t_853_
             END_KEY: t_857_
           LEADER_ID: 3697249
     LEADER_STORE_ID: 3544963
               PEERS: 3587852, 3697249, 3753707
          SCATTERING: 0
       WRITTEN_BYTES: 0
          READ_BYTES: 0
APPROXIMATE_SIZE(MB): 37
    APPROXIMATE_KEYS: 386258
1 row in set (0.00 sec)

2. run job

3. after several hours or day, the tidb table region transfer leader

table: REGION_ID: 1364581 , LEADER_STORE_ID: 3544961

mysql> show tabletest_table regions \G;
*************************** 1. row ***************************
           REGION_ID: 1364581
           START_KEY: t_853_
             END_KEY: t_857_
           LEADER_ID: 3753707
     LEADER_STORE_ID: 3544961
               PEERS: 3587852, 3697249, 3753707
          SCATTERING: 0
       WRITTEN_BYTES: 0
          READ_BYTES: 0
APPROXIMATE_SIZE(MB): 37
    APPROXIMATE_KEYS: 386258
1 row in set (0.00 sec)

log in tikv server It shows that [region_id=1364581] has transfer leader[from_peer="id: 3697249 store_id: 3544963"] [to_peer="id: 3753707 store_id: 3544961"]

[2022/06/28 20:11:27.878 +08:00] [INFO] [pd.rs:930] ["try to transfer leader"] [to_peer="id: 3753707 store_id: 3544961"] [from_peer="id: 3697249 store_id: 3544963"] [region_id=1364581]
[2022/06/28 20:11:27.878 +08:00] [INFO] [peer.rs:2627] ["transfer leader"] [peer="id: 3753707 store_id: 3544961"] [peer_id=3697249] [region_id=1364581]
[2022/06/28 20:11:27.878 +08:00] [INFO] [raft.rs:1868] ["[term 406] starts to transfer leadership to 3753707"] [lead_transferee=3753707] [term=406] [raft_id=3697249] [region_id=1364581]
[2022/06/28 20:11:27.878 +08:00] [INFO] [raft.rs:1881] ["sends MsgTimeoutNow to 3753707 immediately as 3753707 already has up-to-date log"] [lead_transferee=3753707] [raft_id=3697249] [region_id=1364581]
[2022/06/28 20:11:27.879 +08:00] [INFO] [raft.rs:1332] ["received a message with higher term from 3753707"] ["msg type"=MsgRequestVote] [message_term=407] [term=406] [from=3753707] [raft_id=3697249] [region_id=1364581]
[2022/06/28 20:11:27.879 +08:00] [INFO] [raft.rs:1088] ["became follower at term 407"] [term=407] [raft_id=3697249] [region_id=1364581]

4. CDCClient program log shows ,it don't receive ChangeDataEvent from tikv server

After 2022-06-28 20:11:27(the time that tikv region transfer leader): (1) client don't submit event(see log below). (2) RegionCDCClient.onNext never be called (I use arthas stack, watch ,find that onNext don't be called anymore ) https://github.com/tikv/client-java/blob/release-3.2/src/main/java/org/tikv/cdc/RegionCDCClient.java#L173 (3) streamId=19(at start client request for ChangeDataEvent) don`t receive message.(see log below) log in CDCClient program

//comment by author: streamId=19,client can receive message from tikv server,then submit event,handle resolvedTs
2022-06-28 19:42:45,943 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND PING: ack=true bytes=1234
2022-06-28 19:42:45,997 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND DATA: streamId=19 padding=0 endStream=false length=21 bytes=0000000010120e08e5a453109580b0d9be8faa8306
2022-06-28 19:42:45,997 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] OUTBOUND PING: ack=false bytes=1234
2022-06-28 19:42:45,998 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND PING: ack=true bytes=1234
2022-06-28 19:42:45,998 DEBUG org.tikv.cdc.RegionCDCClient                                 [] - submit event: CDCEvent[RESOLVED_TS] {resolvedTs=434219664267804693}
2022-06-28 19:42:45,999 INFO  org.tikv.cdc.CDCClient                                       [] - handle resolvedTs: 434219664267804693, regionId: 1364581
......
......
......

//comment by author: the last time streamId=19,client receive message from tikv server,After this,it don`t receive any message,but endStream=false.

2022-06-28 20:11:27,849 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND DATA: streamId=19 padding=0 endStream=false length=21 bytes=0000000010120e08e5a45310a480d098d09caa8306
2022-06-28 20:11:27,849 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] OUTBOUND PING: ack=false bytes=1234
2022-06-28 20:11:27,849 DEBUG org.tikv.cdc.RegionCDCClient                                 [] - submit event: CDCEvent[RESOLVED_TS] {resolvedTs=434220115640451108}
2022-06-28 20:11:27,849 INFO  org.tikv.cdc.CDCClient                                       [] - handle resolvedTs: 434220115640451108, regionId: 1364581
2022-06-28 20:11:27,850 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND PING: ack=true bytes=1234
2022-06-28 20:11:27,878 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/${client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND DATA: streamId=19 padding=0 endStream=false length=21 bytes=00000000100a0e08e5a45338012a060a0408e5a453

//although streamId=19 don`t receive message(submit event......),client can request grpc.health.v1.Health/Check and receive  response  from tikv server
2022-06-28 20:11:27,938 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] OUTBOUND HEADERS: streamId=42571 headers=GrpcHttp2OutboundHeaders[:authority: ${tikv_ip1}:40160, :path: /grpc.health.v1.Health/Check, :method: POST, :scheme: http, content-type: application/grpc, te: trailers, user-agent: grpc-java-netty/1.38.0, grpc-accept-encoding: gzip, grpc-timeout: 99986493n] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
2022-06-28 20:11:27,938 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] OUTBOUND DATA: streamId=42571 padding=0 endStream=true length=5 bytes=0000000000
2022-06-28 20:11:27,939 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND HEADERS: streamId=42571 headers=GrpcHttp2ResponseHeaders[:status: 200, content-type: application/grpc, grpc-accept-encoding: identity,deflate,gzip, accept-encoding: identity,gzip] padding=0 endStream=false
2022-06-28 20:11:27,939 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND DATA: streamId=42571 padding=0 endStream=false length=7 bytes=00000000020801
2022-06-28 20:11:27,939 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] OUTBOUND PING: ack=false bytes=1234
2022-06-28 20:11:27,939 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND HEADERS: streamId=42571 headers=GrpcHttp2ResponseHeaders[grpc-status: 0, grpc-message: ] padding=0 endStream=true
2022-06-28 20:11:27,939 DEBUG org.tikv.shade.io.grpc.netty.NettyClientHandler              [] - [id: 0xf889fe77, L:/{client_ip}:36854 - R:/${tikv_ip1}:40160] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=5

3. What did you see instead (Required)

(1) RegionCDCClient get ChangeDataEvent from tikv server based on grpc service: ChangeData protobuf: https://github.com/pingcap/kvproto/blob/master/proto/cdcpb.proto#L170 (2) it's Bidirectional streaming RPC. I find the client Only request once at start:https://github.com/tikv/client-java/blob/release-3.2/src/main/java/org/tikv/cdc/RegionCDCClient.java#L115 (3) In theory,althoght client request only once, tikv server can response continuously, But In my situation, tikv transfer leader([from_peer="id: 3697249 store_id: 3544963"] [to_peer="id: 3753707 store_id: 3544961"] ).
RegionCdcClient only request once at start, so it could not receive ChangeDataEvent from new leader tikv store image

5. What are your Java Client and TiKV versions? (Required)

Client Java:3.2.0 TiKV:5.1.4

xieyi888 avatar Jul 01 '22 10:07 xieyi888

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] avatar Aug 06 '22 00:08 github-actions[bot]

Me too. Did the owner solve it?

MengXiangDing avatar Sep 07 '22 12:09 MengXiangDing

CDC client was contributed by a community contributor a year ago. Unfortunately, there is nobody officially taking care of this component. We would like to see somebody would like to fork it and build a separate TiDB CDC java client project.

sunxiaoguang avatar Sep 19 '22 13:09 sunxiaoguang

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] avatar Oct 20 '22 01:10 github-actions[bot]

Me too. Did the owner solve it?

hezhenghongmail avatar Jul 18 '23 06:07 hezhenghongmail

@xieyi888

hezhenghongmail avatar Jul 18 '23 06:07 hezhenghongmail

Me too. Did the owner solve it?

Flink-cdc community fixed this bug for reference.

https://github.com/ververica/flink-cdc-connectors/pull/1632

MengXiangDing avatar Jul 19 '23 02:07 MengXiangDing

Me too. Did the owner solve it?

Flink-cdc community fixed this bug for reference.

ververica/flink-cdc-connectors#1632

This commit can't resolve this issue(region transfer leader),because it just retry to read from the same region. Have you make some other changes so you can solve this issue? I've met the same question. @MengXiangDing

Mondler-Agmento avatar Nov 09 '23 07:11 Mondler-Agmento