tiflow icon indicating copy to clipboard operation
tiflow copied to clipboard

dm-worker keeps retrying to execute ddl when encounter "invalid connection" error

Open sleepymole opened this issue 3 years ago • 9 comments

What did you do?

  1. Replicate data from one MySQL to TiDB.
  2. Execute ddl on upstream MySQL:
ALTER TABLE xxx MODIFY COLUMN xxx SMALLINT(4) NOT NULL DEFAULT _UTF8MB4'0'

What did you expect to see?

No error is reported.

What did you see instead?

DM encountered "invalid connection" error and keeps retrying to execute ddl for every 5 minutes.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v2.0.6

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)

sleepymole avatar Feb 24 '22 08:02 sleepymole

I meet this problem a few days ago,i found it is MaxDDLConnectionTimeoutMinute set 5min cause the problem.

jiyfhust avatar Feb 24 '22 11:02 jiyfhust

can i make a pr to solve the problem?

jiyfhust avatar Feb 24 '22 11:02 jiyfhust

/cc @lance6716

sleepymole avatar Feb 24 '22 12:02 sleepymole

can i make a pr to solve the problem?

welcome! before writing codes, can you give some brief introduction of your fixing? we can discuss the effects in advance.

lance6716 avatar Feb 24 '22 12:02 lance6716

  1. https://github.com/pingcap/tiflow/blob/c34585737ab371126c2cdd0fe196dc23fdf68f78/dm/syncer/syncer.go#L3324

SetReadTimeout(maxDDLConnectionTimeout)

2.https://github.com/pingcap/tiflow/blob/c34585737ab371126c2cdd0fe196dc23fdf68f78/dm/pkg/conn/basedb.go#L97

		dsn += fmt.Sprintf("&readTimeout=%s", rawCfg.ReadTimeout)

3.https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/dsn.go#L51

Timeout          time.Duration     // Dial timeout
ReadTimeout      time.Duration     // I/O read timeout
WriteTimeout     time.Duration     // I/O write timeout

4.https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/packets.go#L115

   conn.SetReadDeadline(time.Now().Add(mc.cfg.ReadTimeout))

Because ddl reorganization like add index,modify column may take a long time, is this proper to set ReadTimeout ulimited? this invalid connection error is the ReadTimeout problem.

but The MaxDDLConnectionTimeoutMinute is connection timeout argument, go-sql-driver packet should set the "Timeout time.Duration // Dial timeout"

Is dm not right set the argument?

5.https://github.com/pingcap/tiflow/blob/c34585737ab371126c2cdd0fe196dc23fdf68f78/dm/syncer/error.go#L75

    // it only ignore `invalid connection` error (timeout or other causes) for `ADD INDEX`.
// `invalid connection` means some data already sent to the server,
// and we assume that the whole SQL statement has already sent to the server for this error.
// if we have other methods to judge the DDL dispatched but timeout for executing, we can update this method.
// NOTE: we must ensure other PK/UK exists for correctness.
// NOTE: when we are refactoring the shard DDL algorithm, we also need to consider supporting non-blocking `ADD 
INDEX`.

there is a handler to ignore the invalid connection error when add index, is it also needed if set ReadTimeout unlimited?

so, maybe there is there step:

  1. set ReadTimeout ulimited
  2. add set connection timeout?
  3. remove invalid connetion ignore logic

/cc @lance6716

jiyfhust avatar Feb 25 '22 02:02 jiyfhust

@jiyfhust I have ask the reason why we add a ReadTimeout here, the main scenario is DM needs to distinguish a slow DDL (no COM_QUERY_Response for a long time) from a dead downstream. We can agree that for a slow DDL, DM should not retry it again; for a dead downstream, DM should do something rather than waiting.

"1. set ReadTimeout ulimited" itself can't know a downstream is dead. And I'm not sure how to "2. add set connection timeout". There's are some linux kernel feature about TCP keepalive, if this feature is available, enabling it turns the problem into this: we can know the TCP connection is dead or not, but will the downstream MySQL/TiDB/RDS responses to the query in future when the TCP connection is not dead? For example, will MySQL/TiDB/RDS sliently drop the query after receiving it? (I guess no but haven't check it by MySQL Client/Server Protocol) Will MySQL/TiDB/RDS drop some query when reading from the socket and treat it as not received because of something? Will the MySQL/TiDB/RDS fail to send COM_QUERY_Response and not retry?

If you can find some proof about above question and correctly set the TCP keepalive feature, I think it's OK to totally remove the ReadTimeout from the application layer.

Another solution is after the ReadTimeout, we can use ADMIN SHOW DDL to check if downstream has really received the query, this is more safe IMO but may need more code work.

Feel free to discuss!

lance6716 avatar Feb 25 '22 04:02 lance6716

"2. add set connection timeout" i mean it is the timeout when dm connecting the downstreams,not like tcp keepalive.

there seems no good method to check downstream alive through mysql protocol by a connection Executing sql query. Is the problem "no COM_QUERY_Response for a long time" occurred from some mysql proxy or lvs?

if we use ADMIN SHOW DDL or query information_schema.ddl_jobs, by what method to judge the dm ddl sql? May be ddl job_id or the query sql or some way else?
hi @lance6716

jiyfhust avatar Feb 25 '22 09:02 jiyfhust

I think it will take a long time to fix it by myself. Maybe some one who is familiar with dm to fix it is a better choice.

If dm syncer a ddl like "modify column", it may trigger a serious TiDB bug which is fixed and mergered to 5.3.0 just three days before.

ddl: fix concurrent column type changes(with changing data) that cause schema and data inconsistencies

jiyfhust avatar Feb 25 '22 10:02 jiyfhust

In fact I haven't experienced "no COM_QUERY_Response for a long time", I guess it can be caused by any components in the network link, for example the router is down.

DM can know the DDL in invalidConnF. Through ADMIN SHOW DDL JOB QUERIES <ID> or other ways it can check if downstream has received the DDL, and through ADMIN SHOW DDL JOBS it can check if the DDL of DM is finished.

Don't worry, any kind of contribution is good!

lance6716 avatar Feb 27 '22 03:02 lance6716