7days-golang icon indicating copy to clipboard operation
7days-golang copied to clipboard

geerpc 在超时处理方面的 bug

Open Phoenix500526 opened this issue 3 years ago • 15 comments

在 geerpc 的 day4-timeout 中执行 go test -v 有一定概率会出现卡死的现象,具体日志如下:

~/ExcellentOriginSource/7days-golang/gee-rpc/day4-timeout$ go test -v
=== RUN   TestClient_dialTimeout
=== PAUSE TestClient_dialTimeout
=== RUN   TestClient_Call
=== PAUSE TestClient_Call
=== RUN   TestNewService
2020/12/07 12:10:47 rpc server: register Foo.Sum
--- PASS: TestNewService (0.00s)
=== RUN   TestMethodType_Call
2020/12/07 12:10:47 rpc server: register Foo.Sum
--- PASS: TestMethodType_Call (0.00s)
=== CONT  TestClient_dialTimeout
=== CONT  TestClient_Call
2020/12/07 12:10:47 rpc server: register Bar.Timeout
=== RUN   TestClient_dialTimeout/timeout
=== RUN   TestClient_Call/client_timeout
=== RUN   TestClient_dialTimeout/0
=== RUN   TestClient_Call/server_handle_timeout
--- PASS: TestClient_dialTimeout (3.00s)
    --- PASS: TestClient_dialTimeout/timeout (1.00s)
    --- PASS: TestClient_dialTimeout/0 (2.00s)
# 在此处失去响应,需要手动中止程序或等待10m后程序崩溃退出

这种情况下,如果不进行中断,则会在 10 分钟后因创建过多 goroutine 而导致崩溃,出现这种问题的 bug 大概是每跑 5 次出现1 次. 我的代码运行环境为 Ubuntu-16.04 LTS, go 的版本为 version go1.15.4 linux/amd64 通过对问题代码进行定位,我发现出问题的地方是 client.go 的 Call 函数 中的 select 语句

func (client *Client) Call(ctx context.Context, serviceMethod string, args, reply interface{}) error {
	call := client.Go(serviceMethod, args, reply, make(chan *Call, 1))
	select {
	case <-ctx.Done():
		client.removeCall(call.Seq)
		return errors.New("rpc client: call failed: " + ctx.Err().Error())
	case call := <-call.Done:
		return call.Error
       }
}

在 select 语句中会出现 client.done() 未被执行的情况,这种情况下两个 case 都不满足,进而导致测试程序卡死 select 处。这个问题在 day7 处执行 go run main.go 时也有一定概率会出现,结果是导致 main.go 也卡死,陷入无响应状态

Phoenix500526 avatar Dec 07 '20 04:12 Phoenix500526

@Phoenix500526 非常感谢这个反馈,会尽快定位修复的~

geektutu avatar Dec 07 '20 05:12 geektutu

这里分享一下我关于代码的一些定位测试,我在几个函数中加入了打印语句,通过查看日志,我发现导致这个bug 出现的直接原因是 client 的 receive 函数,如下(带有一些我的打印语句):

func (client *Client) receive() {
	var err error
	for err == nil {
		var h codec.Header
		log.Println("Testing, receive:client calling ReadHeader")
		err = client.cc.ReadHeader(&h)
		log.Printf("Testing, receive:err =%v, header = %v\n", err, h)
		if err != nil {
			log.Println("Testing, receive:000000000000000000000000")
			break
		}
		log.Println("Testing, receive:============================")
		call := client.removeCall(h.Seq)
		log.Println("Testing, receive:------------------------")
		switch {
		case call == nil:
			log.Println("Testing, receive:111111111111111111111111")
			err = client.cc.ReadBody(nil)
		case h.Error != "":
			call.Error = fmt.Errorf(h.Error)
			err = client.cc.ReadBody(nil)
			log.Println("Testing, receive:222222222222222222222222")
			call.done()
		default:
			err = client.cc.ReadBody(call.Reply)
			if err != nil {
				call.Error = errors.New("reading body " + err.Error())
			}
			log.Println("Testing, receive:333333333333333333333333")
			call.done()
		}
	}
	// if error occurs, terminateCalls pending calls
	client.terminateCalls(err)
}

对应的 log 如下:

=== RUN   TestClient_Call
2020/12/07 14:23:45 rpc server: register Bar.Timeout
=== RUN   TestClient_Call/server_handle_timeout
    client_test.go:71: server handle timeout:333333333333333
2020/12/07 14:23:46 Testing, Call : 11111111111111111111111111111
2020/12/07 14:23:46 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:46 Testing, receive:client calling ReadHeader
2020/12/07 14:23:51 Testing, Call : 444444444444444444444444444
    client_test.go:73: server handle timeout:444444444444444444
--- FAIL: TestClient_Call (6.01s)
    --- FAIL: TestClient_Call/server_handle_timeout (5.00s)
panic: assertion failed: expect a timeout error [recovered]
	panic: assertion failed: expect a timeout error

goroutine 5 [running]:
testing.tRunner.func1.1(0x600720, 0xc000091770)
	/usr/local/go/src/testing/testing.go:1072 +0x30d
testing.tRunner.func1(0xc000001500)
	/usr/local/go/src/testing/testing.go:1075 +0x41a
panic(0x600720, 0xc000091770)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
command-line-arguments._assert(...)
	/home/phoenix/MyProject/go-rpc/client_test.go:16
command-line-arguments.TestClient_Call.func1(0xc000001500)
	/home/phoenix/MyProject/go-rpc/client_test.go:74 +0x337
testing.tRunner(0xc000001500, 0xc00000e040)
	/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x2b3
FAIL	command-line-arguments	6.014s
FAIL

而成功时候的 log 如下:

=== RUN   TestClient_Call
2020/12/07 14:23:42 rpc server: register Bar.Timeout
=== RUN   TestClient_Call/server_handle_timeout
    client_test.go:71: server handle timeout:333333333333333
2020/12/07 14:23:43 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:43 Testing, receive:client calling ReadHeader
2020/12/07 14:23:43 Testing, Call : 11111111111111111111111111111
2020/12/07 14:23:43 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:43 Testing, handleRequest: timeout =  1s
2020/12/07 14:23:44 handleRequest send timeout
2020/12/07 14:23:44 Testing, receive:err =<nil>, header = {Bar.Timeout 1 rpc server: request handle timeout:expect within 1s}
2020/12/07 14:23:44 Testing, receive:============================
2020/12/07 14:23:44 Testing, receive:------------------------
2020/12/07 14:23:44 Testing, receive:222222222222222222222222
2020/12/07 14:23:44 Testing, done:11111111111111111111111
2020/12/07 14:23:44 Testing, receive:client calling ReadHeader
2020/12/07 14:23:44 Testing, Call : 333333333333333333333333333
    client_test.go:73: server handle timeout:444444444444444444
--- PASS: TestClient_Call (2.02s)
    --- PASS: TestClient_Call/server_handle_timeout (1.01s)
PASS
ok  	command-line-arguments	2.022s

可以看到在成功的情况下,receive 函数会被顺利的调用,并得到读取后的 header 的信息。而失败的情况下,程序执行到 err = client.cc.ReadHeader(&h) 就不动了。我看到 ReadHeader 中并没有阻塞的操作,不太清楚为何会卡住。 另外,我发现加入 log 打印以后,会使得出现这个 bug 的几率降低。猜测可能是因为缺少某种同步机制。以上是我调试后得出的一些结论,应该有助于博主 debug

Phoenix500526 avatar Dec 07 '20 06:12 Phoenix500526

@Phoenix500526 我跑的时候也遇到这个问题了,大佬找到根源所在了吗?

DiangD avatar Dec 10 '20 07:12 DiangD

本地一直没有复现,不过这一块代码中的确有处理不好的地方。比如客户端和服务端为了检测超时,开启了新的协程,但是超时时,并没有机制关掉这些协程,会一直阻塞,占用资源。

geektutu avatar Dec 10 '20 17:12 geektutu

这里分享一下我关于代码的一些定位测试,我在几个函数中加入了打印语句,通过查看日志,我发现导致这个bug 出现的直接原因是 client 的 receive 函数,如下(带有一些我的打印语句):

func (client *Client) receive() {
	var err error
	for err == nil {
		var h codec.Header
		log.Println("Testing, receive:client calling ReadHeader")
		err = client.cc.ReadHeader(&h)
		log.Printf("Testing, receive:err =%v, header = %v\n", err, h)
		if err != nil {
			log.Println("Testing, receive:000000000000000000000000")
			break
		}
		log.Println("Testing, receive:============================")
		call := client.removeCall(h.Seq)
		log.Println("Testing, receive:------------------------")
		switch {
		case call == nil:
			log.Println("Testing, receive:111111111111111111111111")
			err = client.cc.ReadBody(nil)
		case h.Error != "":
			call.Error = fmt.Errorf(h.Error)
			err = client.cc.ReadBody(nil)
			log.Println("Testing, receive:222222222222222222222222")
			call.done()
		default:
			err = client.cc.ReadBody(call.Reply)
			if err != nil {
				call.Error = errors.New("reading body " + err.Error())
			}
			log.Println("Testing, receive:333333333333333333333333")
			call.done()
		}
	}
	// if error occurs, terminateCalls pending calls
	client.terminateCalls(err)
}

对应的 log 如下:

=== RUN   TestClient_Call
2020/12/07 14:23:45 rpc server: register Bar.Timeout
=== RUN   TestClient_Call/server_handle_timeout
    client_test.go:71: server handle timeout:333333333333333
2020/12/07 14:23:46 Testing, Call : 11111111111111111111111111111
2020/12/07 14:23:46 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:46 Testing, receive:client calling ReadHeader
2020/12/07 14:23:51 Testing, Call : 444444444444444444444444444
    client_test.go:73: server handle timeout:444444444444444444
--- FAIL: TestClient_Call (6.01s)
    --- FAIL: TestClient_Call/server_handle_timeout (5.00s)
panic: assertion failed: expect a timeout error [recovered]
	panic: assertion failed: expect a timeout error

goroutine 5 [running]:
testing.tRunner.func1.1(0x600720, 0xc000091770)
	/usr/local/go/src/testing/testing.go:1072 +0x30d
testing.tRunner.func1(0xc000001500)
	/usr/local/go/src/testing/testing.go:1075 +0x41a
panic(0x600720, 0xc000091770)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
command-line-arguments._assert(...)
	/home/phoenix/MyProject/go-rpc/client_test.go:16
command-line-arguments.TestClient_Call.func1(0xc000001500)
	/home/phoenix/MyProject/go-rpc/client_test.go:74 +0x337
testing.tRunner(0xc000001500, 0xc00000e040)
	/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x2b3
FAIL	command-line-arguments	6.014s
FAIL

而成功时候的 log 如下:

=== RUN   TestClient_Call
2020/12/07 14:23:42 rpc server: register Bar.Timeout
=== RUN   TestClient_Call/server_handle_timeout
    client_test.go:71: server handle timeout:333333333333333
2020/12/07 14:23:43 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:43 Testing, receive:client calling ReadHeader
2020/12/07 14:23:43 Testing, Call : 11111111111111111111111111111
2020/12/07 14:23:43 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:43 Testing, handleRequest: timeout =  1s
2020/12/07 14:23:44 handleRequest send timeout
2020/12/07 14:23:44 Testing, receive:err =<nil>, header = {Bar.Timeout 1 rpc server: request handle timeout:expect within 1s}
2020/12/07 14:23:44 Testing, receive:============================
2020/12/07 14:23:44 Testing, receive:------------------------
2020/12/07 14:23:44 Testing, receive:222222222222222222222222
2020/12/07 14:23:44 Testing, done:11111111111111111111111
2020/12/07 14:23:44 Testing, receive:client calling ReadHeader
2020/12/07 14:23:44 Testing, Call : 333333333333333333333333333
    client_test.go:73: server handle timeout:444444444444444444
--- PASS: TestClient_Call (2.02s)
    --- PASS: TestClient_Call/server_handle_timeout (1.01s)
PASS
ok  	command-line-arguments	2.022s

可以看到在成功的情况下,receive 函数会被顺利的调用,并得到读取后的 header 的信息。而失败的情况下,程序执行到 err = client.cc.ReadHeader(&h) 就不动了。我看到 ReadHeader 中并没有阻塞的操作,不太清楚为何会卡住。 另外,我发现加入 log 打印以后,会使得出现这个 bug 的几率降低。猜测可能是因为缺少某种同步机制。以上是我调试后得出的一些结论,应该有助于博主 debug

我也遇到这个问题,一直卡死在GobCodec的ReadHeader里无法返回,不知为何?

yeahxian avatar Dec 19 '20 09:12 yeahxian

同样遇到这个问题了。。。

andcarefree avatar Jan 22 '21 15:01 andcarefree

‘rpc server: read header error: gob: unknown type id or corrupted data’ 偶尔会报这个错,不知道gob出了什么问题

andcarefree avatar Jan 22 '21 15:01 andcarefree

第一节提到gob与json编解码器用法几乎一样,于是我增加了一个json的opt,发现换成json编解码器后卡死几率大幅提升

andcarefree avatar Jan 25 '21 15:01 andcarefree

我觉得这个问题就是序列化和反序列化的问题,可以参考#34 测试时发现client将请求发送出去了,但server没有收到。可以考虑不只是client发送option的json数据,而是双方进行一次握手后再发送gob数据。

shengxiang19 avatar May 29 '21 11:05 shengxiang19

我发现单独测试HandleTimeout,已经调用了call.done,Call方法中依然无法从call.Done中读取到call,从打印中也可以看到Done是相同的;

=== RUN   TestClient_Call
=== PAUSE TestClient_Call
=== CONT  TestClient_Call
2021/09/14 15:34:25 rpc server: register Bar.Timeout
=== RUN   TestClient_Call/server_handle_timeout
Testing >>>>>> receive
Testing >>>>>> ServeConn
Testing >>>>>> ServeCodec
Testing >>>>>> read request
Testing >>>>>> handle request
Testing >>>>>> handleRequest
Testing >>>>>> call
Testing >>>>>> send on called
Testing >>>>>> called
Testing >>>>>> sendResponse end
done call:&{Seq:1 ServiceMethod:Bar.Timeout Args:1 Reply:0xc00011a9c0 Error:<nil> Done:0xc000164480}
Testing >>>>>> done
Testing >>>>>> readbody
call:&{Seq:1 ServiceMethod:Bar.Timeout Args:1 Reply:0xc00011a9c0 Error:<nil> Done:0xc000164480}

cyj19 avatar Sep 14 '21 07:09 cyj19

‘rpc server: read header error: gob: unknown type id or corrupted data’ 偶尔会报这个错,不知道gob出了什么问题

应该是 tcp 粘包了。我在 NewClient 的时候,服务器接收到 Option 之后,让服务器给客户端一个响应,即进行两次握手之后,这个问题就没出现过了。如下图所示:

  • 服务端 image
  • 客户端 image

cuglaiyp avatar Jan 06 '22 09:01 cuglaiyp

day4中超时处理问题还没有修复吗? 下面这个代码我认为会导致goroutine泄露吧?

  • 如果超时了,后面没有人会读取called, sent中的内容,导致goroutine一直无法结束。
func (server *Server) handleRequest(cc codec.Codec, req *request, sending *sync.Mutex, wg *sync.WaitGroup, timeout time.Duration) {
	defer wg.Done()
	called := make(chan struct{})
	sent := make(chan struct{})
	go func() {
		err := req.svc.call(req.mtype, req.argv, req.replyv)
		called <- struct{}{}
		if err != nil {
			req.h.Error = err.Error()
			server.sendResponse(cc, req.h, invalidRequest, sending)
			sent <- struct{}{}
			return
		}
		server.sendResponse(cc, req.h, req.replyv.Interface(), sending)
		sent <- struct{}{}
	}()

	if timeout == 0 {
		<-called
		<-sent
		return
	}
	select {
	case <-time.After(timeout):
		req.h.Error = fmt.Sprintf("rpc server: request handle timeout: expect within %s", timeout)
		server.sendResponse(cc, req.h, invalidRequest, sending)
	case <-called:
		<-sent
	}
}

chang-you-ren8 avatar Aug 04 '23 05:08 chang-you-ren8

day4中超时处理问题还没有修复吗? 下面这个代码我认为会导致goroutine泄露吧?

  • 如果超时了,后面没有人会读取called, sent中的内容,导致goroutine一直无法结束。
func (server *Server) handleRequest(cc codec.Codec, req *request, sending *sync.Mutex, wg *sync.WaitGroup, timeout time.Duration) {
	defer wg.Done()
	called := make(chan struct{})
	sent := make(chan struct{})
	go func() {
		err := req.svc.call(req.mtype, req.argv, req.replyv)
		called <- struct{}{}
		if err != nil {
			req.h.Error = err.Error()
			server.sendResponse(cc, req.h, invalidRequest, sending)
			sent <- struct{}{}
			return
		}
		server.sendResponse(cc, req.h, req.replyv.Interface(), sending)
		sent <- struct{}{}
	}()

	if timeout == 0 {
		<-called
		<-sent
		return
	}
	select {
	case <-time.After(timeout):
		req.h.Error = fmt.Sprintf("rpc server: request handle timeout: expect within %s", timeout)
		server.sendResponse(cc, req.h, invalidRequest, sending)
	case <-called:
		<-sent
	}
}

这个是不是把chan设置为有缓冲的就可以避免阻塞了?

lee-could avatar Aug 19 '23 09:08 lee-could

day4中超时处理问题还没有修复吗? 下面这个代码我认为会导致goroutine泄露吧?

  • 如果超时了,后面没有人会读取called, sent中的内容,导致goroutine一直无法结束。
func (server *Server) handleRequest(cc codec.Codec, req *request, sending *sync.Mutex, wg *sync.WaitGroup, timeout time.Duration) {
	defer wg.Done()
	called := make(chan struct{})
	sent := make(chan struct{})
	go func() {
		err := req.svc.call(req.mtype, req.argv, req.replyv)
		called <- struct{}{}
		if err != nil {
			req.h.Error = err.Error()
			server.sendResponse(cc, req.h, invalidRequest, sending)
			sent <- struct{}{}
			return
		}
		server.sendResponse(cc, req.h, req.replyv.Interface(), sending)
		sent <- struct{}{}
	}()

	if timeout == 0 {
		<-called
		<-sent
		return
	}
	select {
	case <-time.After(timeout):
		req.h.Error = fmt.Sprintf("rpc server: request handle timeout: expect within %s", timeout)
		server.sendResponse(cc, req.h, invalidRequest, sending)
	case <-called:
		<-sent
	}
}

这个是不是把chan设置为有缓冲的就可以避免阻塞了?

我理解可以,只要设置成带有缓冲的就行了。不过无法避免发送不一致的消息(第一次可能发送了超时的响应, 后面又发送了正常的响应给客户端)

chang-you-ren8 avatar Aug 19 '23 09:08 chang-you-ren8

day4中超时处理问题还没有修复吗? 下面这个代码我认为会导致goroutine泄露吧?

  • 如果超时了,后面没有人会读取called, sent中的内容,导致goroutine一直无法结束。
func (server *Server) handleRequest(cc codec.Codec, req *request, sending *sync.Mutex, wg *sync.WaitGroup, timeout time.Duration) {
	defer wg.Done()
	called := make(chan struct{})
	sent := make(chan struct{})
	go func() {
		err := req.svc.call(req.mtype, req.argv, req.replyv)
		called <- struct{}{}
		if err != nil {
			req.h.Error = err.Error()
			server.sendResponse(cc, req.h, invalidRequest, sending)
			sent <- struct{}{}
			return
		}
		server.sendResponse(cc, req.h, req.replyv.Interface(), sending)
		sent <- struct{}{}
	}()

	if timeout == 0 {
		<-called
		<-sent
		return
	}
	select {
	case <-time.After(timeout):
		req.h.Error = fmt.Sprintf("rpc server: request handle timeout: expect within %s", timeout)
		server.sendResponse(cc, req.h, invalidRequest, sending)
	case <-called:
		<-sent
	}
}

这个是不是把chan设置为有缓冲的就可以避免阻塞了?

我理解可以,只要设置成带有缓冲的就行了。不过无法避免发送不一致的消息(第一次可能发送了超时的响应, 后面又发送了正常的响应给客户端)

为什么会多次发送响应呢?如果超时发生,发送超时响应给客户端,并退出select块。不会再有gorutine等待called和sent,后面写入这两个chan,看起来不会有任何side effects。所以不可能再次发送正常响应给客户端。对吗?

frankguo77 avatar Oct 18 '23 03:10 frankguo77