7days-golang
A bug in geerpc's timeout handling
Running go test -v in geerpc's day4-timeout occasionally hangs. The log:
~/ExcellentOriginSource/7days-golang/gee-rpc/day4-timeout$ go test -v
=== RUN TestClient_dialTimeout
=== PAUSE TestClient_dialTimeout
=== RUN TestClient_Call
=== PAUSE TestClient_Call
=== RUN TestNewService
2020/12/07 12:10:47 rpc server: register Foo.Sum
--- PASS: TestNewService (0.00s)
=== RUN TestMethodType_Call
2020/12/07 12:10:47 rpc server: register Foo.Sum
--- PASS: TestMethodType_Call (0.00s)
=== CONT TestClient_dialTimeout
=== CONT TestClient_Call
2020/12/07 12:10:47 rpc server: register Bar.Timeout
=== RUN TestClient_dialTimeout/timeout
=== RUN TestClient_Call/client_timeout
=== RUN TestClient_dialTimeout/0
=== RUN TestClient_Call/server_handle_timeout
--- PASS: TestClient_dialTimeout (3.00s)
--- PASS: TestClient_dialTimeout/timeout (1.00s)
--- PASS: TestClient_dialTimeout/0 (2.00s)
# Hangs here; the process has to be killed manually, or it crashes after the 10-minute test timeout
If it is not interrupted, the run crashes after 10 minutes because too many goroutines have been created. The bug shows up roughly once every 5 runs. My environment is Ubuntu 16.04 LTS with go1.15.4 linux/amd64. By narrowing things down, I found the problem is in the select statement of the Call function in client.go:
func (client *Client) Call(ctx context.Context, serviceMethod string, args, reply interface{}) error {
	call := client.Go(serviceMethod, args, reply, make(chan *Call, 1))
	select {
	case <-ctx.Done():
		client.removeCall(call.Seq)
		return errors.New("rpc client: call failed: " + ctx.Err().Error())
	case call := <-call.Done:
		return call.Error
	}
}
In this select, there are runs where call.done() is never executed; neither case becomes ready, so the test hangs at the select. The same thing can happen when running go run main.go in day7, which then also hangs and becomes unresponsive.
@Phoenix500526 Thanks a lot for the report; I'll try to track this down and fix it soon~
Let me share some of my debugging. I added print statements to several functions, and from the logs the direct trigger of this bug is the client's receive function, shown below (with some of my print statements):
func (client *Client) receive() {
	var err error
	for err == nil {
		var h codec.Header
		log.Println("Testing, receive:client calling ReadHeader")
		err = client.cc.ReadHeader(&h)
		log.Printf("Testing, receive:err =%v, header = %v\n", err, h)
		if err != nil {
			log.Println("Testing, receive:000000000000000000000000")
			break
		}
		log.Println("Testing, receive:============================")
		call := client.removeCall(h.Seq)
		log.Println("Testing, receive:------------------------")
		switch {
		case call == nil:
			log.Println("Testing, receive:111111111111111111111111")
			err = client.cc.ReadBody(nil)
		case h.Error != "":
			call.Error = fmt.Errorf(h.Error)
			err = client.cc.ReadBody(nil)
			log.Println("Testing, receive:222222222222222222222222")
			call.done()
		default:
			err = client.cc.ReadBody(call.Reply)
			if err != nil {
				call.Error = errors.New("reading body " + err.Error())
			}
			log.Println("Testing, receive:333333333333333333333333")
			call.done()
		}
	}
	// if error occurs, terminateCalls pending calls
	client.terminateCalls(err)
}
The corresponding log:
=== RUN TestClient_Call
2020/12/07 14:23:45 rpc server: register Bar.Timeout
=== RUN TestClient_Call/server_handle_timeout
client_test.go:71: server handle timeout:333333333333333
2020/12/07 14:23:46 Testing, Call : 11111111111111111111111111111
2020/12/07 14:23:46 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:46 Testing, receive:client calling ReadHeader
2020/12/07 14:23:51 Testing, Call : 444444444444444444444444444
client_test.go:73: server handle timeout:444444444444444444
--- FAIL: TestClient_Call (6.01s)
--- FAIL: TestClient_Call/server_handle_timeout (5.00s)
panic: assertion failed: expect a timeout error [recovered]
panic: assertion failed: expect a timeout error
goroutine 5 [running]:
testing.tRunner.func1.1(0x600720, 0xc000091770)
/usr/local/go/src/testing/testing.go:1072 +0x30d
testing.tRunner.func1(0xc000001500)
/usr/local/go/src/testing/testing.go:1075 +0x41a
panic(0x600720, 0xc000091770)
/usr/local/go/src/runtime/panic.go:969 +0x1b9
command-line-arguments._assert(...)
/home/phoenix/MyProject/go-rpc/client_test.go:16
command-line-arguments.TestClient_Call.func1(0xc000001500)
/home/phoenix/MyProject/go-rpc/client_test.go:74 +0x337
testing.tRunner(0xc000001500, 0xc00000e040)
/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1168 +0x2b3
FAIL command-line-arguments 6.014s
FAIL
And the log from a successful run:
=== RUN TestClient_Call
2020/12/07 14:23:42 rpc server: register Bar.Timeout
=== RUN TestClient_Call/server_handle_timeout
client_test.go:71: server handle timeout:333333333333333
2020/12/07 14:23:43 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:43 Testing, receive:client calling ReadHeader
2020/12/07 14:23:43 Testing, Call : 11111111111111111111111111111
2020/12/07 14:23:43 Testing readRequestHeader: calling ReadHeader
2020/12/07 14:23:43 Testing, handleRequest: timeout = 1s
2020/12/07 14:23:44 handleRequest send timeout
2020/12/07 14:23:44 Testing, receive:err =<nil>, header = {Bar.Timeout 1 rpc server: request handle timeout:expect within 1s}
2020/12/07 14:23:44 Testing, receive:============================
2020/12/07 14:23:44 Testing, receive:------------------------
2020/12/07 14:23:44 Testing, receive:222222222222222222222222
2020/12/07 14:23:44 Testing, done:11111111111111111111111
2020/12/07 14:23:44 Testing, receive:client calling ReadHeader
2020/12/07 14:23:44 Testing, Call : 333333333333333333333333333
client_test.go:73: server handle timeout:444444444444444444
--- PASS: TestClient_Call (2.02s)
--- PASS: TestClient_Call/server_handle_timeout (1.01s)
PASS
ok command-line-arguments 2.022s
As you can see, in the successful run receive proceeds normally and reads the decoded header. In the failing run, execution reaches err = client.cc.ReadHeader(&h) and never moves again. I don't see any blocking operation inside ReadHeader, so I'm not sure why it gets stuck. I also noticed that adding log statements lowers the probability of hitting the bug, which makes me suspect some synchronization mechanism is missing. These are the conclusions from my debugging; I hope they help with the fix.
@Phoenix500526 I hit this problem when running it too. Did you find the root cause?
I haven't been able to reproduce it locally, but this part of the code does have some rough spots. For example, both the client and the server spawn extra goroutines to detect timeouts, but when a timeout fires there is no mechanism to stop those goroutines; they stay blocked and keep holding resources.
I ran into this too: it stays stuck inside GobCodec's ReadHeader and never returns. Any idea why?
Same problem here...
'rpc server: read header error: gob: unknown type id or corrupted data' — I occasionally get this error and don't know what is wrong with gob.
Chapter 1 mentions that the gob and json codecs are used in almost the same way, so I added a json Option; with the json codec the hang becomes much more frequent.
I think this is a serialization/deserialization problem; see #34. In my tests the client sent the request but the server never received it. Instead of the client simply firing off the JSON-encoded Option, the two sides could perform a handshake first and only then exchange gob data.
Testing HandleTimeout on its own, call.done has already been called, yet the Call method still cannot read the call from call.Done; the prints show the Done channels are the same:
=== RUN TestClient_Call
=== PAUSE TestClient_Call
=== CONT TestClient_Call
2021/09/14 15:34:25 rpc server: register Bar.Timeout
=== RUN TestClient_Call/server_handle_timeout
Testing >>>>>> receive
Testing >>>>>> ServeConn
Testing >>>>>> ServeCodec
Testing >>>>>> read request
Testing >>>>>> handle request
Testing >>>>>> handleRequest
Testing >>>>>> call
Testing >>>>>> send on called
Testing >>>>>> called
Testing >>>>>> sendResponse end
done call:&{Seq:1 ServiceMethod:Bar.Timeout Args:1 Reply:0xc00011a9c0 Error:<nil> Done:0xc000164480}
Testing >>>>>> done
Testing >>>>>> readbody
call:&{Seq:1 ServiceMethod:Bar.Timeout Args:1 Reply:0xc00011a9c0 Error:<nil> Done:0xc000164480}
'rpc server: read header error: gob: unknown type id or corrupted data' — I occasionally get this error and don't know what is wrong with gob.
That is probably TCP packet coalescing ("sticky packets"). In NewClient, after the server receives the Option, I have the server send a response back to the client — i.e. a two-step handshake — and the problem has not appeared since. As shown below:
- Server side (screenshot omitted)
- Client side (screenshot omitted)
Has the day4 timeout issue still not been fixed? I believe the code below leaks goroutines:
- If a timeout occurs, nothing ever reads from called or sent, so the spawned goroutine can never exit.
func (server *Server) handleRequest(cc codec.Codec, req *request, sending *sync.Mutex, wg *sync.WaitGroup, timeout time.Duration) {
	defer wg.Done()
	called := make(chan struct{})
	sent := make(chan struct{})
	go func() {
		err := req.svc.call(req.mtype, req.argv, req.replyv)
		called <- struct{}{}
		if err != nil {
			req.h.Error = err.Error()
			server.sendResponse(cc, req.h, invalidRequest, sending)
			sent <- struct{}{}
			return
		}
		server.sendResponse(cc, req.h, req.replyv.Interface(), sending)
		sent <- struct{}{}
	}()
	if timeout == 0 {
		<-called
		<-sent
		return
	}
	select {
	case <-time.After(timeout):
		req.h.Error = fmt.Sprintf("rpc server: request handle timeout: expect within %s", timeout)
		server.sendResponse(cc, req.h, invalidRequest, sending)
	case <-called:
		<-sent
	}
}
Can the blocking be avoided simply by making the channels buffered?
I think so; making them buffered should be enough. It still cannot prevent inconsistent responses, though (the first send may be the timeout response, and a normal response may still be sent to the client afterwards).
Why would multiple responses be sent? When a timeout occurs, the timeout response is sent and the select block is exited. No goroutine is waiting on called and sent anymore, so later writes into those channels seem to have no side effects. So a normal response cannot be sent again — right?