
persist resource group state failed

Open · geeklc opened this issue 3 months ago · 14 comments

Bug Report

What did you do?

The PD leader logs occasionally report some errors:

[screenshot: PD leader error logs]

What did you expect to see?

What did you see instead?

What version of PD are you using (pd-server -V)?

v8.5.1

geeklc avatar Sep 28 '25 08:09 geeklc

Welcome @geeklc! It looks like this is your first issue to tikv/pd 🎉

ti-chi-bot[bot] avatar Sep 28 '25 08:09 ti-chi-bot[bot]

Can you provide more logs from around this time point, including TiDB's logs? Also, what operations were performed at that time?

okJiang avatar Sep 29 '25 03:09 okJiang

We did nothing at that time. The error log keeps being printed continuously.

geeklc avatar Sep 29 '25 08:09 geeklc

[2025/09/29 15:00:13.442 +08:00] [ERROR] [manager.go:352] ["persist resource group state failed"] [error="[PD:json:ErrJSONMarshal]failed to marshal json: json: unsupported value: NaN"]
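For context, Go's `encoding/json` has no representation for NaN (or ±Inf), so any NaN that ends up in the in-memory resource group state makes the persist step fail with exactly this error. A minimal reproduction, with a purely illustrative field name:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// encoding/json rejects NaN and ±Inf because JSON cannot represent them.
	_, err := json.Marshal(map[string]float64{"fill_rate": math.NaN()})
	fmt.Println(err) // json: unsupported value: NaN
}
```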

Is this the first time the error occurred? I would appreciate the logs before and after the first error. @geeklc

okJiang avatar Sep 30 '25 03:09 okJiang

Logs from the first occurrence of the error: 2025-09.zip pd0919.log

geeklc avatar Sep 30 '25 06:09 geeklc

I noticed the PD leader resigned at this point:

[2025/09/19 21:53:13.465 +08:00] [INFO] [server.go:1768] ["PD leader is ready to serve"] [leader-name=pd-2]

Before the PD leader transfer, did this error occur on the previous PD leader? I need to confirm whether this was the first time the error occurred.

okJiang avatar Sep 30 '25 08:09 okJiang

This is the first error in the PD logs, and the TiDB logs have already been cleared: pd0809_37.log

geeklc avatar Sep 30 '25 09:09 geeklc

From the latest log you provided, I can see that the error suddenly appeared on August 9th. I suspect it was triggered by a create/alter resource group operation at that time. Could you provide the Resource Group settings you used? This would be more helpful for further investigation. @geeklc

[2025/08/09 17:04:49.004 +08:00] [INFO] [grpc_service.go:100] ["watch request"] [key=resource_group/settings]

okJiang avatar Oct 09 '25 03:10 okJiang

I can’t quite remember the exact operations at that time; below is the current resource group:

[screenshots: current resource group settings]

geeklc avatar Oct 09 '25 09:10 geeklc

I also ran into this problem, and it was likewise caused by the use of resource groups. The process had been running for about half a month when it happened.

version: v7.1.1

panic: json: unsupported value: NaN

goroutine 226125595 [running]:
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*ResourceGroup).Copy(0x2a43780?)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/resource_group.go:68 +0x12c
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*Manager).GetResourceGroup(0xc019ecf710?, {0xc160135158?, 0x2b237e0?})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/manager.go:225 +0xc5
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*Service).GetResourceGroup(0xc0015b9cc8?, {0xc0002177d0?, 0xc08912b6e0?}, 0xc08912b5c0?)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/grpc_service.go:100 +0x8a
github.com/pingcap/kvproto/pkg/resource_manager._ResourceManager_GetResourceGroup_Handler.func1({0x3ac01d8, 0xc08912b590}, {0x2c701e0?, 0xc08912b5c0})
	/go/pkg/mod/github.com/pingcap/[email protected]/pkg/resource_manager/resource_manager.pb.go:1886 +0x78
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3ac01d8?, 0xc08912b590?}, {0x2c701e0?, 0xc08912b5c0?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:31 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1({0x3ac01d8, 0xc08912b590}, {0x2c701e0, 0xc08912b5c0}, 0xc04d12a540?, 0xc160178050)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107 +0x87
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3ac01d8?, 0xc08912b590?}, {0x2c701e0?, 0xc08912b5c0?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0x6f
go.etcd.io/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1({0x3ac01d8, 0xc08912b590}, {0x2c701e0?, 0xc08912b5c0}, 0x0?, 0xc160178050)
	/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/api/v3rpc/interceptor.go:70 +0x2a2
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3ac01d8?, 0xc08912b590?}, {0x2c701e0?, 0xc08912b5c0?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0x6f
go.etcd.io/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1({0x3ac01d8, 0xc08912b590}, {0x2c701e0, 0xc08912b5c0}, 0xc130bd2060, 0xc160178050)
	/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/api/v3rpc/interceptor.go:77 +0xc3
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x3ac01d8, 0xc08912b590}, {0x2c701e0, 0xc08912b5c0}, 0xc130bd2060, 0xc04b879938)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:39 +0x1a3
github.com/pingcap/kvproto/pkg/resource_manager._ResourceManager_GetResourceGroup_Handler({0x2c193a0?, 0xc0015b9cc8}, {0x3ac01d8, 0xc08912b590}, 0xc16014ec00, 0xc0038d0060)
	/go/pkg/mod/github.com/pingcap/[email protected]/pkg/resource_manager/resource_manager.pb.go:1888 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0038bcc00, {0x3acf7e0, 0xc02531e480}, 0xc1190d8b00, 0xc0038dc8d0, 0x4d27598, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1024 +0xd5e
google.golang.org/grpc.(*Server).handleStream(0xc0038bcc00, {0x3acf7e0, 0xc02531e480}, 0xc1190d8b00, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1313 +0xa25
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/go/pkg/mod/google.golang.org/[email protected]/server.go:722 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/[email protected]/server.go:720 +0xea
[2025/10/10 10:44:08.907 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-97396e1f-0c9c-4847-8610-9066d0a607f3/172.31.102.49:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[screenshot]
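The trace points at `(*ResourceGroup).Copy` (resource_group.go:68), which, judging from the panic message, serializes the group with `encoding/json` and treats a marshal failure as fatal, so a NaN in any float field of the in-memory state turns a simple read into a panic. A simplified sketch of that failure mode, using a stand-in struct rather than PD's real type:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// resourceGroup is a minimal stand-in for the real type.
type resourceGroup struct {
	Name     string  `json:"name"`
	FillRate float64 `json:"fill_rate"`
}

// copyViaJSON mimics a deep copy done through a JSON round trip.
// If any float field holds NaN, json.Marshal fails and the copy panics,
// which matches the shape of the goroutine dump above.
func copyViaJSON(g *resourceGroup) *resourceGroup {
	data, err := json.Marshal(g)
	if err != nil {
		panic(err) // json: unsupported value: NaN
	}
	out := &resourceGroup{}
	if err := json.Unmarshal(data, out); err != nil {
		panic(err)
	}
	return out
}

func main() {
	defer func() { fmt.Println("recovered:", recover()) }()
	copyViaJSON(&resourceGroup{Name: "rg1", FillRate: math.NaN()})
}
```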

NoskyOrg avatar Oct 10 '25 02:10 NoskyOrg

Please help to take a look @JmPotato @nolouch @glorv

okJiang avatar Oct 10 '25 08:10 okJiang

This is similar to #7206; there must be a race condition where the resource group is read and updated at the same time.
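If a read/update race is the trigger, one generic way to rule out a reader observing half-applied state is to guard the group's mutable fields with a single RWMutex so snapshots never interleave with updates. The sketch below is only an illustration of that pattern; the names are invented and this is not PD's actual code:

```go
package main

import "sync"

// groupState is a hypothetical holder for a resource group's mutable numbers.
type groupState struct {
	mu       sync.RWMutex
	fillRate float64
	tokens   float64
}

// Update applies new values atomically with respect to readers.
func (g *groupState) Update(fillRate, tokens float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.fillRate = fillRate
	g.tokens = tokens
}

// Snapshot returns a consistent pair of values for persisting or copying.
func (g *groupState) Snapshot() (fillRate, tokens float64) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.fillRate, g.tokens
}

func main() {
	g := &groupState{}
	g.Update(1000, 500)
	_, _ = g.Snapshot()
}
```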

glorv avatar Oct 10 '25 11:10 glorv

> This is similar to #7206; there must be a race condition where the resource group is read and updated at the same time.

Wasn't #7206 already fixed in v7? Is the issue still present in v8.5.1? Should we confirm the scope of impact?

geeklc avatar Oct 11 '25 01:10 geeklc