[Performance]: memcache with rc bytes causes a severe drop in both TP and AP performance
Is there an existing issue for performance?
- [X] I have checked the existing issues.
Environment
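Tested on the 129 environment with TPC-H 100 GB, q1. Setting memcache to 32 GB or 2 GB makes no difference to the results.
On the main branch, q1 averages 21 seconds. After reverting the memcache PR, it averages 8 seconds. The gap is far too large.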
Details of Performance
On the main branch, q1 averages 21 seconds:
[root@mo-srv-129 mo-tpch]# ./run.sh -q q1 -s 100 -t 3
The times that run tpch test for : 3
2024-04-08 14:13:05 This test will be run for 3 times
2024-04-08 14:13:05 The 1 turn test has started, please wait.......
2024-04-08 14:13:05 Now start to execute the query q1,please wait.....
2024-04-08 14:13:21 The query_id is: 018ebc56-bf0d-75e7-b661-5ab91496945e
2024-04-08 14:13:22 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:13:22 The query q1 has be executed successfully,and cost: 16604
2024-04-08 14:13:22 The 2 turn test has started, please wait.......
2024-04-08 14:13:22 Now start to execute the query q1,please wait.....
2024-04-08 14:13:44 The query_id is: 018ebc56-fffc-774a-993b-6644ab49f03f
2024-04-08 14:13:44 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:13:44 The query q1 has be executed successfully,and cost: 22835
2024-04-08 14:13:44 The 3 turn test has started, please wait.......
2024-04-08 14:13:44 Now start to execute the query q1,please wait.....
2024-04-08 14:14:08 The query_id is: 018ebc57-595b-7541-a3ba-1e979632bf6b
2024-04-08 14:14:08 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:14:08 The query q1 has be executed successfully,and cost: 23826
2024-04-08 14:14:08 This test has been executed successfully
2024-04-08 14:14:08 The sum cost of turn[1] is 16604 ms.
2024-04-08 14:14:08 The sum cost of turn[2] is 22835 ms.
2024-04-08 14:14:08 The sum cost of turn[3] is 23826 ms.
2024-04-08 14:14:08 The avg cost of all turns is 21088 ms.
git revert a4b312c83f81e10d5946900a3bf1590dd9f72884
Replace the memory cache with rc bytes (#14641)
After reverting this PR, q1 averages 8 seconds. The gap is far too large.
[root@mo-srv-129 mo-tpch]# ./run.sh -q q1 -s 100 -t 3
The times that run tpch test for : 3
2024-04-08 14:11:01 This test will be run for 3 times
2024-04-08 14:11:01 The 1 turn test has started, please wait.......
2024-04-08 14:11:01 Now start to execute the query q1,please wait.....
2024-04-08 14:11:10 The query_id is: 018ebc54-d9d1-73f6-b58c-e9a6d6ee5e09
2024-04-08 14:11:10 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:11:10 The query q1 has be executed successfully,and cost: 9506
2024-04-08 14:11:10 The 2 turn test has started, please wait.......
2024-04-08 14:11:10 Now start to execute the query q1,please wait.....
2024-04-08 14:11:18 The query_id is: 018ebc54-ff1b-79eb-9904-a3a06703520b
2024-04-08 14:11:18 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:11:18 The query q1 has be executed successfully,and cost: 7952
2024-04-08 14:11:18 The 3 turn test has started, please wait.......
2024-04-08 14:11:18 Now start to execute the query q1,please wait.....
2024-04-08 14:11:26 The query_id is: 018ebc55-1e55-7d28-846c-22c21acc5f4e
2024-04-08 14:11:26 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:11:26 The query q1 has be executed successfully,and cost: 8014
2024-04-08 14:11:26 This test has been executed successfully
2024-04-08 14:11:26 The sum cost of turn[1] is 9506 ms.
2024-04-08 14:11:26 The sum cost of turn[2] is 7952 ms.
2024-04-08 14:11:26 The sum cost of turn[3] is 8014 ms.
2024-04-08 14:11:26 The avg cost of all turns is 8490 ms.
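The averages in the two runs above can be checked, and the gap expressed as a ratio, with a few lines of Go. This is only a sketch for reading the logs: the per-turn costs are copied from the output above, while the helper name `avgMillis` and the choice to truncate the mean to whole milliseconds are assumptions made here (truncation happens to reproduce both logged averages).

```go
package main

import "fmt"

// avgMillis returns the mean of the per-turn costs in milliseconds,
// truncated to an integer. (Hypothetical helper, not part of run.sh.)
func avgMillis(costs []int) int {
	if len(costs) == 0 {
		return 0
	}
	sum := 0
	for _, c := range costs {
		sum += c
	}
	return sum / len(costs)
}

func main() {
	mainBranch := []int{16604, 22835, 23826} // q1 turns on the main branch
	reverted := []int{9506, 7952, 8014}      // q1 turns after the revert

	a, b := avgMillis(mainBranch), avgMillis(reverted)
	// Prints: main: 21088 ms, reverted: 8490 ms, slowdown: 2.48x
	fmt.Printf("main: %d ms, reverted: %d ms, slowdown: %.2fx\n",
		a, b, float64(a)/float64(b))
}
```

In other words, q1 on the main branch takes roughly two and a half times as long as it does with the PR reverted.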
Additional information
One more test: in the zywl TP stress test, reverting this PR raises card_select_vccode+status from about 150 to about 290 TPS.
[card_select_vccode+status]
START : 2024-04-08 15:30:33
END : 2024-04-08 15:31:54
VUSER : 100
TPS : 157
QPS : 157
SUCCESS : 12210
ERROR : 0
RT_MAX : 10449
RT_MIN : 52
RT_AVG : 633.19
SUC_RATE : 1.0
EXP_RATE : 1.0
RESULT : SUCCEED
[card_select_vccode+status]
START : 2024-04-08 15:28:15
END : 2024-04-08 15:29:44
VUSER : 100
TPS : 295
QPS : 295
SUCCESS : 25314
ERROR : 0
RT_MAX : 4247
RT_MIN : 58
RT_AVG : 338.26
SUC_RATE : 1.0
EXP_RATE : 1.0
RESULT : SUCCEED
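I tested 807cd1804 and a4b312 and saw essentially no difference. It may be related to the 129 machine.
The cause is that contention when allocating memory with mmap is higher than with make. This depends on the machine and appears to be related to the kernel version (the exact factor is not yet clear); on my work machine there is basically no difference. Contention with make is in turn higher than with C.malloc. Here is a simple example: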
package main

// #include <stdlib.h>
import "C"

import (
	"syscall"
	"unsafe"
)

// fd is -1 because the mapping is anonymous and not backed by a file.
var fd = -1

const (
	// MaxArrayLen is a safe maximum length for slices on this architecture.
	MaxArrayLen = 1<<50 - 1
)

//go:linkname throw runtime.throw
func throw(s string)

// malloc allocates size zeroed bytes from the C heap.
func malloc(size int) []byte {
	if size == 0 {
		return make([]byte, 0)
	}
	// We need to be conscious of the Cgo pointer passing rules:
	//
	// https://golang.org/cmd/cgo/#hdr-Passing_pointers
	//
	// ...
	// Note: the current implementation has a bug. While Go code is permitted
	// to write nil or a C pointer (but not a Go pointer) to C memory, the
	// current implementation may sometimes cause a runtime error if the
	// contents of the C memory appear to be a Go pointer. Therefore, avoid
	// passing uninitialized C memory to Go code if the Go code is going to
	// store pointer values in it. Zero out the memory in C before passing it
	// to Go.
	ptr := C.calloc(C.size_t(size), 1)
	if ptr == nil {
		// NB: throw is like panic, except it guarantees the process will be
		// terminated. The call below is exactly what the Go runtime invokes when
		// it cannot allocate memory.
		throw("out of memory")
	}
	// Interpret the C pointer as a pointer to a Go array, then slice.
	return (*[MaxArrayLen]byte)(unsafe.Pointer(ptr))[:size:size]
}

// free releases memory obtained from malloc.
func free(data []byte) {
	if cap(data) != 0 {
		if len(data) == 0 {
			data = data[:cap(data)]
		}
		ptr := unsafe.Pointer(&data[0])
		C.free(ptr)
	}
}

// malloc1 allocates from the Go heap.
func malloc1(size int) []byte {
	return make([]byte, size)
}

// free1 is a no-op: the garbage collector reclaims Go heap memory.
func free1(data []byte) {
}

// malloc2 allocates with an anonymous private mmap.
func malloc2(size int) []byte {
	r0, _, e1 := syscall.Syscall6(syscall.SYS_MMAP, 0, uintptr(size), uintptr(syscall.PROT_READ|syscall.PROT_WRITE),
		uintptr(syscall.MAP_ANON|syscall.MAP_PRIVATE), uintptr(fd), uintptr(0))
	if e1 != 0 {
		throw("out of memory")
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(r0)), size)
}

// free2 unmaps memory obtained from malloc2.
func free2(data []byte) {
	size := cap(data)
	syscall.Syscall(syscall.SYS_MUNMAP, uintptr(unsafe.Pointer(&data[0])), uintptr(size), 0)
}

// test parallel alloc
// Swap malloc2/free2 for malloc/free or malloc1/free1 to test the other allocators.
func main() {
	size := 1024
	// 19 background goroutines plus the main goroutine allocate concurrently.
	for i := 0; i < 19; i++ {
		go func() {
			for {
				data := malloc2(size)
				for j := 0; j < size; j++ {
					data[j] = byte(j)
				}
				free2(data)
			}
		}()
	}
	for {
		data := malloc2(size)
		for j := 0; j < size; j++ {
			data[j] = byte(j)
		}
		free2(data)
	}
}
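CPU usage in the three cases:
mmap: 400% (testing shows this varies widely between machines: some reach only a little over 300%, while others, such as my workstation, get close to 600%)
make: 700%
malloc: 2000%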
The profiles of the three test cases are as follows:
[Figures: profiles for mmap, make, and malloc]
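As the profiles show, make and mmap each incur some degree of lock overhead. The lock information under multithreading is as follows:
[Figures: lock information for mmap, make, and malloc]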
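The current strategy uses mmap because it lets memory usage be capped at a fixed amount. Switching to make/malloc or any other allocator brings in the allocator's own memory pool, which grows very large under frequent allocation and is hard to control. I do not have a particularly good solution at the moment. @badboynt1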
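To make the trade-off concrete, here is a minimal sketch of one way to bound the memory an application-level buffer pool retains: a fixed-capacity free list that drops buffers once it is full. It is a hypothetical illustration only. `boundedPool` and its methods are names invented here, not MatrixOne code and not the fix that was later merged. It also caps only what the pool itself holds; it does not control how much memory the underlying allocator (the Go heap in this sketch) keeps internally.

```go
package main

import "fmt"

// boundedPool keeps at most cap(free) buffers of one fixed size for reuse.
// (Hypothetical type, for illustration only.)
type boundedPool struct {
	size int
	free chan []byte
}

func newBoundedPool(size, maxRetained int) *boundedPool {
	return &boundedPool{size: size, free: make(chan []byte, maxRetained)}
}

// Get reuses a retained buffer when one is available and otherwise allocates.
func (p *boundedPool) Get() []byte {
	select {
	case b := <-p.free:
		for i := range b {
			b[i] = 0 // hand out zeroed memory, like make does
		}
		return b
	default:
		return make([]byte, p.size)
	}
}

// Put retains the buffer if there is room and drops it otherwise, so the pool
// never holds more than maxRetained*size bytes.
func (p *boundedPool) Put(b []byte) {
	if cap(b) != p.size {
		return // not a buffer of this pool's size class
	}
	select {
	case p.free <- b[:p.size]:
	default: // pool is full: let the garbage collector reclaim the buffer
	}
}

// Retained reports how many bytes the pool currently holds.
func (p *boundedPool) Retained() int { return len(p.free) * p.size }

func main() {
	p := newBoundedPool(1024, 2)
	a, b, c := p.Get(), p.Get(), p.Get()
	p.Put(a)
	p.Put(b)
	p.Put(c)                  // dropped: the pool already holds two buffers
	fmt.Println(p.Retained()) // prints 2048
}
```

A buffered channel is itself protected by a lock, so a pool like this trades allocator contention for channel contention; whether that is a net win would need to be measured on the affected machine.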
No progress.
Fixed.
https://github.com/matrixorigin/matrixone/commit/2ce9d1311d76b84c5f909e0d1d2eee2f8755d685
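The PR has been reverted.
The performance problem is resolved; the memory allocation problem still needs to be addressed.
A PR has been submitted and is awaiting merge.
malloc has been reimplemented and merged. Tests show no problems with either performance or memory usage.
Related commit: https://github.com/matrixorigin/matrixone/commit/8eaa6ebb12f5996ac9ab4a5e9ea296df6cd38198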
fixed