matrixone [Performance]: memcache with rc bytes causes a severe drop in both TP and AP performance

Is there an existing issue for performance?

[X] I have checked the existing issues.

Environment

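Tested on the 129 environment with TPCH 100G q1. Setting memcache to 32g or 2g makes no difference to the results.
On the main branch, q1 averages 21 seconds. After reverting the memcache PR, it averages 8 seconds. The performance gap is very large.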

Details of Performance

Tested on the main branch, q1 averages 21 seconds:

[root@mo-srv-129 mo-tpch]# ./run.sh -q q1 -s 100 -t 3
The times that run tpch test for : 3
2024-04-08 14:13:05 This test will be run for 3 times
2024-04-08 14:13:05 The 1 turn test has started, please wait.......
2024-04-08 14:13:05 Now start to execute the query q1,please wait.....

2024-04-08 14:13:21 The query_id is: 018ebc56-bf0d-75e7-b661-5ab91496945e
2024-04-08 14:13:22 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:13:22 The query q1 has be executed successfully,and cost: 16604
2024-04-08 14:13:22 The 2 turn test has started, please wait.......
2024-04-08 14:13:22 Now start to execute the query q1,please wait.....

2024-04-08 14:13:44 The query_id is: 018ebc56-fffc-774a-993b-6644ab49f03f
2024-04-08 14:13:44 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:13:44 The query q1 has be executed successfully,and cost: 22835
2024-04-08 14:13:44 The 3 turn test has started, please wait.......
2024-04-08 14:13:44 Now start to execute the query q1,please wait.....

2024-04-08 14:14:08 The query_id is: 018ebc57-595b-7541-a3ba-1e979632bf6b
2024-04-08 14:14:08 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:14:08 The query q1 has be executed successfully,and cost: 23826
2024-04-08 14:14:08 This test has been executed successfully
2024-04-08 14:14:08 The sum cost of turn[1] is 16604 ms.
2024-04-08 14:14:08 The sum cost of turn[2] is 22835 ms.
2024-04-08 14:14:08 The sum cost of turn[3] is 23826 ms.
2024-04-08 14:14:08 The avg cost of all turns is 21088 ms.

git revert a4b312c83f81e10d5946900a3bf1590dd9f72884
Replace the memory cache with rc bytes (#14641)

After reverting this PR, q1 averages 8 seconds. The performance gap is very large.

[root@mo-srv-129 mo-tpch]# ./run.sh -q q1 -s 100 -t 3
The times that run tpch test for : 3
2024-04-08 14:11:01 This test will be run for 3 times
2024-04-08 14:11:01 The 1 turn test has started, please wait.......
2024-04-08 14:11:01 Now start to execute the query q1,please wait.....

2024-04-08 14:11:10 The query_id is: 018ebc54-d9d1-73f6-b58c-e9a6d6ee5e09
2024-04-08 14:11:10 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:11:10 The query q1 has be executed successfully,and cost: 9506
2024-04-08 14:11:10 The 2 turn test has started, please wait.......
2024-04-08 14:11:10 Now start to execute the query q1,please wait.....

2024-04-08 14:11:18 The query_id is: 018ebc54-ff1b-79eb-9904-a3a06703520b
2024-04-08 14:11:18 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:11:18 The query q1 has be executed successfully,and cost: 7952
2024-04-08 14:11:18 The 3 turn test has started, please wait.......
2024-04-08 14:11:18 Now start to execute the query q1,please wait.....

2024-04-08 14:11:26 The query_id is: 018ebc55-1e55-7d28-846c-22c21acc5f4e
2024-04-08 14:11:26 Now start to compare query result for [report/TPCH_100/q1.txt,golden/TPCH_100/q1.txt]
2024-04-08 14:11:26 The query q1 has be executed successfully,and cost: 8014
2024-04-08 14:11:26 This test has been executed successfully
2024-04-08 14:11:26 The sum cost of turn[1] is 9506 ms.
2024-04-08 14:11:26 The sum cost of turn[2] is 7952 ms.
2024-04-08 14:11:26 The sum cost of turn[3] is 8014 ms.
2024-04-08 14:11:26 The avg cost of all turns is 8490 ms.

Additional information


Apr 08 '24 06:04 badboynt1

One more test: in the zywl TP stress test, reverting the PR raised card_select_vccode+status from about 150 TPS to about 290 TPS.

[card_select_vccode+status]
START : 2024-04-08 15:30:33
END : 2024-04-08 15:31:54
VUSER : 100
TPS : 157
QPS : 157
SUCCESS : 12210
ERROR : 0
RT_MAX : 10449
RT_MIN : 52
RT_AVG : 633.19
SUC_RATE : 1.0
EXP_RATE : 1.0
RESULT : SUCCEED

[card_select_vccode+status]
START : 2024-04-08 15:28:15
END : 2024-04-08 15:29:44
VUSER : 100
TPS : 295
QPS : 295
SUCCESS : 25314
ERROR : 0
RT_MAX : 4247
RT_MIN : 58
RT_AVG : 338.26
SUC_RATE : 1.0
EXP_RATE : 1.0
RESULT : SUCCEED

Apr 08 '24 07:04 badboynt1

I tested 807cd1804 and a4b312 and saw no real difference between them. This may be related to the 129 machine.

Apr 08 '24 08:04 nnsgmsone

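On the 129 environment, I used fgprof to capture where time was spent waiting during the run. Comparing the two builds shows that time spent waiting on locks has grown, and the CPU cannot be fully utilized. This is essentially the cause of the performance drop.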

[Screenshot: 企业微信截图_17126280942512]

[Screenshot: 企业微信截图_17126281289929]

[Attachment: profile.zip]

Apr 09 '24 02:04 badboynt1

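The cause of this problem is that allocating memory with mmap suffers more contention than make. This depends on the machine and may be related to the kernel version (the exact factor is not yet clear); on my own work machine there is basically no difference. In turn, make suffers more contention than C.malloc. Here is a simple example: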

package main

// #include <stdlib.h>
import "C"
import (
	"syscall"
	"unsafe"
)

var fd = -1

const (
	// MaxArrayLen is a safe maximum length for slices on this architecture.
	MaxArrayLen = 1<<50 - 1
)

//go:linkname throw runtime.throw
func throw(s string)

// malloc allocates zeroed memory from the C heap via calloc.
func malloc(size int) []byte {
	if size == 0 {
		return make([]byte, 0)
	}
	// We need to be conscious of the Cgo pointer passing rules:
	//
	//   https://golang.org/cmd/cgo/#hdr-Passing_pointers
	//
	//   ...
	//   Note: the current implementation has a bug. While Go code is permitted
	//   to write nil or a C pointer (but not a Go pointer) to C memory, the
	//   current implementation may sometimes cause a runtime error if the
	//   contents of the C memory appear to be a Go pointer. Therefore, avoid
	//   passing uninitialized C memory to Go code if the Go code is going to
	//   store pointer values in it. Zero out the memory in C before passing it
	//   to Go.
	ptr := C.calloc(C.size_t(size), 1)
	if ptr == nil {
		// NB: throw is like panic, except it guarantees the process will be
		// terminated. The call below is exactly what the Go runtime invokes when
		// it cannot allocate memory.
		throw("out of memory")
	}
	// Interpret the C pointer as a pointer to a Go array, then slice.
	return (*[MaxArrayLen]byte)(unsafe.Pointer(ptr))[:size:size]
}

// free releases memory obtained from malloc back to the C heap.
func free(data []byte) {
	if cap(data) != 0 {
		if len(data) == 0 {
			data = data[:cap(data)]
		}
		ptr := unsafe.Pointer(&data[0])
		C.free(ptr)
	}
}

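// malloc1 allocates from the Go heap via make.
func malloc1(size int) []byte {
	return make([]byte, size)
}

// free1 is a no-op: memory from make is reclaimed by the garbage collector.
func free1(data []byte) {}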

// malloc2 maps a fresh anonymous private region directly from the kernel.
func malloc2(size int) []byte {
	r0, _, e1 := syscall.Syscall6(syscall.SYS_MMAP, 0, uintptr(size), uintptr(syscall.PROT_READ|syscall.PROT_WRITE),
		uintptr(syscall.MAP_ANON|syscall.MAP_PRIVATE), uintptr(fd), uintptr(0))
	if e1 != 0 {
		throw("out of memory")
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(r0)), size)
}

// free2 unmaps a region returned by malloc2; the munmap result is ignored.
func free2(data []byte) {
	size := cap(data)
	syscall.Syscall(syscall.SYS_MUNMAP, uintptr(unsafe.Pointer(&data[0])), uintptr(size), 0)
}

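// Test parallel allocation: 20 goroutines (19 spawned plus main) allocate,
// touch, and free 1 KiB buffers in an endless loop. Swap malloc2/free2 for
// malloc/free or malloc1/free1 to compare the three allocators.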
func main() {
	size := 1024
	for i := 0; i < 19; i++ {
		go func() {
			for {
				data := malloc2(size)
				for j := 0; j < size; j++ {
					data[j] = byte(j)
				}
				free2(data)
			}
		}()
	}
	for {
		data := malloc2(size)
		for j := 0; j < size; j++ {
			data[j] = byte(j)
		}
		free2(data)
	}
}

CPU usage in the three cases:

mmap: 400% (in testing, this varies a lot between machines: some reach only a little over 300%, others get close to 600%, such as my workstation)
make: 700%
malloc: 2000%

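The profiles for the three test cases are as follows:

mmap: [image: mmap]

make: [image: make]

malloc: [image: malloc]

As the profiles show, make and mmap each carry a different degree of lock overhead. The lock information across threads is as follows:

mmap: [image: mmap]

make: [image: make]

malloc: [image: malloc]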

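The current strategy uses mmap because it lets us bound memory usage precisely. Switching to make, malloc, or any other allocator introduces the problem of the allocator's own memory pool: under frequent allocation the pool grows very large and is hard to control. I don't have a particularly good solution for this at the moment. @badboynt1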

Apr 11 '24 08:04 nnsgmsone

No progress.

Apr 16 '24 10:04 nnsgmsone

No progress.

Apr 23 '24 11:04 nnsgmsone

Fixed.

Apr 28 '24 10:04 reusee

https://github.com/matrixorigin/matrixone/commit/2ce9d1311d76b84c5f909e0d1d2eee2f8755d685

Apr 28 '24 10:04 reusee

The PR has been reverted.

May 08 '24 03:05 aressu1985

The performance problem is resolved. The memory allocation problem still needs to be addressed next.

May 13 '24 10:05 reusee

The PR has been submitted and is waiting to be merged.

May 17 '24 06:05 reusee

malloc has been reimplemented and merged. Tests show no problems with either performance or memory usage.

May 20 '24 03:05 reusee

Related commit: https://github.com/matrixorigin/matrixone/commit/8eaa6ebb12f5996ac9ab4a5e9ea296df6cd38198

May 20 '24 03:05 reusee

fixed

Jul 02 '24 10:07 aressu1985