Introduce RDMA transport
Main changes in this patch:
- introduce the "Redis Over RDMA" protocol; see the Protocol section in RDMA.md
- implement the server side of the connection module only; this means RDMA support can NOT be compiled as built-in, only loaded as a module (a registration sketch follows this list)
- add the necessary information in RDMA.md
- support 'CONFIG SET/GET', for example 'CONFIG SET rdma.port 6380'; the change can then be verified with 'rdma res show cm_id' and redis-cli (with RDMA support, which is not implemented in this patch)
- the full list of listeners then shows like:

```
listener0:name=tcp,bind=*,bind=-::*,port=6379
listener1:name=unix,bind=/var/run/redis.sock
listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
listener3:name=tls,bind=*,bind=-::*,port=16379
```
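For context, here is a hedged sketch (not the PR's actual code) of how a loadable connection module registers itself, modeled on the dynamically loadable TLS support: `ConnectionType` and `connTypeRegister()` come from src/connection.h on the unstable branch, while `CT_RDMA` and the stub handler below are illustrative assumptions.

```c
/* Hedged sketch of a loadable connection module, modeled on the
 * BUILD_TLS=module path; CT_RDMA and connRDMAGetType are illustrative
 * names, not the PR's actual code. */
#include "server.h"
#include "connection.h"

static const char *connRDMAGetType(struct connection *conn) {
    (void) conn;
    return "rdma";
}

static ConnectionType CT_RDMA = {
    .get_type = connRDMAGetType,
    /* .init/.listen/.accept_handler/.read/.write/.close would wrap
     * rdma_cm and ibverbs calls, mirroring what the tcp and tls
     * connection types do for plain sockets. */
};

/* Built as a module only: redis-server dlopen()s this .so via
 * --loadmodule, which is why RDMA support cannot be compiled in. */
int RedisModule_OnLoad(void *ctx, RedisModuleString **argv, int argc) {
    (void) ctx; (void) argv; (void) argc;
    if (connTypeRegister(&CT_RDMA) != C_OK) return REDISMODULE_ERR;
    return REDISMODULE_OK;
}
```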
valgrind test works fine:
```
valgrind --track-origins=yes --suppressions=./src/valgrind.sup \
    --show-reachable=no --show-possibly-lost=no --leak-check=full \
    --log-file=err.txt ./src/redis-server --port 6379 \
    --loadmodule src/redis-rdma.so port=6379 bind=xx.xx.xx.xx \
    --loglevel verbose --protected-mode no --server_cpulist 2 \
    --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3 \
    --appendonly no
```
performance test: server side:
```
# TCP port 6379 has no conflict with RDMA port 6379
./src/redis-server --port 6379 \
    --loadmodule src/redis-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy \
    --loglevel verbose --protected-mode no --server_cpulist 2 --bio_cpulist 3 \
    --aof_rewrite_cpulist 3 --bgsave_cpulist 3 --appendonly no
```
Build a redis-benchmark with RDMA support (not implemented in this patch), and run it on x86 (Intel Platinum 8260) with a RoCEv2 interface (Mellanox ConnectX-5). Client side:
```
./src/redis-benchmark -h xx.xx.xx.xx -p 6379 -c 30 -n 10000000 --threads 4 \
    -d 1024 -t ping,get,set --rdma
```

```
====== PING_INLINE ======
480561.28 requests per second, 0.060 msec avg latency.
====== PING_MBULK ======
540482.06 requests per second, 0.053 msec avg latency.
====== SET ======
399952.00 requests per second, 0.073 msec avg latency.
====== GET ======
443498.31 requests per second, 0.065 msec avg latency.
```
MRs & issues in history:
- Issue: Support RDMA as transport layer protocol
- MR: Support RDMA as transport layer protocol
- MR: Support RDMA as transport layer protocol by rsocket
- MR: Fully abstract connection and make TLS dynamically loadable
Hi @yossigo @oranagra Could you please give me any hint about this PR? I'd like to know your concerns/suggestions/plans ...
Hi Redis committee members @yossigo @oranagra. I'm a Ph.D. student at Stanford focusing on cloud computing. In my recent research project, I found this RDMA implementation for Redis to be both impressively organized and performant. Given the widespread use of Redis as a benchmark in high-performance and I/O software stack studies, setting up a standard for RDMA implementation would greatly enhance the relevance and fairness of comparative analyses. Currently, many RDMA implementations reported in research papers are ad-hoc and infrequently publicly available, skewing benchmarks and comparisons. I believe that this PR can potentially serve as a starting point and it might be worth having a core member steering this development with @pizhenwei. Please consider bringing this up in an upcoming committee meeting, as it will undoubtedly benefit a broad audience.
Sorry for the lack of response here, @pizhenwei, my sincere apologies; this PR is still marked in my followup list, but I never got to look at it. The fact is that we simply lack the time and knowledge to really review it and make any informed decisions. Without these, I don't think it would be right to merge it and officially support it. Not sure how we can proceed, sorry. On the bright side, after all the work we previously did, the interfaces are stable, so this code can maybe be maintained elsewhere in the meanwhile.
Hi @oranagra After Redis 7.2, I believe that Redis Over RDMA could be built as a shared library and loaded by redis-server dynamically. As you mentioned, "this code can maybe be maintained elsewhere" works in theory. However, Redis is an ecosystem with proxy and client implementations, and it needs a "Redis Over RDMA" protocol to transfer the payload. From my point of view, the Redis side SHOULD at least define the "Redis Over RDMA" protocol. Then the implementation could be:
- part of the Redis source code (like this PR).
- an out-of-tree connection library (maybe another repo).

Either is fine with me; I volunteer to contribute/maintain this driver.
> I found this RDMA implementation for Redis to be both impressively organized and performant. [...] I believe that this PR can potentially serve as a starting point and it might be worth having a core member steering this development with @pizhenwei.
Hi @xiezhq-hermann Thanks for your feedback, glad to see this!
> However, Redis is an ecosystem with proxy and client implementations, and it needs a "Redis Over RDMA" protocol to transfer the payload. From my point of view, the Redis side SHOULD at least define the "Redis Over RDMA" protocol.
I understand that this part must be official in order for libraries to be able to adapt to it, but I don't think I have the time or knowledge to make any informed decision about it. @yossigo maybe you have advice.
I agree there should be a formal specification of Redis over RDMA, but I also don't have the knowledge necessary to approach this. @pizhenwei Do you have a rough outline of what such a spec should look like? For example, are there major decisions we'd need to make first? Are there other notable examples of similar protocols that have an RDMA-based spec?
Hi @yossigo @oranagra
Let me enumerate several RDMA-based specs here (as far as I know):
- storage-specific specs: iSER and NVMe-oF are quite famous and popular, but this type of protocol has a distinguishing feature: fixed-length payloads, for example 4K/8K reads/writes. (We don't know a KV's length before getting it from Redis.)
- a common RPC library, for example brpc. This is designed as a general RPC framework; it is quite common and easy to use, but not performance-first.
- SMC-R has been available since Linux 4.11; from the test results, it seems not good enough (up to ~1.4x QPS).
- papers from academia, typically a ring buffer driven by RDMA write / write-with-imm. Links:
  - Socksdirect: datacenter sockets can be fast and compatible
  - LITE Kernel RDMA Support for Datacenter Applications
  - FaRM: Fast Remote Memory
In our proposal, we (with my colleagues @zhuojiang123 and @zhangyiming1201 from the RDMA team, and our friend @sigempty from Duke University) introduce a ring-buffer-like mechanism for Redis (similar to those papers, but not the same).
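To make the idea concrete, here is a minimal sketch of the ring-buffer concept, assuming illustrative names (`rdma_ring`, `ring_produce`); the actual proposal defines its own layout and posts RDMA write / write-with-imm operations instead of the local `memcpy` used here:

```c
/* Minimal ring-buffer sketch: the producer writes variable-length
 * commands (a KV length is unknown in advance) into a circular
 * buffer registered as an RDMA memory region; offsets are exchanged
 * so neither side blocks on the other. Names are illustrative. */
#include <stdint.h>
#include <string.h>

typedef struct {
    char     *buf;    /* registered memory region (MR) */
    uint32_t  length; /* capacity, power of two */
    uint32_t  head;   /* write offset, owned by the producer */
    uint32_t  tail;   /* read offset, published back by the consumer */
} rdma_ring;

/* Bytes the producer may still write before the consumer must
 * publish a newer tail (e.g. via an RDMA write-with-imm doorbell). */
static uint32_t ring_free_space(const rdma_ring *r) {
    return r->length - (r->head - r->tail);
}

/* Stage a variable-length payload into the ring, handling wraparound;
 * a real transport would post an RDMA_WRITE(_WITH_IMM) covering this
 * range instead of copying locally. */
static int ring_produce(rdma_ring *r, const void *data, uint32_t len) {
    if (len > ring_free_space(r)) return -1;   /* ring full, retry later */
    uint32_t off   = r->head & (r->length - 1);
    uint32_t first = r->length - off < len ? r->length - off : len;
    memcpy(r->buf + off, data, first);
    memcpy(r->buf, (const char *)data + first, len - first);
    r->head += len;
    return 0;
}
```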
I created a new PR to introduce the protocol only; maybe I should drop the protocol part from this PR.
Cc @xiezhq-hermann @Spartee, suggestions/feedback are welcome!
> In our proposal, we introduce a ring-buffer-like mechanism for Redis (similar to those papers, but not the same).
As a key technology for high-speed networks, RDMA has been deployed at large scale within ByteDance and applied in multiple services. We have been paying attention to the huge performance improvement RDMA brings to Redis, and we are willing to provide more supplementary information about RDMA if necessary.
We will work with @pizhenwei to continuously promote the optimization of the Redis over RDMA protocol. We look forward to its final realization and believe it will play a role in various related applications.
I am wondering whether performance is the only criterion for evaluation.
> I am wondering whether performance is the only criterion for evaluation.
Yes, the performance is NOT good enough, but I'm not against SMC-R for the Redis socket.
However, this is not the entire reason. Generally, the closer the protocol gets to the upper layer, the higher the performance the upper layer can reach.
- SMC-R is designed as a general protocol, so it's easy to use from user processes. For a regular/personal program that is NOT deployed widely, SMC-R may be a good choice.
- But this one (Redis Over RDMA) is designed for Redis: it adapts to Redis's streaming transmission (message transmission, plus feature negotiation). As the test results show, it reaches 2.5x QPS. And Redis is deployed almost everywhere in the world, so it's worth developing a dedicated protocol.
Hi @pizhenwei
I work with the OFA on maintaining the FSDP cluster, and we were looking to get this patch tested on the appropriate hardware. The patch still applies on the unstable branch, but it seems to use the variable `MAX_ACCEPTS_PER_CALL`, which no longer exists in the Redis repository, so it no longer builds. I was wondering if you had any insight into this issue, or if you would be able to rebase the commit onto the tip of the unstable branch. I can roll back to the commit it was applied on for the time being and test it there, but it would be great to be able to test the patch on latest as well. Thank you in advance for any assistance!
> ... it seems like it uses the variable `MAX_ACCEPTS_PER_CALL` which no longer exists in the Redis repository, so it no longer builds. [...] it would be great to be able to test the patch on latest as well.
Hi @JSpewock
I rebased the code against the latest unstable (commit 4df037962dd446a4a686e2b6d56d5367b6c9f0db) and force-pushed.
I also tested this patch; it applies cleanly after commit c3f8b542eecec1a52de88f402e310687c36d03e9 (Manage number of new connections per cycle (#12178)). So if you want to apply this patch against a released version, I guess these commands would work:
```
git format-patch -1    # run in this branch; xxx.patch should be generated
git am xxx.patch       # apply xxx.patch to the released version
```
Any suggestion & feedback is welcome! Have fun!
Hi @pizhenwei
Out of curiosity, how have you been testing the server when you update the PR? When I do a `make RDMA_BUILD=module` and then try to run the redis-benchmark that you provide in the original PR description, the benchmark hangs; the server accepts the connection but then throws the error `# RDMA: FATAL error, unknown cmd`. Do you know why this would be happening? I've tried InfiniBand, Omni-Path, RoCE, and iWARP fabrics, but they all throw the same message.
> ... the benchmark hangs and the server is accepting the connection but then throwing the error `# RDMA: FATAL error, unknown cmd` ...
Hi,
I forgot to update the client side URL in this PR. It should be https://github.com/pizhenwei/redis/tree/feature-rdma-with-cli.
PS: the new command is used for feature negotiation; once any new optional feature/command is supported in the future, a version difference between server and client will not block communication.
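For illustration only, a hedged sketch of that negotiation idea; the opcode and field names below are assumptions, not the actual "Redis Over RDMA" wire format defined in RDMA.md:

```c
/* Hypothetical control-message layout for feature negotiation; the
 * real protocol in RDMA.md defines its own opcodes and fields. */
#include <stdint.h>

enum rdma_opcode {
    RDMA_CMD_GET_SERVER_FEATURE = 0, /* client asks what the server supports */
    RDMA_CMD_SET_CLIENT_FEATURE = 1, /* client picks the subset it will use */
    /* Future optional commands extend this enum; a peer that receives
     * an opcode it does not understand rejects it, which is the
     * "RDMA: FATAL error, unknown cmd" symptom with a mismatched client. */
};

typedef struct {
    uint16_t version;  /* protocol version of the sender */
    uint16_t opcode;   /* one of enum rdma_opcode */
    uint32_t features; /* bitmap of optional features */
} rdma_control_msg;

/* Keep only the features both sides understand, so differing
 * server/client versions still interoperate. */
static uint32_t negotiate_features(uint32_t client, uint32_t server) {
    return client & server;
}
```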
I have only tested this on RoCE/RXE (soft-RoCE on Linux); fixes for other platforms are welcome!
Many cloud providers offer RDMA acceleration on their cloud platforms, so I think there is a foundational basis for applying Redis over RDMA. We performed some performance tests of this PR on 8th-generation ECS instances (g8ae.2xlarge, 16 vCPUs, 32G DDR) provided by Alibaba Cloud. The results indicate that, compared to TCP sockets, using RDMA can significantly enhance performance.
Test command of server side:
```
./src/redis-server --port 6379 \
    --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
    --loglevel verbose --protected-mode no \
    --server_cpulist 2 --bio_cpulist 3 --aof_rewrite_cpulist 3 \
    --bgsave_cpulist 3 --appendonly no
```
Test command of client side:
```
# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 \
    --threads 4 -d 1024 -t ping,get,set --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 \
    --threads 4 -d 1024 -t ping,get,set
```
The performance test results are shown in the following table. Throughput increases by at least 70%, and the average (AVG) and P99 latencies drop by at least 33%.
| | RDMA | TCP | RDMA vs TCP |
|---|---|---|---|
| PING_INLINE | | | |
| Throughput (req/s) | 644122.38 | 376704.59 | 170.99% |
| Latency-AVG (ms) | 0.045 | 0.076 | 59.21% |
| Latency-P99 (ms) | 0.079 | 0.119 | 66.39% |
| PING_MBULK | | | |
| Throughput (req/s) | 665601.69 | 380314.91 | 175.01% |
| Latency-AVG (ms) | 0.044 | 0.075 | 58.67% |
| Latency-P99 (ms) | 0.079 | 0.119 | 66.39% |
| SET | | | |
| Throughput (req/s) | 596979.31 | 314940.78 | 189.55% |
| Latency-AVG (ms) | 0.049 | 0.092 | 53.26% |
| Latency-P99 (ms) | 0.079 | 0.143 | 55.24% |
| GET | | | |
| Throughput (req/s) | 615308.88 | 335570.47 | 183.36% |
| Latency-AVG (ms) | 0.047 | 0.085 | 55.29% |
| Latency-P99 (ms) | 0.079 | 0.135 | 58.52% |
Besides, I saw the comment from @yossigo that RDMA networks are not easily accessible. If necessary, I could try reaching out to the relevant colleagues to see whether we can offer some Alibaba Cloud ECS instances to the community for free, so that you can use and test Redis over RDMA, and for future CI/CD purposes as well.
Hi @hz-cheng I notice that you are the author of the Alibaba Cloud eRDMA driver for both the Linux kernel and rdma-core. Cooooooooool!
I'm happy to see feedback from the cloud vendor side; this means lots of end users will enjoy the improvement easily.
@yossigo @oranagra PING!
So sad, years of waiting have made me lose patience and confidence.