valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Introduce Valkey Over RDMA transport

Open pizhenwei opened this issue 1 year ago • 8 comments

Hi,

Since 2021/06, I created a PR for Redis Over RDMA proposal. Then I did some work to fully abstract connection and make TLS dynamically loadable, a new connection type could be built into Redis statically, or a separated shared library(loaded by Redis on startup) since Redis 7.2.0.

Base on the new connection framework, I created a new PR, some guys(@xiezhq-hermann @zhangyiming1201 @JSpewock @uvletter @FujiZ) noticed, played and tested this PR. However, because of the lack of time and knowledge from the maintainers, this PR has been pending about 2 years.

Changes in this PR:

  • introduce Valkey Over RDMA specification. (same as Redis, and this should be same)
  • implement Valkey Over RDMA. (compact the Valkey style)

Finally, if this feature is considered to merge, I volunteer to maintain it.

pizhenwei avatar May 09 '24 06:05 pizhenwei

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.24%. Comparing base (9948f07) to head (9d24d15).

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable     #477   +/-   ##
=========================================
  Coverage     70.23%   70.24%           
=========================================
  Files           112      112           
  Lines         60602    60602           
=========================================
+ Hits          42566    42568    +2     
+ Misses        18036    18034    -2     

see 9 files with indirect coverage changes

codecov[bot] avatar May 09 '24 07:05 codecov[bot]

This PR could be tested by client.

To build client with RDMA:

make BUILD_RDMA=yes -j16

To test by commands:

Config of redis: appendonly no, port 6379, rdma-port 6379, appendonly no,
                 server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma \
	  --server_cpulist 2 --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

pizhenwei avatar May 11 '24 07:05 pizhenwei

Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP sockets, the use of RDMA can significantly enhance performance.

Test command of server side:

./src/valkey-server --port 6379 \
  --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
  --loglevel verbose --protected-mode no \
  --server_cpulist 12 --bgsave_cpulist 16 --appendonly no 

Test command of client side:

# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

The performance test results are as shown in the following table. Apart from LRANGE_100 (performance improvement but not substantially), in other scenarios (PING, SET, GET) the throughput can be increased by at least 76%, and the average (AVG) and P99 latencies can be reduced by at least 40%.

RDMA TCP RDMA/TCP
PING_INLINE
Throughput 666577.81 366394.31 181.93%
Latency-AVG 0.044 0.08 55.00%
Latency-P99 0.063 0.127 49.61%
PING_MBULK
Throughput 688657.81 395397.56 174.17%
Latency-AVG 0.042 0.073 57.53%
Latency-P99 0.063 0.119 52.94%
SET
Throughput 434744.78 157726.22 275.63%
Latency-AVG 0.068 0.188 36.17%
Latency-P99 0.111 0.183 60.66%
GET
Throughput 562587.94 319478.59 176.10%
Latency-AVG 0.052 0.091 57.14%
Latency-P99 0.079 0.151 52.32%
LRANGE
Throughput 526260.38 211434.36 248.90%
Latency-AVG 0.056 0.14 40.00%
Latency-P99 0.079 0.159 49.69%
LRANGE_100
Throughput 57106.96 49498.34 115.37%
Latency-AVG 0.427 0.499 85.57%
Latency-P99 4.207 13.367 31.47%

hz-cheng avatar May 20 '24 06:05 hz-cheng

Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Redis over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP sockets, the use of RDMA can significantly enhance performance.

Test command of server side:

./src/valkey-server --port 6379 \
  --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
  --loglevel verbose --protected-mode no \
  --server_cpulist 12 --bgsave_cpulist 16 --appendonly no 

Test command of client side:

# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

The performance test results are as shown in the following table. Apart from LRANGE_100 (performance improvement but not substantially), in other scenarios (PING, SET, GET) the throughput can be increased by at least 76%, and the average (AVG) and P99 latencies can be reduced by at least 40%.

指标 RDMA TCP RDMA/TCP PING_INLINE Throughput 666577.81 366394.31 181.93% Latency-AVG 0.044 0.08 55.00% Latency-P99 0.063 0.127 49.61% PING_MBULK Throughput 688657.81 395397.56 174.17% Latency-AVG 0.042 0.073 57.53% Latency-P99 0.063 0.119 52.94% SET Throughput 434744.78 157726.22 275.63% Latency-AVG 0.068 0.188 36.17% Latency-P99 0.111 0.183 60.66% GET Throughput 562587.94 319478.59 176.10% Latency-AVG 0.052 0.091 57.14% Latency-P99 0.079 0.151 52.32% LRANGE Throughput 526260.38 211434.36 248.90% Latency-AVG 0.056 0.14 40.00% Latency-P99 0.079 0.159 49.69% LRANGE_100 Throughput 57106.96 49498.34 115.37% Latency-AVG 0.427 0.499 85.57% Latency-P99 4.207 13.367 31.47%

Hi, @hz-cheng

I notice that you are the author of alibaba-cloud erdma driver for both linux kernel and rdma-core. Cooooooooool!

pizhenwei avatar May 20 '24 10:05 pizhenwei

More, If necessary, I could try reaching out to relevant colleagues to see if we can offer some Alibaba Cloud ECS instances to the community for free, so that the community can use and test Valkey over RDMA, as well as for future CI/CD purposes.

hz-cheng avatar May 21 '24 06:05 hz-cheng

Is there a corresponding client that enables RDMA?

baronwangr avatar May 21 '24 09:05 baronwangr

Is there a corresponding client that enables RDMA?

See this comment please.

pizhenwei avatar May 21 '24 09:05 pizhenwei

Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Redis over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP sockets, the use of RDMA can significantly enhance performance. Test command of server side:

./src/valkey-server --port 6379 \
  --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
  --loglevel verbose --protected-mode no \
  --server_cpulist 12 --bgsave_cpulist 16 --appendonly no 

Test command of client side:

# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

The performance test results are as shown in the following table. Apart from LRANGE_100 (performance improvement but not substantially), in other scenarios (PING, SET, GET) the throughput can be increased by at least 76%, and the average (AVG) and P99 latencies can be reduced by at least 40%. 指标 RDMA TCP RDMA/TCP PING_INLINE Throughput 666577.81 366394.31 181.93% Latency-AVG 0.044 0.08 55.00% Latency-P99 0.063 0.127 49.61% PING_MBULK Throughput 688657.81 395397.56 174.17% Latency-AVG 0.042 0.073 57.53% Latency-P99 0.063 0.119 52.94% SET Throughput 434744.78 157726.22 275.63% Latency-AVG 0.068 0.188 36.17% Latency-P99 0.111 0.183 60.66% GET Throughput 562587.94 319478.59 176.10% Latency-AVG 0.052 0.091 57.14% Latency-P99 0.079 0.151 52.32% LRANGE Throughput 526260.38 211434.36 248.90% Latency-AVG 0.056 0.14 40.00% Latency-P99 0.079 0.159 49.69% LRANGE_100 Throughput 57106.96 49498.34 115.37% Latency-AVG 0.427 0.499 85.57% Latency-P99 4.207 13.367 31.47%

Hi, @hz-cheng

I notice that you are the author of alibaba-cloud erdma driver for both linux kernel and rdma-core. Cooooooooool!

Hi @madolson , The feedback from the cloud vendor(alibaba-cloud) side shows the improvement, this means lots of end-user will enjoy it easily. Please let me know any concern about this feature.

pizhenwei avatar May 24 '24 01:05 pizhenwei

Almost doubled throughput is impressive. I don't know much about RDMA. It's many lines of code, but all of it is the module. That's great, but what are the risks of breaking it if we change something in the connection abstractions? We need to be aware that when we merge this, we will have to keep maintaining this.

Is it possible to use TLS with RDMA?

zuiderkwast avatar May 26 '24 18:05 zuiderkwast

@pizhenwei The numbers do look great. I haven't gotten a chance to look at it yet, hopefully some time this week.

madolson avatar May 26 '24 18:05 madolson

Almost doubled throughput is impressive. I don't know much about RDMA. It's many lines of code, but all of it is the module. That's great, but what are the risks of breaking it if we change something in the connection abstractions? We need to be aware that when we merge this, we will have to keep maintaining this.

Hi,

Because the valkey-rdma.so(if built as a module) uses the struct ConnectionType as ABI, the rdma support must change together with the core connection abstractions.

To avoid the ricks from the mismatched struct ConnectionType, the module side check the version strictly(so does valkey-tls.so) like:

    /* Connection modules MUST be part of the same build as valkey. */
    if (strcmp(REDIS_BUILD_ID_RAW, serverBuildIdRaw())) {
        serverLog(LL_NOTICE, "Connection type %s was not built together with the valkey-server used.", CONN_TYPE_RDMA);
        return VALKEYMODULE_ERR;
    }

Once the core connection abstraction changes, all the connection types should do compat work, this rule is also applicative for rdma. I volunteer to maintain this rdma support.

PS: I have experience on open source community like Linux kernel, QEMU, Redis, SPDK, libiscsi, tgt, atop, utils-linux and procps-ns.

Is it possible to use TLS with RDMA?

As far as I can see, we can't use TLS with RDMA currently. I read document of openssl Abstract Record Layer, TLS with RDMA is workable in theory. But it would be amount of work.

pizhenwei avatar May 27 '24 02:05 pizhenwei

@pizhenwei Thanks for your contribution and @hz-cheng Thanks for your perfect number.
First, I need to say I like this feature. But I have 3 curious points here

  1. In the RDMA.md, you mentioned that Valkey Over RDMA is only supported by Linux, I am not sure if it is supported by Centos and MacOS? Or you mean it is not supported by Windows?

  2. Is there possible to integrate with core codes directly instead of working as a module in Valkey? any risky or difficulty?

  3. You mention both RDMA and TCP enable at the same time? Is there any benefit for it? it means some specific clients or replica nodes connect to RDMA port? Could you please describe a little bit more? Thanks

Let's core team member discuss this important feature, and send you feedback ASAP, Thanks

hwware avatar May 28 '24 18:05 hwware

@pizhenwei Thanks for your contribution and @hz-cheng Thanks for your perfect number. First, I need to say I like this feature. But I have 3 curious points here

  1. In the RDMA.md, you mentioned that Valkey Over RDMA is only supported by Linux, I am not sure if it is supported by Centos and MacOS? Or you mean it is not supported by Windows?

There are two parts of this PR:

  • Valkey Over RDMA protocol. This defines the transmission type RC(like TCP), communication commands, and payload exchange mechanism. This depends on RDMA(aka Infiniband) specification only, but OS and hardware independent.
  • The implement of Linux. I developed and tested this feature on Ubuntu 2004/2204 and Debian 9/10. I guess @hz-cheng tested this on a newer version of Linux distribution because erdma got supported recently. I don't test it on CentOS/RHEL/Suse, but I believe it should work fine if hardware driver is ready.

I have no experience on windows RDMA, I read document and found that windows does support RDMA, but not Linux style Verbs API. This means that we need a windows version in the future. (I imagine rdma-windows.c is needed).

  1. Is there possible to integrate with core codes directly instead of working as a module in Valkey? any risky or difficulty?

It's quite easy to build RDMA support into Valkey with a few lines change. If so, the valkey-server has to link libibvers.so and librdmacm.so.

Let's look at the dynamic shared libraries of module version:

# ldd valkey-server 
	linux-vdso.so.1 (0x00007ffdbc546000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f41f0fa1000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f41f0800000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f41f10a8000)
# ldd valkey-rdma.so 
	linux-vdso.so.1 (0x00007fff83bbb000)
	librdmacm.so.1 => /lib/x86_64-linux-gnu/librdmacm.so.1 (0x00007f7e7c29e000)
	libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007f7e7c27b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7e7c000000)
	libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f7e7c258000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7e7c2f5000)
	libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f7e7bf7d000)

If a user starts valkey-server with rdma module, valkey-server loads the additional shared libraries on demand.

If building RDMA into valkey is necessary, please let me know.

  1. You mention both RDMA and TCP enable at the same time? Is there any benefit for it? it means some specific clients or replica nodes connect to RDMA port? Could you please describe a little bit more? Thanks

Let's core team member discuss this important feature, and send you feedback ASAP, Thanks

Currently, the valkey-server support 3 connection type:

  • unix domain socket
  • TCP/IP
  • TLS over TCP/IP (can't use the same port as TCP/IP)

Run valkey-server by command: ./valkey-server --unixsocket /run/valkey.sock --port 6379 --tls-port 6380 ... then valkey-server listens on /run/valkey.sock , TCP/IP port 6379 & 6380 together. The 3 transports could be used together in theory. However, in the production env, we usually prefer TCP/IP in a trusted env for the better performance, or TLS in an untrusted for security.

Once loading RDMA:./valkey-server --unixsocket /run/valkey.sock --port 6379 --tls-port 6380 ... --loadmodule valkey-rdma.so port=6379 ... then valkey-server listens on /run/valkey.sock, TCP/IP port 6379 & 6380, and RDMA port 6379 together. The 4 transports could be used together in theory.

The RDMA has better performance in a good network env like @hz-cheng's and my test report, but I tested mlx5 with packet drop rate 0.001, TCP performance affects a few, but RDMA performance drops a lot. I imagine a topo like:

DC: data center

valkey-client   valkey-client
         |            |
        TCP          RDMA
         |            |
         +----- valkey-server  -----TCP-----  valkey-server(replica)

  DC A            DC B                              DC C

It's possible to use RDMA within a short distance, or TCP over a long distance.

pizhenwei avatar May 29 '24 02:05 pizhenwei

I'd like to get this merged. It's a good contribution. I like that it's a module. When it's merged, we can let clients implement it and test it.

Actually, the client side(for C only) is ready(as you see, several guys and me have already got the test report). Once the server side gets merged, I'll create PR for client as soon as possible.

The RDMA port is neither a TCP port nor an UDP port?

Right.

We've been talking about the possibility of adding QUIC in the future (optional dependency, maybe as a module too) and that can be on the same port too, right, since it's UDP?

Right.

pizhenwei avatar May 29 '24 07:05 pizhenwei

The RDMA has better performance in a good network env like @hz-cheng's and my test report, but I tested mlx5 with packet drop rate 0.001, TCP performance affects a few, but RDMA performance drops a lot. I imagine a topo like:

@pizhenwei Actually the latest RDMA technology such as Alibaba Cloud Elastic RDMA doesn't encounter this performance drop when packet drop rate 0.001,because the latest RDMA technology widely supports SACK lossy optimization.

coderyanghang avatar May 30 '24 02:05 coderyanghang

The RDMA has better performance in a good network env like @hz-cheng's and my test report, but I tested mlx5 with packet drop rate 0.001, TCP performance affects a few, but RDMA performance drops a lot. I imagine a topo like:

@pizhenwei Actually the latest RDMA technology such as Alibaba Cloud Elastic RDMA doesn't encounter this performance drop when packet drop rate 0.001,because the latest RDMA technology widely supports SACK lossy optimization.

Hi, As far as I know, SACK is not part of IB specification, and the stock of hardwares don't support SACK. I don't think the valkey-server should be limited to deploy on the latest hardware only.

Let's focus on 'why does Valkey need to enable both TCP/IP and RDMA together' or 'enabling both TCP/IP and RDMA is useful or not in the real scenario', but not extend the topic to 'the latest RDMA technology' here.

pizhenwei avatar May 30 '24 03:05 pizhenwei

Hi @zuiderkwast ,

I create a new PR for the document part, and force pushed a new version here.

pizhenwei avatar May 30 '24 03:05 pizhenwei

@zuiderkwast I commited this change as your suggestion on github web page, then got a 'DOC' warning. So I force pushed again.

pizhenwei avatar Jun 19 '24 01:06 pizhenwei

@pizhenwei I just learned more RDMA knowledge so far, and I have the following questions and requests for this PR:

  1. Currently, there are three technologies that support RDMA: InfiniBand, Ethernet RoCE and Ethernet iWARP (Reference from NVDIA DOCS).
    InfiniBand network facilitates data transfer through InfiniBand adapters or switches. The key parts are Subnet Manager (SM), InfiniBand network cards, InfiniBand switches, and InfiniBand cables.
    RoCe and IWARP implementations are based on Ethernet and different kinds of network card and switches.

My question is: If user wants to test your RDMA over Valkey, Should user have the InfiniBand network hardware support (Since I saw you mentioned IB in the PR) OR this PR works as an API and supports all these 3 technologies?

image

  1. Can you update this feature work as the following scenario:

    If client build as make BUILD_RDMA=yes, then RDMA works as one build-in feature, and user does not need to run the --loadmodule, its bind address and rdma-port works same as Valkey conf file.

    If client just build as "make" command simply, then RDMA will be disable in Valkey

    The reason why we prefer this way is that we would like to allow users to choose RDMA or not.

  2. If RDMA can bind a different IP address, I think you can add one extra parameter in valkey.conf rdma-ip

  3. I notice your PR README.md file, The line 172 and 182, it is "--loadmodule src/valkey-rdma.so bind=192.168.122.100 port=6379", and The line 176, you write "10.2.16.101:6379> CONFIG SET rdma-port 6380"

    I am confused. Can I understand the port=6379 and rdma-port 6380 are the same meaning, they are rdma-port So line 172 and 182, it should be "--loadmodule src/valkey-rdma.so bind=192.168.122.100 rdma-port=6379"?

    Please connect me if I am wrong, Thanks.

  4. I suggest you add another one argument in Valkey.conf: rdma-port 6379

  5. Is there a way to add a test for this PR?, i mean adding a tcl file.

Thanks

hwware avatar Jun 26 '24 18:06 hwware

@pizhenwei I just learned more RDMA knowledge so far, and I have the following questions and requests for this PR:

  1. Currently, there are three technologies that support RDMA: InfiniBand, Ethernet RoCE and Ethernet iWARP (Reference from NVDIA DOCS). InfiniBand network facilitates data transfer through InfiniBand adapters or switches. The key parts are Subnet Manager (SM), InfiniBand network cards, InfiniBand switches, and InfiniBand cables. RoCe and IWARP implementations are based on Ethernet and different kinds of network card and switches.

My question is: If user wants to test your RDMA over Valkey, Should user have the InfiniBand network hardware support (Since I saw you mentioned IB in the PR) OR this PR works as an API and supports all these 3 technologies?

Right, RDMA supports InfiniBand, RoCE v1, RoCE v2 and iWARP as underlaying transport layer. rdma-core has common abstract API, we don't need to care the underlaying hardware. In fact, I tested it on RoCE v2 hardware, and Alibaba Could instance supports iWARP(see test report).

  1. Can you update this feature work as the following scenario: If client build as make BUILD_RDMA=yes, then RDMA works as one build-in feature, and user does not need to run the --loadmodule, its bind address and rdma-port works same as Valkey conf file. If client just build as "make" command simply, then RDMA will be disable in Valkey The reason why we prefer this way is that we would like to allow users to choose RDMA or not.
  1. If RDMA can bind a different IP address, I think you can add one extra parameter in valkey.conf rdma-ip

What about rdma.port? Because for module mode, we have already used CONFIG Set rdma.port 6380 to rebind dynamically.

  1. I notice your PR README.md file, The line 172 and 182, it is "--loadmodule src/valkey-rdma.so bind=192.168.122.100 port=6379", and The line 176, you write "10.2.16.101:6379> CONFIG SET rdma-port 6380" I am confused. Can I understand the port=6379 and rdma-port 6380 are the same meaning, they are rdma-port So line 172 and 182, it should be "--loadmodule src/valkey-rdma.so bind=192.168.122.100 rdma-port=6379"? Please connect me if I am wrong, Thanks.

Sorry, 10.2.16.101:6379> CONFIG SET rdma-port 6380 is a mistake, it should be 192.168.122.100:6379> CONFIG SET rdma.port 6380

For example:

./src/valkey-server \
                 --port 6379 --bind=aa.aa.aa.aa \
                 --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy

--port 6379 --bind=aa.aa.aa.aa applies TCP only. --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy loads a module (src/valkey-rdma.so) with module parameters (port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy), so it applies RDMA only. Then we have runtime config rdma.bind and rdma.port to rebind.

  1. I suggest you add another one argument in Valkey.conf: rdma-port 6379

loadmodule src/valkey-rdma.so port=6379 bind=* is supported in valkey.conf under the current version.

According to 2, 3 and 4, for RDMA module mode, rdma.bind and rdma.port are used. For RDMA builtin mode, rdma-port and rdma-bind are used, the two style may confuse users, also there is no need to expose RDMA is supported by builtin or module.

By the way, to support rdma.bind in valkey.conf:

  • add rdma_port in struct valkeyServer , and add rdma.bind bind and related handler functions (like rdmaApplyListener) into config.c, like TLS. Frankly, this looks ugly.
  • export struct standardConfig and int registerConfigValue(const char *name, const standardConfig *config, int alias), then register rdma.bind and rdma.port in rdma.c

Do you have any suggestions?

  1. Is there a way to add a test for this PR?, i mean adding a tcl file.

It's easy to setup a RDMA interface by software RoCE(linux supports RXE), and launch a valkey-server with RDMA port listening. However, TCL has no support for RDMA, so we have no TCL RDMA client to test. Once the valkey-server supports RDMA, I'll support hiredis/valkey-cli/valkey-benchmark RDMA ASAP. Then we can auto-test RDMA by valkey-server and valkey-cli, valkey-benchmark together.

Thanks

pizhenwei avatar Jun 27 '24 03:06 pizhenwei

@pizhenwei I just learned more RDMA knowledge so far, and I have the following questions and requests for this PR:

  1. Currently, there are three technologies that support RDMA: InfiniBand, Ethernet RoCE and Ethernet iWARP (Reference from NVDIA DOCS). InfiniBand network facilitates data transfer through InfiniBand adapters or switches. The key parts are Subnet Manager (SM), InfiniBand network cards, InfiniBand switches, and InfiniBand cables. RoCe and IWARP implementations are based on Ethernet and different kinds of network card and switches.

My question is: If user wants to test your RDMA over Valkey, Should user have the InfiniBand network hardware support (Since I saw you mentioned IB in the PR) OR this PR works as an API and supports all these 3 technologies?

Right, RDMA supports InfiniBand, RoCE v1, RoCE v2 and iWARP as underlaying transport layer. rdma-core has common abstract API, we don't need to care the underlaying hardware. In fact, I tested it on RoCE v2 hardware, and Alibaba Could instance supports iWARP(see test report).

  1. Can you update this feature work as the following scenario: If client build as make BUILD_RDMA=yes, then RDMA works as one build-in feature, and user does not need to run the --loadmodule, its bind address and rdma-port works same as Valkey conf file. If client just build as "make" command simply, then RDMA will be disable in Valkey The reason why we prefer this way is that we would like to allow users to choose RDMA or not.
  1. If RDMA can bind a different IP address, I think you can add one extra parameter in valkey.conf rdma-ip

What about rdma.port? Because for module mode, we have already used CONFIG Set rdma.port 6380 to rebind dynamically.

  1. I notice your PR README.md file, The line 172 and 182, it is "--loadmodule src/valkey-rdma.so bind=192.168.122.100 port=6379", and The line 176, you write "10.2.16.101:6379> CONFIG SET rdma-port 6380" I am confused. Can I understand the port=6379 and rdma-port 6380 are the same meaning, they are rdma-port So line 172 and 182, it should be "--loadmodule src/valkey-rdma.so bind=192.168.122.100 rdma-port=6379"? Please connect me if I am wrong, Thanks.

Sorry, 10.2.16.101:6379> CONFIG SET rdma-port 6380 is a mistake, it should be 192.168.122.100:6379> CONFIG SET rdma.port 6380

For example:

./src/valkey-server \
                 --port 6379 --bind=aa.aa.aa.aa \
                 --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy

--port 6379 --bind=aa.aa.aa.aa applies TCP only. --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy loads a module (src/valkey-rdma.so) with module parameters (port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy), so it applies RDMA only. Then we have runtime config rdma.bind and rdma.port to rebind.

  1. I suggest you add another one argument in Valkey.conf: rdma-port 6379

loadmodule src/valkey-rdma.so port=6379 bind=* is supported in valkey.conf under the current version.

According to 2, 3 and 4, for RDMA module mode, rdma.bind and rdma.port are used. For RDMA builtin mode, rdma-port and rdma-bind are used, the two style may confuse users, also there is no need to expose RDMA is supported by builtin or module.

By the way, to support rdma.bind in valkey.conf:

  • add rdma_port in struct valkeyServer , and add rdma.bind bind and related handler functions (like rdmaApplyListener) into config.c, like TLS. Frankly, this looks ugly.
  • export struct standardConfig and int registerConfigValue(const char *name, const standardConfig *config, int alias), then register rdma.bind and rdma.port in rdma.c

Do you have any suggestions?

  1. Is there a way to add a test for this PR?, i mean adding a tcl file.

It's easy to setup a RDMA interface by software RoCE(linux supports RXE), and launch a valkey-server with RDMA port listening. However, TCL has no support for RDMA, so we have no TCL RDMA client to test. Once the valkey-server supports RDMA, I'll support hiredis/valkey-cli/valkey-benchmark RDMA ASAP. Then we can auto-test RDMA by valkey-server and valkey-cli, valkey-benchmark together.

Thanks

Thanks for your quick reply.
For the question 2,3,4,5, the challenge is in Valkey.conf, all the parameter formats are "XXXX-XXX-XXX value", there is no parameter like XXXX.XXX, we do not want to break this style. So Could you please unify the RDMA build-in mode and module mode as format XXXX-XXX-XXX? Thanks

hwware avatar Jun 27 '24 14:06 hwware

@hwware

[snip]

Thanks for your quick reply. For the question 2,3,4,5, the challenge is in Valkey.conf, all the parameter formats are "XXXX-XXX-XXX value", there is no parameter like XXXX.XXX, we do not want to break this style. So Could you please unify the RDMA build-in mode and module mode as format XXXX-XXX-XXX? Thanks

I'm afraid XXXX-XXX-XXX is not supported currently, 'xxx.yyy' is generated by module core. see:

void addModuleStringConfig(const char *module_name, const char *name, int flags, void *privdata, sds default_val) {
    sds config_name = sdscatfmt(sdsempty(), "%s.%s", module_name, name);
    sds config_dummy_address;
    standardConfig module_config =
        createSDSConfig(config_name, NULL, flags | MODULE_CONFIG, 0, config_dummy_address, default_val, NULL, NULL);
    module_config.data.sds.config = NULL;
    module_config.privdata = privdata;
    registerConfigValue(config_name, &module_config, 0);
}

We don't want to break module format rule, right? Module 'rdma' creates config 'bind', then we create both 'rdma.bind' and 'rdma-bind'(as alias), is this acceptable?

pizhenwei avatar Jun 28 '24 02:06 pizhenwei

Wow, it really does a lot of auto detection.

I have only a few minor comments on the script.

I don't know yet which is the best way to run this script, from its own CI job or from within the existing TCL framework or from the C unit tests under src/unit/.

This test does not looks like "unit test", so I prefer to separate this from src/unit/.

pizhenwei avatar Jul 04 '24 11:07 pizhenwei

There seems to be no /sys/class/infiniband/ in GitHub's runners. What can we do? If these machines don't support RDMA, can we run some VM with emulated RDMA?

zuiderkwast avatar Jul 04 '24 13:07 zuiderkwast

In the core team meeting, we decided to accept the feature for Valkey 8. :tada:

We'll mark the feature as "experimental", which means two things:

  • It's not compiled and enabled by default.
  • It can be changed or removed in any minor or major version.

zuiderkwast avatar Jul 09 '24 08:07 zuiderkwast

Can you change this auto-detect in the test script? Magic stuff is a bit complex and dangerous. Users don't want to run a test as root and it changes the kernel modules. :)

I think the test shouldn't run as root.

Can we move the insmod/rmmod and auto-detect to a separate Python script that is used by the GitHub CI?

Hi @zuiderkwast

Because we are almost there, I force pushed this series:

Merge these 3 patches into 1, also modify a little as you suggested:
0001-Introduce-RDMA-transport.patch
0002-Typo-fix.patch
0003-Compat-e4c1f6d45af9-Replace-client-flags-to-bitfield.patch


Add Makefile rather than typing command gcc:
0004-Add-RDMA-test-program.patch

Separate run.py into run.py (just run test without root privilege required) and rdma_env.py (root privilege required), CI scripts also changes a little:
0005-tests-rdma-Add-test-script.patch
0006-tests-rdma-Get-abspath-instead-of-getcwd.patch
0007-CI-support-RDMA-test.patch
0008-tests-rdma-install-rdma_rxe.ko.patch

Could you please take a look at this series? After separating script, the CI machine kernel reports error. I try to enable kernel dynamic tracing, but it does not allow. It may takes more time to find the reason, continue tomorrow.

Thanks a lot for your review suggestions and patience, it's a long and interesting journey!

pizhenwei avatar Jul 09 '24 13:07 pizhenwei

After separating script, the CI machine kernel reports error.

This is weird. Do you think GitHub restores the kernel after each program exits? If that's the case, then I'm fine with running this from just one script, but with some extra option to install the kernel module and run the test, such as tests/rdma/run.py --install-rxe.

zuiderkwast avatar Jul 09 '24 18:07 zuiderkwast

Thanks a lot for your review suggestions and patience, it's a long and interesting journey!

Yes, indeed. Thanks a lot for your patience!

zuiderkwast avatar Jul 09 '24 18:07 zuiderkwast

After separating script, the CI machine kernel reports error.

This is weird. Do you think GitHub restores the kernel after each program exits? If that's the case, then I'm fine with running this from just one script, but with some extra option to install the kernel module and run the test, such as tests/rdma/run.py --install-rxe.

This seems works. There are still two scripts run.py and rdma_env.py. Please take a look, thanks a lot!

pizhenwei avatar Jul 10 '24 02:07 pizhenwei

Calling rdma_env.py from run.py looks good. It's a good separation.

I have only some minor comments, mainly that we should write everywhere that it's experimental. Then, (if nobody else has more comments) I think it can be merged. Thanks!

Thanks for your suggestions! Apply all your suggestions. Also separate 0006-tests-rdma-setup-RXE-in-.-runtest-rdma-install-rxe.patch into the previous two, then these patches get organized in order.

pizhenwei avatar Jul 12 '24 02:07 pizhenwei