DAOS-16908 object: firewall support
Introduces support for DAOS clients operating behind a firewall. Previously, client-side firewalls had to be disabled as servers initiated connections. This feature provides a robust solution for clients operating in cloud environments.
This feature can be optionally enabled by setting client_behind_firewall: true in daos_server.yaml (disabled by default). When enabled, servers will no longer attempt to connect back to clients. Instead, they return a DER_RECONNECT error, prompting clients to initiate new connections to involved servers via ping RPC, and then retry the original RPC.
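The reconnect-and-retry flow described above can be sketched as a small, self-contained model. This is only an illustration of the described behaviour, not DAOS code: the `Server` class, `send_with_reconnect`, the `max_retries` parameter, and the string used for `DER_RECONNECT` are all hypothetical names chosen for this sketch, and the legacy connect-back path is reduced to a single line.

```python
# Hypothetical model of the reconnect-and-retry flow; none of these
# names come from the DAOS code base.
DER_RECONNECT = "DER_RECONNECT"  # placeholder for the real error code


class Server:
    def __init__(self, client_behind_firewall):
        self.client_behind_firewall = client_behind_firewall
        self.connected_clients = set()

    def ping(self, client_id):
        # A client-initiated ping establishes the connection.
        self.connected_clients.add(client_id)

    def handle(self, client_id, payload):
        if client_id not in self.connected_clients:
            if self.client_behind_firewall:
                # Never connect back; ask the client to reconnect instead.
                return DER_RECONNECT
            # Legacy behaviour: the server connects back to the client.
            self.connected_clients.add(client_id)
        return f"done:{payload}"


def send_with_reconnect(client_id, target, involved_servers, payload,
                        max_retries=1):
    """Send an RPC; on DER_RECONNECT, ping involved servers and retry."""
    for _ in range(max_retries + 1):
        reply = target.handle(client_id, payload)
        if reply != DER_RECONNECT:
            return reply
        for server in involved_servers:
            server.ping(client_id)
    raise RuntimeError("still not connected after retries")
```

In this model, a first call against a server with `client_behind_firewall=True` yields `DER_RECONNECT`, and `send_with_reconnect` succeeds on the second attempt after the ping.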
Leverages recent enhancements in underlying communication libraries:
- libfabric: PRs # 10912, # 10922, # 10949, # 10955
- Mercury: PRs # 787, # 793
Test performed:
compatibility and function tests
- the test involves 3 server instances:
  - "Old Server": instance without this PR
  - "New Server w/ firewall": instance with this PR and `client_behind_firewall` set to `true` in daos_server.yml
  - "New Server w/o firewall": instance with this PR and `client_behind_firewall` set to `false` in daos_server.yml
- meanwhile, we started 2 clients:
  - "Old Client": client without this PR
  - "New Client": client with this PR
- the test is performed by running a simple fio job via a dfuse mountpoint
- the test results are as follows:
+-------------------------+------------+------------+
| | Old Client | New Client |
+-------------------------+------------+------------+
| Old Server | passed | passed |
| New Server w/ firewall | passed | passed |
| New Server w/o firewall | passed | passed |
+-------------------------+------------+------------+
In the tests, we made sure that all fio runs finished successfully. In the combination of "New Client" and "New Server w/ firewall", servers do not initiate any connection requests to the client. In particular, the iptables rule `iptables -A INPUT -s <server_ip> -p tcp --syn -j DROP` was applied on the client to block any TCP SYN packets from servers.
- we also ran a compatibility test with daos_server.yml to make sure the new software recognizes an old version of daos_server.yml and that `client_behind_firewall` is indeed set to `false` by default. However, an old version of the software will run into errors if daos_server.yml contains the `client_behind_firewall` setting, because it does not understand the option.
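The compatibility behaviour noted above (new software defaults the option to `false`, old software rejects the unknown key) can be modelled with a strict key check. This is a sketch under assumptions: the real daos_server.yml parser is not shown in this PR text, `load_config` and the key sets are hypothetical, and the file is represented as an already-parsed dictionary with only two illustrative keys.

```python
# Hypothetical strict config loader; not the actual daos_server parser.
KNOWN_KEYS_OLD = {"name", "port"}  # illustrative subset of options
KNOWN_KEYS_NEW = KNOWN_KEYS_OLD | {"client_behind_firewall"}


def load_config(raw, known_keys):
    """Reject unknown keys, then apply the default for the new option."""
    unknown = set(raw) - known_keys
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    cfg = dict(raw)
    if "client_behind_firewall" in known_keys:
        # Disabled by default when the option is absent.
        cfg.setdefault("client_behind_firewall", False)
    return cfg


old_file = {"name": "daos_server", "port": 10001}
new_file = {**old_file, "client_behind_firewall": True}

# New software reading an old file: the option defaults to False.
new_sw_old_file = load_config(old_file, KNOWN_KEYS_NEW)

# Old software reading a new file: the unknown key is rejected.
try:
    load_config(new_file, KNOWN_KEYS_OLD)
    old_sw_new_file_error = None
except ValueError as exc:
    old_sw_new_file_error = str(exc)
```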
Signed-off-by: Jinshan Xiong [email protected]
Signed-off-by: Yokesh Jayakumar [email protected]
Signed-off-by: Jeff Olivier [email protected]
Steps for the author:
- [x] Commit message follows the guidelines.
- [x] Appropriate Features or Test-tag pragmas were used.
- [x] Appropriate Functional Test Stages were run.
- [x] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [x] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [x] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Modify DAOS to use new mercury changes to implement improved firewall handling' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-16908
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/2/execution/node/1427/log
Introduces support for DAOS clients operating behind a firewall. Previously, client-side firewalls had to be disabled as servers initiated connections. This feature provides a robust solution for clients operating in cloud environments.
This feature can be optionally enabled by setting `disable_client_firewall_mode: true` in `daos_server.yaml` (disabled by default). When enabled, servers will no longer attempt to connect back to clients. Instead, they return a `DER_RECONNECT` error, prompting clients to initiate new connections to involved servers via ping RPC, and then retry the original RPC.
Leverages recent enhancements in underlying communication libraries:
- libfabric: PRs # 10912, # 10922, # 10949, # 10955
- Mercury: PRs # 787, # 793
Signed-off-by: Jinshan Xiong [email protected]
Signed-off-by: Yokesh Jayakumar [email protected]
Signed-off-by: Jeff Olivier [email protected]
I think we should change the flag name. I find disable_client_firewall_mode to be incredibly difficult for my brain to wrap around.
What about
assume_firewall: true
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/4/execution/node/1468/log
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/6/testReport/
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/9/testReport/
Still don't love the different internal/external names for this feature, but I'll live with it. (are we sure "client_firewall" wouldn't suffice as a happy medium?)
Thanks for the suggestion. I always prefer short names if possible. However, client_firewall does not seem to be the right term. client_behind_firewall was carefully chosen, and I believe there is no ambiguity when a user sees it in the configuration file. Originally the option came with a very confusing name, which is why the user-facing name differs from the name used in the implementation.
This is failing build stages, possibly due to recent CI issues, so I restarted Jenkins testing
@jxiong Looks like CI failed again. Can you merge the latest master and repush please?
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/13/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/13/testReport/
This has not finished CI testing so cannot be merged yet
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/14/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/15/testReport/
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/341/log
Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/393/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/395/log
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/398/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/370/log
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/428/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/376/log
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/19/execution/node/294/log
Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/371/log
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/360/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/395/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/345/log