
DAOS-16908 object: firewall support

Open jxiong opened this issue 9 months ago • 45 comments

Introduces support for DAOS clients operating behind a firewall. Previously, client-side firewalls had to be disabled as servers initiated connections. This feature provides a robust solution for clients operating in cloud environments.

This feature is optional and can be enabled by setting client_behind_firewall: true in daos_server.yml (disabled by default). When enabled, servers no longer attempt to connect back to clients. Instead, they return a DER_RECONNECT error, which prompts the client to initiate new connections to the involved servers via a ping RPC and then retry the original RPC.
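The reconnect-and-retry flow described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the DAOS implementation: FakeServer, send_with_reconnect, and the string stand-ins for status codes are hypothetical names introduced here, and the real logic lives in the C client and server code.

```python
from dataclasses import dataclass, field

# Symbolic stand-ins for illustration; real DAOS status codes are integers.
DER_RECONNECT = "DER_RECONNECT"
OK = "OK"


@dataclass
class FakeServer:
    """Hypothetical server that never dials back to a client."""
    client_behind_firewall: bool = True
    connected_clients: set = field(default_factory=set)

    def ping(self, client_id):
        # A client-initiated ping creates the connection the server can reuse.
        self.connected_clients.add(client_id)
        return OK

    def handle_rpc(self, client_id, payload):
        if self.client_behind_firewall and client_id not in self.connected_clients:
            # No connection from this client yet: ask it to reconnect
            # instead of opening a connection towards it.
            return DER_RECONNECT
        return OK


def send_with_reconnect(client_id, servers, payload, max_retries=3):
    """Send to all involved servers; on DER_RECONNECT, ping and retry."""
    for _ in range(max_retries):
        results = [s.handle_rpc(client_id, payload) for s in servers]
        if DER_RECONNECT not in results:
            return OK
        # Ping only the servers that asked for a reconnect, then retry.
        for server, result in zip(servers, results):
            if result == DER_RECONNECT:
                server.ping(client_id)
    return DER_RECONNECT
```

In this model the first attempt fails on every server that has no client-initiated connection, the client pings those servers, and the second attempt succeeds. The retry bound (max_retries) is an assumption of the sketch, not something the description specifies.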

Leverages recent enhancements in underlying communication libraries:

  • libfabric: PRs # 10912, # 10922, # 10949, # 10955
  • Mercury: PRs # 787, # 793

Test performed:

compatibility and function tests

  • the test involves 3 server instances:
    • "Old Server": instance without this PR
    • "New Server w/ firewall": instance with this PR and client_behind_firewall set to true in daos_server.yml
    • "New Server w/o firewall": instance with this PR and client_behind_firewall set to false in daos_server.yml
  • meanwhile, we started 2 clients:
    • "Old Client": client doesn't have this PR
    • "New Client": client with this PR
  • test is performed by simple fio via dfuse mountpoint
  • the test results are as follows:
+-------------------------+------------+------------+
|                         | Old Client | New Client | 
+-------------------------+------------+------------+
| Old Server              | passed     | passed     | 
| New Server w/  firewall | passed     | passed     | 
| New Server w/o firewall | passed     | passed     | 
+-------------------------+------------+------------+

In these tests, we verified that every fio run finished successfully. With the combination of "New Client" and "New Server w/ firewall", servers do not initiate any connection requests to the client. To confirm this, the iptables rule iptables -A INPUT -s <server_ip> -p tcp --syn -j DROP was applied on the client to block any TCP SYN packet from the servers.
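When several servers are involved, the same rule has to be repeated once per server address. A minimal sketch of generating those rules follows; syn_drop_rule is a hypothetical helper introduced here, it only builds the argument list and does not execute anything, and it assumes IPv4 addresses since the rule above uses iptables.

```python
import ipaddress


def syn_drop_rule(server_ip):
    """Build the argv for the SYN-drop rule quoted above, for one server."""
    # Raises ValueError if server_ip is not a well-formed IPv4 address.
    ipaddress.IPv4Address(server_ip)
    return ["iptables", "-A", "INPUT", "-s", server_ip,
            "-p", "tcp", "--syn", "-j", "DROP"]


def syn_drop_rules(server_ips):
    """One rule per server address, in the order given."""
    return [syn_drop_rule(ip) for ip in server_ips]
```

The resulting lists can be passed to subprocess.run by whoever sets up the test client. That step is left out here because it needs root privileges and changes the host firewall.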

  • we also ran a compatibility test with daos_server.yml to make sure the new software accepts an old version of daos_server.yml and that client_behind_firewall is indeed set to false by default. However, an old version of the software will fail with an error if daos_server.yml contains the client_behind_firewall setting, because it does not recognize the option.
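The configuration compatibility behaviour described above can be illustrated with a small sketch. It assumes strict parsing that rejects unknown keys, which matches the reported behaviour of old software. load_config and the KNOWN_KEYS_* sets are hypothetical, and the key sets are a tiny illustrative subset rather than the real daos_server.yml schema.

```python
# Illustrative subsets only; the real schema has many more keys.
KNOWN_KEYS_OLD = {"name", "port"}
KNOWN_KEYS_NEW = KNOWN_KEYS_OLD | {"client_behind_firewall"}


def load_config(raw, known_keys):
    """Validate a parsed config mapping against the keys a given version knows."""
    unknown = set(raw) - known_keys
    if unknown:
        # Old software hits this path when it sees client_behind_firewall.
        raise ValueError("unknown config keys: %s" % sorted(unknown))
    cfg = dict(raw)
    if "client_behind_firewall" in known_keys:
        # New software: the option defaults to false when absent.
        cfg.setdefault("client_behind_firewall", False)
    return cfg
```

For example, load_config({"name": "daos_server"}, KNOWN_KEYS_NEW) yields a config with client_behind_firewall set to False, while passing a mapping that contains client_behind_firewall together with KNOWN_KEYS_OLD raises ValueError.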

Signed-off-by: Jinshan Xiong [email protected]
Signed-off-by: Yokesh Jayakumar [email protected]
Signed-off-by: Jeff Olivier [email protected]

Steps for the author:

  • [x] Commit message follows the guidelines.
  • [x] Appropriate Features or Test-tag pragmas were used.
  • [x] Appropriate Functional Test Stages were run.
  • [x] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [x] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [x] Gatekeeper requested (daos-gatekeeper added as a reviewer).

jxiong avatar May 23 '25 22:05 jxiong

Ticket title is 'Modify DAOS to use new mercury changes to implement improved firewall handling' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-16908

github-actions[bot] avatar May 23 '25 22:05 github-actions[bot]

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/2/execution/node/1427/log

daosbuild3 avatar May 24 '25 09:05 daosbuild3

Introduces support for DAOS clients operating behind a firewall. Previously, client-side firewalls had to be disabled as servers initiated connections. This feature provides a robust solution for clients operating in cloud environments.

This feature can be optionally enabled by setting disable_client_firewall_mode: true in daos_server.yml (disabled by default). When enabled, servers will no longer attempt to connect back to clients. Instead, they return a DER_RECONNECT error, prompting clients to initiate new connections to involved servers via ping RPC, and then retry the original RPC.

Leverages recent enhancements in underlying communication libraries:

  • libfabric: PRs # 10912, # 10922, # 10949, # 10955
  • Mercury: PRs # 787, # 793

Signed-off-by: Jinshan Xiong [email protected]
Signed-off-by: Yokesh Jayakumar [email protected]
Signed-off-by: Jeff Olivier [email protected]

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

I think we should change the flag name. I find disable_client_firewall_mode to be incredibly difficult for my brain to wrap around.

What about

assume_firewall: true

jolivier23 avatar May 28 '25 18:05 jolivier23

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/4/execution/node/1468/log

daosbuild3 avatar May 30 '25 17:05 daosbuild3

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/6/testReport/

daosbuild3 avatar Jul 08 '25 18:07 daosbuild3

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/9/testReport/

daosbuild3 avatar Jul 09 '25 20:07 daosbuild3

Still don't love the different internal/external names for this feature, but I'll live with it. (are we sure "client_firewall" wouldn't suffice as a happy medium?)

Thanks for the suggestion. I always prefer short names where possible. However, client_firewall does not seem to be the right term. client_behind_firewall was chosen carefully, and I believe it is unambiguous when a user sees it in the configuration file. The feature originally had a very confusing name, which is why the user-facing name differs from the one used in the implementation.

jxiong avatar Jul 23 '25 21:07 jxiong

This is failing build stages, possibly due to recent CI issues, so I restarted Jenkins testing

daltonbohning avatar Jul 24 '25 19:07 daltonbohning

@jxiong Looks like CI failed again. Can you merge the latest master and repush please?

kjacque avatar Jul 24 '25 23:07 kjacque

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/13/testReport/

daosbuild3 avatar Jul 25 '25 05:07 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/13/testReport/

daosbuild3 avatar Jul 25 '25 06:07 daosbuild3

This has not finished CI testing so cannot be merged yet

daltonbohning avatar Jul 25 '25 20:07 daltonbohning

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/14/testReport/

daosbuild3 avatar Jul 30 '25 02:07 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16429/15/testReport/

daosbuild3 avatar Jul 31 '25 00:07 daosbuild3

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect

daosbuild3 avatar Aug 05 '25 01:08 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect

daosbuild3 avatar Aug 05 '25 02:08 daosbuild3

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect

daosbuild3 avatar Aug 05 '25 02:08 daosbuild3

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16429/16/display/redirect

daosbuild3 avatar Aug 05 '25 02:08 daosbuild3

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/341/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/393/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/395/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/398/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/370/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/428/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/18/execution/node/376/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/19/execution/node/294/log

daosbuild3 avatar Sep 10 '25 14:09 daosbuild3

Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/371/log

daosbuild3 avatar Sep 10 '25 15:09 daosbuild3

Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/360/log

daosbuild3 avatar Sep 10 '25 15:09 daosbuild3

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/395/log

daosbuild3 avatar Sep 10 '25 15:09 daosbuild3

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16429/20/execution/node/345/log

daosbuild3 avatar Sep 10 '25 15:09 daosbuild3