
Enhance TCP Socket handling

Open motorman-ibm opened this issue 3 years ago • 2 comments

More information - Screenshots / Logs / Other output

Follow-up on a Slack discussion with Guy (pasting the relevant parts here).

Basic issue: the TCP stack handling in the current Node.js implementation of NooBaa limits the processing rate of a single NooBaa endpoint to roughly 4 GB/sec. It would be better if a single endpoint could run faster and saturate the physical node; that would eliminate the redundant, independent cache pools that each of the multiple endpoints on a node currently maintains (like the "ls" cache).

  1. Worker threads spend all their time inside the GPFS stack, which is good.

  2. But the main event-loop thread is busy with the TCP stack almost all the time:

$ node ../perf-report-noobaa.js perf-script.traces -t=37894
Options: { traces_file: 'perf-script.traces', verbose: false, tid: '37894' }
[93.9%] TCP
  - [53.9%] copy_user_enhanced_fast_string | copyin | _copy_from_iter_full | tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter | new_sync_write | vfs_write
  - [ 1.6%] skb_release_data | __kfree_skb | tcp_clean_rtx_queue | tcp_ack | tcp_rcv_established | tcp_v4_do_rcv | __release_sock | __sk_flush_backlog | tcp_sendmsg_locked
  - [ 1.6%] nft_do_chain | nft_do_chain_ipv4 | nf_hook_slow | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_receive_skb_internal |
  - [ 1.5%] tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter | new_sync_write | vfs_write | ksys_write | do_syscall_64 | entry_SYSCALL_64_after_hwframe | _
  - [ 1.2%] pskb_expand_head | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_receive_skb_internal | br_pass_frame_up | br_handle_fr
  - [ 1.1%] _raw_spin_lock | sch_direct_xmit | __dev_queue_xmit | ip_finish_output2 | ip_output | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_
  - [ 0.8%] __list_del_entry_valid | get_page_from_freelist | __alloc_pages_nodemask | skb_page_frag_refill | sk_page_frag_refill | tcp_sendmsg_locked | tcp_sendmsg | sock_
  - [ 0.8%] __nf_conntrack_find_get | nf_conntrack_in | nf_hook_slow | br_nf_pre_routing | nf_hook_slow | br_handle_frame | __netif_receive_skb_core | process_backlog | net
  - [ 0.7%] get_page_from_freelist | __alloc_pages_nodemask | skb_page_frag_refill | sk_page_frag_refill | tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter
  - [ 0.7%] fib_table_lookup | __fib_validate_source | fib_validate_source | ip_route_input_slow | ip_route_input_rcu | ip_route_input_noref | ip_rcv_finish | ip_sabotage_i
  - [ 0.7%] __free_pages_ok | skb_release_data | __kfree_skb | tcp_clean_rtx_queue | tcp_ack | tcp_rcv_established | tcp_v4_do_rcv | __release_sock | __sk_flush_backlog | t
  - [ 0.7%] nft_immediate_eval | nft_do_chain | nft_do_chain_ipv4 | nf_hook_slow | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_re
  - [ 0.6%] fib_table_lookup | ip_route_input_slow | ip_route_input_rcu | ip_route_input_noref | ip_rcv_finish | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_sk
  - [ 0.5%] nft_counter_eval | nft_do_chain | nft_do_chain_ipv4 | nf_hook_slow | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_rece
  - [ 0.5%] copy_user_enhanced_fast_string | copyin | _copy_from_iter_full | tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter | do_iter_readv_writev | do_i
[ 2.6%] INFINIBAND
  - [ 1.3%] mlx5e_sq_xmit | mlx5e_xmit | dev_hard_start_xmit | sch_direct_xmit | __dev_queue_xmit | ip_finish_output2 | ip_output | ip_forward | ip_sabotage_in | nf_hook_sl
  - [ 0.3%] mlx5e_xmit | dev_hard_start_xmit | sch_direct_xmit | __dev_queue_xmit | ip_finish_output2 | ip_output | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __
[ 2.4%] NODEJS-V8
  - [ 0.1%] [unknown] | Builtins_AsyncFunctionAwaitResolveClosure | Builtins_PromiseFulfillReactionJob | Builtins_RunMicrotasks | Builtins_JSRunMicrotasksEntry | v8::intern
[ 1.0%] OTHER
  - [ 0.3%] __x86_indirect_thunk_rax | __pthread_disable_asynccancel | [unknown]
[ 0.0%] VFS
  - [ 0.0%] __fsnotify_parent | vfs_write | ksys_write | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __pthread_disable_asynccancel | [unknown]
[ 0.0%] PERF-EVENTS
  - [ 0.0%] native_write_msr | __intel_pmu_enable_all.constprop.25 | event_function | remote_function | flush_smp_call_function_queue | smp_call_function_single_interrupt |

Per our discussion, the Node.js cluster module is not a viable solution: since it forks, it has the same cache-duplication issue as multiple endpoints.

From Guy -> What I'm thinking is that instead of using the cluster module, which hands the entire request over to the worker and therefore cannot reuse the caches, we can selectively do the hand-over on read/write flows once we start streaming the data. Passing an HTTP/TCP socket to a child process is simple (see the Node.js docs: child_process_example_sending_a_socket_object), so we could execute the NSFS read loop and write loop in a worker that offloads the TCP work from the main thread.

This will probably require us to boost the number of worker threads, but that tuning will come from empirical testing after this change is implemented.

motorman-ibm avatar Apr 26 '21 21:04 motorman-ibm

This should move to enhancement. You (as a team) can also choose to close it. My understanding is that you run fast enough per pod on a performance system. If you ever end up having to look at a per-pod performance bottleneck, I think it is better to have left this somewhere, just so we document a possible solution.

motorman-ibm avatar Apr 13 '22 13:04 motorman-ibm