ucx icon indicating copy to clipboard operation
ucx copied to clipboard

UCP/CORE: Fix for activate count

Open iyastreb opened this issue 1 year ago • 0 comments

There is a bug introduced in recent commit https://github.com/openucx/ucx/pull/9650, which is exposed in 2 ways:

  • assertion failure in debug build in function ucp_worker_iface_deactivate : wiface->activate_count > 0
  • in release mode process hangs Revert of that commit solves the connectivity issue both for daemon CI use case and RFI.

The issue happens on the receiver side, only when there are multiple connections from the same sender. In this case there are multiple wireup_eps created and they reuse the same ep_config. However, increments and decrements of interface reference counter are not consistent:

First sender connect:
  ucp_worker_get_ep_config -> Load new config
    ucp_worker_ep_config_short_init -> Select protocols
      ucp_wiface_process_for_each_lane -> activate_count ++
  ucp_ep_set_cfg_index -> Set config to EP
    ucp_ep_config_activate_worker_ifaces
      ucp_wiface_process_for_each_lane -> activate_count ++
First sender disconnect
  ucp_wireup_process_request -> Handle disconnect event
    ucp_ep_set_cfg_index -> Update config, remove active interfaces
      ucp_wiface_process_for_each_lane -> activate_count --
  ucp_wireup_eps_progress -> Wireup EP gets deleted
    ucp_proxy_ep_replace
      ucp_wireup_ep_t_cleanup
        ucp_wireup_ep_discard_aux_ep
          ucp_worker_iface_unprogress_ep -> activate_count --  

Now the second client connects and reuses the same ep_config:
  ucp_worker_get_ep_config -> Reuse existing config
    No increment happens here, just return <<< Fix here?
  ucp_ep_set_cfg_index -> Set config to EP
    ucp_ep_config_activate_worker_ifaces
      ucp_wiface_process_for_each_lane -> activate_count ++
Second sender disconnect
  ucp_wireup_process_request -> Handle disconnect event
    ucp_ep_set_cfg_index -> Update config, remove active interfaces
      ucp_wiface_process_for_each_lane -> activate_count -- (HERE refcount reaches 0, interface gets deactivated)
  ucp_wireup_eps_progress -> Wireup EP gets deleted
    ucp_proxy_ep_replace
      ucp_wireup_ep_t_cleanup
        ucp_wireup_ep_discard_aux_ep
          ucp_worker_iface_unprogress_ep -> activate_count -- (Failure: activate_count == 0)

As we agreed after internal discussion, the idea is to avoid iface activation during config creation. Instead, we do iface activation on ucp_ep_set_cfg_index. Also add missing iface activation for AUX EP

iyastreb avatar May 06 '24 15:05 iyastreb