Pod network bandwidth usage

Open SilinPavel opened this issue 1 year ago • 2 comments

Background

Let's introduce a mechanism for monitoring, notification, and (if possible) regulation of network usage by pods.

This mechanism should be similar to the one we already have for IDLE RUN monitoring. Network usage data can be obtained from heapster elk.

The monitoring could be implemented as follows:

  • Introduce a set of System Preferences to configure the behaviour (system.pod.bandwidth.limit, system.pod.bandwidth.action, system.pod.bandwidth.action.backoff.period)
  • Introduce a new email notification, HIGH_CONSUMED_NETWORK_BANDWIDTH
  • Based on the new preferences, implement logic similar to idle run monitoring:
    • If system.pod.bandwidth.limit == 0, skip the check
    • If a pod's network consumption stays above system.pod.bandwidth.limit for the configured period of time, notify users by email (based on the email notification settings) and put a label on the run
    • If consumption is still above the limit after system.pod.bandwidth.action.backoff.period, perform the configured action
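
The check above can be sketched as a small decision function. This is only an illustration in Python, not the actual implementation: the `BandwidthPrefs` and `RunState` types, the `evaluate` function, and the returned step names are hypothetical, and the units (bytes per second for the limit, seconds for the backoff period) are assumptions. What happens when consumption drops back under the limit is not specified in this issue, so the sketch simply reports "ok".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BandwidthPrefs:
    limit: float           # system.pod.bandwidth.limit (unit assumed: bytes/s)
    action: str            # system.pod.bandwidth.action
    backoff_period: float  # system.pod.bandwidth.action.backoff.period (assumed: seconds)


@dataclass
class RunState:
    avg_usage: float                     # usage averaged over the configured period
    notified_at: Optional[float] = None  # time the user was notified, if ever


def evaluate(prefs: BandwidthPrefs, run: RunState, now: float) -> str:
    """Return the next step for one run: skip, ok, notify, wait or action."""
    if prefs.limit == 0:
        return "skip"    # the check is disabled
    if run.avg_usage <= prefs.limit:
        return "ok"      # consumption is within the limit
    if run.notified_at is None:
        return "notify"  # send HIGH_CONSUMED_NETWORK_BANDWIDTH and label the run
    if now - run.notified_at >= prefs.backoff_period:
        return "action"  # backoff expired, perform prefs.action
    return "wait"        # already notified, backoff still running
```

A caller would invoke `evaluate` once per active run on each monitoring cycle and record `notified_at` when the result is "notify".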

SilinPavel, Jul 10 '24 15:07

Additional considerations about the action

Let's implement the following approach:

  • New API method POST /run/{id}/network/limit?boundary=<int>

    • This method sets a special tag on the run based on the boundary parameter: NETWORK_LIMIT: <boundary>
    • Only admins should be able to call it and set this label
  • A scheduled daemon on the API that performs the actual limiting

    • The daemon should be active only on the API leader
    • The daemon runs every <system.pod.bandwidth.limit.daemon.timeout> (to reconfigure the daemon schedule, please reuse the Observable mechanism for SystemPreferences, e.g. see AbstractSchedulingManager)
    • If a run is marked with NETWORK_LIMIT: <boundary>, the daemon should call DockerContainerOperationManager, which then executes an ssh command on the target node to actually limit the bandwidth
    • After the limiting script runs successfully, the daemon sets an additional tag NETWORK_LIMIT_<SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX>: <timestamp> as a hint that the limitation was actually applied and when
    • If a run no longer has NETWORK_LIMIT: <boundary>, the daemon should also remove the limitation
  • A script to limit bandwidth on a node

    • It should work similarly to the commit, get_container_size, etc. scripts
    • To bound traffic, it should use https://github.com/magnific0/wondershaper
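
The daemon behaviour described above can be sketched as one reconciliation step per run. This is a rough illustration in Python rather than the real API code: `reconcile`, `wondershaper_args`, the `run_on_node` callback (standing in for DockerContainerOperationManager and the ssh call), and the adapter name are all hypothetical. The `_DATE` suffix is a placeholder for SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX, and removing the timestamp tag once the limit is cleared is an assumption. The wondershaper options used (`-a`, `-d`, `-u`, `-c`) are the ones documented by magnific0/wondershaper; applying the same rate to both directions and passing the boundary through unchanged are also assumptions.

```python
from typing import Callable, Dict, List, Optional

LIMIT_TAG = "NETWORK_LIMIT"
# Placeholder: the real suffix comes from SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX.
APPLIED_TAG = LIMIT_TAG + "_DATE"


def wondershaper_args(adapter: str, boundary: Optional[int]) -> List[str]:
    """Build a wondershaper command: set both directions, or clear the limits."""
    if boundary is None:
        return ["wondershaper", "-c", "-a", adapter]
    rate = str(boundary)
    return ["wondershaper", "-a", adapter, "-d", rate, "-u", rate]


def reconcile(
    tags: Dict[str, str],
    adapter: str,
    now: str,
    run_on_node: Callable[[List[str]], bool],
) -> None:
    """Apply or remove the limit for one run and update its tags in place."""
    wanted = tags.get(LIMIT_TAG)
    applied = APPLIED_TAG in tags
    if wanted is not None and not applied:
        # The run is marked but not limited yet: limit it, then record when.
        if run_on_node(wondershaper_args(adapter, int(wanted))):
            tags[APPLIED_TAG] = now
    elif wanted is None and applied:
        # The mark was removed: lift the limit and drop the timestamp tag.
        if run_on_node(wondershaper_args(adapter, None)):
            del tags[APPLIED_TAG]
```

Tags change only after `run_on_node` reports success, which mirrors the requirement that the timestamp tag is set after a successful run of the limiting script. The sketch does not handle a boundary value that changes while a limit is already applied.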

SilinPavel, Jul 18 '24 14:07

Docs added via https://github.com/epam/cloud-pipeline/pull/3669 and located here.

NShaforostov, Dec 27 '24 18:12