cloud-pipeline
Pod network bandwidth usage
Background
Let's introduce a mechanism for monitoring, notification and (if possible) regulation of network usage by pods.
This mechanism should be similar to the one we have for IDLE RUN monitoring.
Network usage data can be obtained from heapster elk
The monitoring could be implemented in the following way:
- Introduce a set of `System Preferences` to configure the behaviour (system.pod.bandwidth.limit, system.pod.bandwidth.action, system.pod.bandwidth.action.backoff.period)
- Introduce a new email notification `HIGH_CONSUMED_NETWORK_BANDWIDTH`
- Based on the new preferences, implement logic similar to `idle run` monitoring:
  - If `system.pod.bandwidth.limit` == 0, skip the check
  - Notify users (based on the email notification settings) by email if a pod consumes network > `system.pod.bandwidth.limit` for the configured period of time, and put a label on the run
  - If the consumption is still in place after `system.pod.bandwidth.action.backoff.period`, perform an action
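The check described above can be sketched roughly as follows. This is only an illustration of the decision flow, not the actual implementation: the names `BandwidthPrefs`, `Decision` and `decide` are hypothetical, the usage value is assumed to be already aggregated over the configured period, and the backoff period is assumed to be a time span counted from the first notification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Decision(Enum):
    SKIP = "skip"      # check disabled or consumption within the limit
    NOTIFY = "notify"  # send HIGH_CONSUMED_NETWORK_BANDWIDTH and label the run
    WAIT = "wait"      # already notified, backoff period not elapsed yet
    ACT = "act"        # backoff elapsed, perform system.pod.bandwidth.action


@dataclass
class BandwidthPrefs:
    """Hypothetical container for the three proposed preferences."""
    limit: int                  # system.pod.bandwidth.limit (0 disables the check)
    action: str                 # system.pod.bandwidth.action
    backoff_period: timedelta   # system.pod.bandwidth.action.backoff.period


def decide(prefs: BandwidthPrefs,
           usage: int,
           notified_at: Optional[datetime],
           now: datetime) -> Decision:
    """Decide what to do for one run.

    usage       -- network consumption over the configured period (assumed
                   to be in the same units as prefs.limit)
    notified_at -- when the run was first notified/labelled, or None
    """
    if prefs.limit == 0:
        return Decision.SKIP
    if usage <= prefs.limit:
        return Decision.SKIP
    if notified_at is None:
        return Decision.NOTIFY
    if now - notified_at >= prefs.backoff_period:
        return Decision.ACT
    return Decision.WAIT
```

What happens to the label once consumption drops back under the limit is not specified here, so the sketch simply returns `SKIP` in that case.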
Additional consideration about an action
Let's implement the following approach:
- New API method `POST /run/{id}/network/limit?boundary=<int>`
  - This method will set a special tag for a run based on the `boundary` param: `NETWORK_LIMIT: <boundary>`
  - Only Admins should be able to call it and set this label
- Scheduled daemon on the API that will perform the actual limiting
  - Daemon should be active only on the API Leader
  - Daemon will run each `<system.pod.bandwidth.limit.daemon.timeout>` (to reconfigure the daemon start, please reuse the Observable mechanism for SystemPreferences, e.g. see `AbstractSchedulingManager`)
  - If a run is marked as `NETWORK_LIMIT: <boundary>`, the daemon should execute `DockerContainerOperationManager`, which will then execute an ssh command on the target node to actually limit the bandwidth
  - After a successful run of the limiting script, the daemon will set an additional tag `NETWORK_LIMIT_<SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX>: <timestamp>` to give a hint that the limitation was actually performed and when
  - If a run doesn't have `NETWORK_LIMIT: <boundary>` anymore, the daemon should disable the limitation as well
- Script to limit bandwidth on a node
  - Should work similarly to the `commit`, `get_container_size`, etc. scripts
  - To bound traffic, it should use https://github.com/magnific0/wondershaper
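To make the daemon behaviour more concrete, here is a minimal sketch of how a single run could be reconciled on each daemon tick. It rests on several assumptions that are not part of the proposal: the function name `plan_limit` is hypothetical, the network adapter name is passed in by the caller, `<boundary>` is assumed to be in Kbps (the unit wondershaper expects for `-d`/`-u`), the presence of the `NETWORK_LIMIT_<suffix>` tag is used as the "limit is currently applied" marker, and that tag is dropped when the limit is cleared. The date suffix itself is taken as a parameter rather than hardcoded.

```python
from typing import Dict, List, Optional, Tuple

LIMIT_TAG = "NETWORK_LIMIT"


def plan_limit(tags: Dict[str, str],
               date_suffix: str,
               adapter: str,
               now: str) -> Tuple[Optional[List[str]], Dict[str, str]]:
    """Return (command to run on the node, tags to store on success).

    tags        -- current run tags
    date_suffix -- value of SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX
    adapter     -- network adapter to shape (assumption: known to the caller)
    now         -- timestamp string to record once the limit is applied
    The command is None when nothing has to be done for this run.
    """
    applied_tag = f"{LIMIT_TAG}_{date_suffix}"
    new_tags = dict(tags)
    boundary = tags.get(LIMIT_TAG)

    if boundary is not None and applied_tag not in tags:
        # Run is marked for limiting, but the limit was not applied yet.
        rate = str(int(boundary))  # fails loudly on a non-integer boundary
        new_tags[applied_tag] = now
        return ["wondershaper", "-a", adapter, "-d", rate, "-u", rate], new_tags

    if boundary is None and applied_tag in tags:
        # Mark was removed: clear the limit on the adapter.
        del new_tags[applied_tag]
        return ["wondershaper", "-c", "-a", adapter], new_tags

    return None, new_tags
```

The returned tags are meant to be stored only after the command has succeeded on the node. A changed `<boundary>` on an already limited run is not handled in this sketch.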
Docs added via https://github.com/epam/cloud-pipeline/pull/3669 and located here.