image-builder icon indicating copy to clipboard operation
image-builder copied to clipboard

updating dynamic port range for windows to address some collisions

Open marosset opened this issue 2 years ago • 7 comments

Signed-off-by: Mark Rossetti [email protected]

What this PR does / why we need it: This fixes some issues where ports inside the current dynamic range (33000-65535) are being used by Windows components. This can random issues if container workloads are allocated ports that are already in use. Lots of details in https://github.com/kubernetes/kubernetes/issues/107114

Which issue(s) this PR fixes (optional, in fixes #(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes https://github.com/kubernetes/kubernetes/issues/107114

Additional context Add any other context for the reviewers

/sig windows

marosset avatar Aug 11 '22 17:08 marosset

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marosset Once this PR has been reviewed and has the lgtm label, please assign jsturtevant for approval by writing /assign @jsturtevant in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Aug 11 '22 17:08 k8s-ci-robot

@marosset: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-azure-vhds 5b3b6901f623a59dfbeb8f9686d7387c5d0eb9c7 link true /test pull-azure-vhds

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Aug 11 '22 18:08 k8s-ci-robot

@marosset This goss check need to be modified as well

@knabben @jsturtevant any thoughts on the new port ranges?

kkeshavamurthy avatar Aug 23 '22 17:08 kkeshavamurthy

@marosset can you pls address the goss failures, we can get this merged.

kkeshavamurthy avatar Sep 09 '22 16:09 kkeshavamurthy

this should resolve some flakes we are seeing in our upstream tests. FYI @daschott

jsturtevant avatar Sep 09 '22 18:09 jsturtevant

@jsturtevant thanks for the ping here. I was looking back at your concern here: https://github.com/kubernetes/kubernetes/issues/107114#issuecomment-1122654452

This looks great if you are not experiencing port exhaustion. But if you are running out of ports due to heavy traffic load it might also make sense to define exclusion zone 30000 - 34000 and expand the range as needed (e.g. all the way down to 16385). Another alternative is to disable the SNAT, this would also reduce number of ports consumed. But if you are not experiencing any port exhaustion in the e2e test clusters, then please ignore :)

daschott avatar Sep 09 '22 18:09 daschott

@jsturtevant thanks for the ping here. I was looking back at your concern here: kubernetes/kubernetes#107114 (comment)

This looks great if you are not experiencing port exhaustion. But if you are running out of ports due to heavy traffic load it might also make sense to define exclusion zone 30000 - 34000 and expand the range as needed (e.g. all the way down to 16385). Another alternative is to disable the SNAT, this would also reduce number of ports consumed. But if you are not experiencing any port exhaustion in the e2e test clusters, then please ignore :)

From what I've seen (https://github.com/kubernetes/kubernetes/issues/107114#issuecomment-1208116303), defining exclusion zones does not prevent those ports from being used, but rather raise errors if there is an attempt at using them. This is the reason we're seeing An attempt was made to access a socket in a way forbidden by its access permissions errors in the Kubernetes CI signal, requests inside the containers are trying to use ports from the excluded ranges:

netsh int ipv4 show excludedportrange tcp

Protocol tcp Port Exclusion Ranges

Start Port    End Port
----------    --------
      5985        5985
      5986        5986
     33300       33363
     33364       33427
     33487       33550
     33551       33614
     33615       33678
     33679       33742
     47001       47001

* - Administered port exclusions.

IMO, that should have happened in the first place. Updating the port range to 34000 basically excludes ports from being used that are in the excludedportrange above. The only one that would remain would be 47001.

claudiubelu avatar Oct 04 '22 16:10 claudiubelu

/retest

marosset avatar Dec 06 '22 17:12 marosset

@marosset Need to update the goss tests:

https://github.com/kubernetes-sigs/image-builder/blob/1d4b0445b364712c3ab5d3c6f721b808fa0bfa2e/images/capi/packer/goss/goss-command.yaml#L158-L164

jsturtevant avatar Dec 06 '22 18:12 jsturtevant

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marosset Once this PR has been reviewed and has the lgtm label, please assign fabriziopandini for approval by writing /assign @fabriziopandini in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Dec 06 '22 19:12 k8s-ci-robot

Should we add the same port range for UDP (except shorten it for azure due to UDP restriction? https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-faq#what-protocols-can-i-use-within-vnets) Maybe we can follow up in a seperate PR since we don't have UDP and would need to only do this port exclusion for Azure....

jsturtevant avatar Dec 06 '22 20:12 jsturtevant

@jsturtevant thanks for the ping here. I was looking back at your concern here: https://github.com/kubernetes/kubernetes/issues/107114#issuecomment-1122654452

This looks great if you are not experiencing port exhaustion. But if you are running out of ports due to heavy traffic load it might also make sense to define exclusion zone 30000 - 34000 and expand the range as needed (e.g. all the way down to 16385). Another alternative is to disable the SNAT, this would also reduce number of ports consumed. But if you are not experiencing any port exhaustion in the e2e test clusters, then please ignore :)

From what I've seen (https://github.com/kubernetes/kubernetes/issues/107114#issuecomment-1208116303), defining exclusion zones does not prevent those ports from being used, but rather raise errors if there is an attempt at using them. This is the reason we're seeing An attempt was made to access a socket in a way forbidden by its access permissions errors in the Kubernetes CI signal, requests inside the containers are trying to use ports from the excluded ranges:

@daschott Any ideas on why port exclusion doesn't seem to work in our case? It looks like your suggestion of going all the way down to 16385 is what AKS is doing https://github.com/Azure/AgentBaker/blob/baaf54f3b2342d5f41ae76ecd09b082487898063/staging/cse/windows/configfunc.ps1#L116-L133 and I would expect they would see the same thing @claudiubelu is seeing in https://github.com/kubernetes/kubernetes/issues/107114#issuecomment-1208116303

jsturtevant avatar Dec 06 '22 20:12 jsturtevant

@jsturtevant Is the app picking a random port from the ephemeral port range, or is there some mechanism to define a port range in your application?

There is a script aka.ms/networkhealth which can be used to view how many port ranges are available, do your machines have enough ports available? How often does this issue occur?

daschott avatar Dec 07 '22 22:12 daschott

do your machines have enough ports available? How often does this issue occur?

I wouldn't think there is that many connections running in the test environment (@claudiubelu any thoughts on that?). It occurs every 30-40 runs.

jsturtevant avatar Dec 07 '22 22:12 jsturtevant

/lgtm

jsturtevant avatar Dec 12 '22 21:12 jsturtevant

/approve

jsturtevant avatar Dec 12 '22 22:12 jsturtevant

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsturtevant, marosset

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Dec 12 '22 22:12 k8s-ci-robot