
Make buildah's handling of ulimits match podman

Open chris-reeves opened this pull request 1 year ago • 21 comments


What type of PR is this?

/kind bug

What this PR does / why we need it:

There was a bug in #5275 which assumed that non-rootless environments always had the ability to increase limits past the hard limit. This is not always the case. For example:

[2/2] STEP 8/13: RUN addgroup -g 1000 -S app && adduser -u 1000 -h /app -S -G app app
Error: building at STEP "RUN addgroup -g 1000 -S app && adduser -u 1000 -h /app -S -G app app": setting "RLIMIT_NPROC" limit to soft=104857600,hard=104857600 (was soft=31723,hard=31723): operation not permitted

This was resolved by more accurately reflecting podman's handling of ulimits.
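
For context, podman's handling boils down to "only ask for what the kernel will actually allow": read the current limits, try to raise them, and fall back to the existing hard limit if the increase is refused. A minimal sketch of that pattern (simplified, hypothetical code using golang.org/x/sys/unix directly, not the actual podman or buildah source) looks roughly like this:

```go
// Hypothetical sketch of the podman-style behaviour this PR mirrors; not the
// actual buildah code. Assumes Linux and golang.org/x/sys/unix.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// bumpNproc tries to raise RLIMIT_NPROC to `want`, falling back to the
// current hard limit when the process isn't allowed to go past it.
func bumpNproc(want uint64) error {
	var cur unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NPROC, &cur); err != nil {
		return err
	}
	desired := unix.Rlimit{Cur: want, Max: want}
	if err := unix.Setrlimit(unix.RLIMIT_NPROC, &desired); err != nil {
		// Raising past the hard limit was refused; clamp to what we have.
		desired = unix.Rlimit{Cur: cur.Max, Max: cur.Max}
		return unix.Setrlimit(unix.RLIMIT_NPROC, &desired)
	}
	return nil
}

func main() {
	if err := bumpNproc(1048576); err != nil {
		fmt.Println("could not raise RLIMIT_NPROC:", err)
	}
}
```

Raising the hard limit itself generally requires CAP_SYS_RESOURCE, which is why root usually gets away with it while an unprivileged (or capability-restricted) process does not.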

How to verify it

The original bug can be replicated by running the test suite (specifically bats tests/run.bats -f limit) in a non-rootless mode where nproc limits are less than RLimitDefaultValue (1048576). For me the easiest way to do this was to run the test as root with RLimitDefaultValue increased to 10485760.

Which issue(s) this PR fixes:

None - no issue has been opened for this bug.

Special notes for your reviewer:

No tests have been added for the scenario that was originally encountered (a non-rootless environment where the user doesn't have the privileges to increase limits past the hard limit), due to the difficulty of manipulating limits and recreating this scenario in a test environment. However, the tests that fail when RLimitDefaultValue is increased to 10485760 do pass with this fix in place.

The initial fix for this didn't account for the fact that privileged users can (usually) increase limits past the hard limit. This wasn't picked up when the tests were run locally, so they have been modified to ensure that this particular failure mode would be identified in future.

The PR does make a call to Setrlimit() without checking the returned error value. This is intentional: podman handles a failure of this call by trying again with the current hard limit, but we effectively do the same thing a little later with the call to AddProcessRlimits(), so another get/set round trip would be redundant.
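
As a rough illustration of what "effectively doing this a little later" means (a simplified, hypothetical sketch rather than the actual diff, assuming golang.org/x/sys/unix and the OCI runtime-tools spec generator):

```go
// Simplified, hypothetical illustration of the best-effort pattern described
// above; not the actual buildah diff.
package main

import (
	"golang.org/x/sys/unix"

	"github.com/opencontainers/runtime-tools/generate"
)

func addNprocLimit(g *generate.Generator, want uint64) {
	// Best effort: try to raise our own limit, but don't treat refusal as
	// fatal. Whatever ends up in effect is read back and put into the spec,
	// which is the same end result podman reaches by retrying with the
	// current hard limit.
	desired := unix.Rlimit{Cur: want, Max: want}
	_ = unix.Setrlimit(unix.RLIMIT_NPROC, &desired)

	var effective unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NPROC, &effective); err == nil {
		g.AddProcessRlimits("RLIMIT_NPROC", effective.Max, effective.Cur)
	}
}

func main() {
	g, err := generate.New("linux")
	if err != nil {
		panic(err)
	}
	addNprocLimit(&g, 1048576)
}
```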

Does this PR introduce a user-facing change?

None

chris-reeves avatar Jun 14 '24 14:06 chris-reeves

Ephemeral COPR build failed. @containers/packit-build please check.

The initial fix for this didn't account for the fact that privileged users can (usually) increase limits past the hard limit. This wasn't picked up when the tests were run locally, so they have been modified to ensure that this particular failure mode would be identified in future.

chris-reeves avatar Jun 17 '24 00:06 chris-reeves

@chris-reeves thanks for the PR! The tests are all very unhappy and it looks like you need to update your branch.

TomSweeneyRedHat avatar Jun 17 '24 19:06 TomSweeneyRedHat

@chris-reeves thanks for the PR! The tests are all very unhappy and it looks like you need to update your branch.

@TomSweeneyRedHat It looks like this was just the linter complaining. I did wonder whether not checking Setrlimit()'s return value would annoy the linter, so I ran make lint locally before committing and it didn't complain. I've tried again and still can't replicate the lint output that Cirrus is returning, so rather than setting lint ignores blindly or seeing whether I can get away with assigning to _, I've taken a fairly blunt approach and now simply write a debug-level log message if the call fails.

Also rebased and signed the commit that I missed.

chris-reeves avatar Jun 17 '24 20:06 chris-reeves

@chris-reeves I kicked off the test again, a quick look seemed to show network hiccups. The branch looks like it needs a rebase too.

TomSweeneyRedHat avatar Jun 20 '24 21:06 TomSweeneyRedHat

@chris-reeves I kicked off the test again, a quick look seemed to show network hiccups. The branch looks like it needs a rebase too.

Could you point me towards these network hiccups? The only two failing builds are the same two that appear to fail for most PRs (and this appears to be an issue with missing build tools on those platforms).

I can rebase again if you like, but it looks like you're taking a branch/merge approach so I'm not sure what it gives us (other than another build to wait on!).

chris-reeves avatar Jun 20 '24 22:06 chris-reeves

LGTM

rhatdan avatar Jun 21 '24 08:06 rhatdan

Thanks @chris-reeves

rhatdan avatar Jun 21 '24 08:06 rhatdan

Anything else I can do to help get this PR merged?

chris-reeves avatar Jul 02 '24 22:07 chris-reeves

@nalind @flouthoc @giuseppe @mheon @Luap99 @giuseppe PTAL

rhatdan avatar Jul 15 '24 15:07 rhatdan

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chris-reeves, giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jul 15 '24 20:07 openshift-ci[bot]

@chris-reeves Are you still working on this?

rhatdan avatar Jul 27 '24 09:07 rhatdan

@chris-reeves Are you still working on this?

Sorry, I missed the previous comment (which I've now responded to). I'm still keen to see this merged as our current workaround is rather hacky.

chris-reeves avatar Jul 27 '24 22:07 chris-reeves

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Aug 27 '24 00:08 github-actions[bot]

@nalind PTAL

rhatdan avatar Aug 27 '24 09:08 rhatdan

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Sep 27 '24 00:09 github-actions[bot]

Could we get this merged in? I'm not sure what the blocker is, but if I can help then do let me know.

chris-reeves avatar Dec 03 '24 12:12 chris-reeves

@nalind are you fine with the last version?

giuseppe avatar Dec 09 '24 10:12 giuseppe

@nalind Do you have any further questions about this, or can we get this merged in?

chris-reeves avatar Feb 03 '25 10:02 chris-reeves

Is there anything I can do to unblock this?

chris-reeves avatar Apr 21 '25 14:04 chris-reeves

Well we need a rebase first, and then we should be able to get it in.

rhatdan avatar Apr 21 '25 14:04 rhatdan

@chris-reeves if you still have any energy for this one, can you rebase please?

TomSweeneyRedHat avatar Nov 18 '25 22:11 TomSweeneyRedHat

@TomSweeneyRedHat Apologies for missing @rhatdan's reply back in April. Rebased as requested. There was a single test failure on one of the integration test runs. The test doesn't look related to this change, and the failure appears to be a timeout waiting for a filesystem to unmount (likely during test cleanup). I don't think I have a way of triggering a re-run of that particular set of tests.

chris-reeves avatar Nov 20 '25 12:11 chris-reeves

Thanks @chris-reeves ! I've hit rerun for the failing test. I suspect it's a CI flake. @nalind PTAL

TomSweeneyRedHat avatar Nov 20 '25 15:11 TomSweeneyRedHat