install-darwin: fix _nixbld uids for macOS sequoia
Motivation
Starting in macOS 15 Sequoia, macOS daemon UIDs are encroaching on our default UIDs of 301-332. This commit relocates our range up to avoid clashing with the current UIDs of 301-304 and buy us a little time while still leaving headroom for people installing more than 32 users.
It also adopts GID 350 (same as first UID), since @emilazy pointed out that this will keep our build group from showing up in the Users & Groups interface. (See https://github.com/NixOS/nix/pull/10919#issuecomment-2203507850)
Context
- #10892
- #10912
- #4532
- #4531
Priorities and Process
Add :+1: to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.
Is there any chance we’ll be able to automate this transition on upgrade? Getting the word out to every single user that they need to run a script seems painful. And if we’re going to get everyone to run a migration script, is there any chance we can move the builder group to GID 330 at the same time so it stops showing up in System Settings?
I can try and test this in a VM someday soon.
Is there any chance we’ll be able to automate this transition on upgrade? Getting the word out to every single user that they need to run a script seems painful.
That's a Nix-nix question, I guess. I'm not personally aware of any mechanism for running a stateful script when people update Nix and don't see one on nix upgrade-nix --help (not that this would help people not using the new CLI or who only update their nix via nixos/hm/nix-darwin). We'd probably need to build some user migration logic into the daemon (and it would still only help you if people updated to the new Nix and ran that daemon before they updated to Sequoia).
if we’re going to get everyone to run a migration script, is there any chance we can move the builder group to GID 330 at the same time so it stops showing up in System Settings?
I've wondered why people were mentioning the GID, since we haven't needed to change it previously. (I assumed people were just conflating the 30000 GID we set with the 301 I gather the detsys installer sets?)
I'm less sure about migrating it. Inevitably we can change it, but I don't know if doing so will "break" the existing groups. (If it did, we might have to rework the script a little to just remove the users + group and re-add them all?)
Is there a reason that every nix install needs the same UIDs for these users? Furthermore, is there a reason they need to be consecutive? Does the main nix builder code deal with the UIDs directly or look them up from the names?
It appears from the user-reported error in #10912 that it uses the names, in which case the installer would be free to pick whatever UIDs happen to be free on a given OS install when creating them… That would be a simpler solution for the install/upgrade script and it would work everywhere. Trying to pick an empty span is only ever guaranteed to work on a fresh OS install—there's always going to be somebody out there who's got a user in that range created by hand or by some other software.
As far as upgrading goes, perhaps the script should be more nixos/puppet-like (declarative/idempotent) and create the users only if they don't exist.
It would also be super convenient if the code that got the error in #10912 could call the upgrade script and fix it right there, but I don't know if it has the correct permissions at that point in time. It could at least mention your upgrade script or prompt the user to re-install.
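A minimal sketch of the "pick whatever UIDs happen to be free" idea, for illustration only; it is not what the installer does today. It assumes the UIDs already in use arrive on stdin as `name uid` lines (the shape that `dscl . -list /Users UniqueID` prints on macOS). The function name `pick_free_uids` and the range bounds passed to it are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first COUNT unused UIDs in [LOW, HIGH].
# Reads "name uid" lines on stdin, e.g. from: dscl . -list /Users UniqueID
pick_free_uids() {
    low=$1 high=$2 count=$3
    awk -v low="$low" -v high="$high" -v count="$count" '
        { taken[$2] = 1 }    # remember every UID that is already in use
        END {
            found = 0
            for (uid = low; uid <= high && found < count; uid++) {
                if (!(uid in taken)) { print uid; found++ }
            }
            # fail if the range could not supply enough free UIDs
            exit (found < count ? 1 : 0)
        }'
}
```

On a Mac this could be fed with something like `dscl . -list /Users UniqueID | pick_free_uids 351 400 32`. The resulting UIDs are not guaranteed to be consecutive, and the function exits non-zero if the range runs out before `COUNT` UIDs are found.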
Is there a reason that every nix install needs the same UIDs for these users? Furthermore, is there a reason they need to be consecutive? Does the main nix builder code deal with the UIDs directly or look them up from the names?
To the best of my understanding (as someone not very familiar with the codebase for Nix itself), the answer to all of these is: not really.
It appears from the user-reported error in #10912 that it uses the names, in which case the installer would be free to pick whatever UIDs happen to be free on a given OS install when creating them… That would be a simpler solution for the install/upgrade script and it would work everywhere. Trying to pick an empty span is only ever guaranteed to work on a fresh OS install—there's always going to be somebody out there who's got a user in that range created by hand or by some other software.
Indeed. But reworking this means modifying how we install users on all platforms, having to wrestle with edge cases that the current installer's process is too ~dumb to have to worry about (like what to do if the UIDs we want are taken by old nixbld users that may not match the nixbuildN user we were planning to put at that UID, what to do if we run out of valid role UIDs on macOS before placing the requested number of users, etc.), and having to go test it on a broad spectrum of systems to confirm the change.
While this refactor might save us from having to relocate to a new range a few years in the future, it would not fix the underlying problem here--this macOS update's installer currently clobbers our existing users to take the UIDs for its own daemons. This could of course happen to any UID we use on any update to any existing install on macOS--either they'll have to stop doing this without relocating our users, or Nix itself needs to get smart enough to detect the situation and recover from it or suggest remediation steps to the user.
As far as upgrading goes, perhaps the script should be more nixos/puppet-like (declarative/idempotent) and create the users only if they don't exist.
Not sure if you mean this in the context of the migration script, or the installer. If the latter, I broadly agree--but full idempotence is tricky to reach and maintain, and a lot of idempotence-focused work can lead to minimal benefit if/when one or two things block full idempotence (i.e., we do the work and testing, but the installer will still break or bail somewhere and we still have to tell frustrated users to go manually uninstall and reinstall).
I had thoughts and laid down some patterns for getting us here a few years back, but at this point I imagine this is more likely to come from working on the NixOS org's fork of the detsys installer (directly or by contributing to the upstream detsys installer itself). That said, I'll note that macOS eminent-domaining our UIDs and clobbering the users in the process is also causing trouble for their installer. (IIRC it breaks their ability to do an uninstall, for example.)
It would also be super convenient if the code that got the error in #10912 could call the upgrade script and fix it right there, but I don't know if it has the correct permissions at that point in time. It could at least mention your upgrade script or prompt the user to re-install.
I agree, but I think that's a nix-nix question outside of the scope of this PR (and I imagine it would be better if that took the form of a more general user fixup routine instead of having to figure out how to suggest macos version-specific cleanup to only the right users).
I'll also note that--unless Apple changes the updater to be a bit more polite--the cake is mostly-baked here. Even if someone opened a PR to support this in Nix today and there was a release cut by the end of the week, some fraction of Nix's macOS users will not be using that Nix release when they take the Sequoia update (whether that's a beta this summer or the official release this fall).
I’m sorry that I didn’t yet get around to testing the migration; I will try to do so soon.
@abathur How do you feel about trying to land the UID (and preferably GID) changes for new installs only – which has to be done regardless – and we can worry about migration when it becomes clearer if Sequoia is going to implement any kind of migration itself?
No worries :)
- I suppose I can separate the migration script out (hopefully tonight).
- I can't say my expectations are high, but I have been hoping that this report getting escalated within Apple will shake loose some guidance on good UID ranges. (I'll try to remember to follow up on that tonight--it's been 2 weeks since they reported escalating it.)
I'm not opposed to changing the GID (whether here or in a separate PR to ensure both are easy to revert from GH without unrelated regressions), but I'm conservative about fiddling with these and do need some convincing:
- I don't have a spare eligible macbook sitting around this cycle that I can use to readily go try different macOS versions/updates out on to see for myself and confirm the fix.
- I can't recall seeing (nor finding via search) a clear report of the problem the current GID is causing (and demonstration that moving the GID fixes it) here, against the detsys installer, or against the Lix fork of it.
- I'm not aware of any prior art here where someone's demonstrated that moving the GID down won't cause its own trouble. (The detsys and lix installers are both still using 30000 AFAIK.)
If there is an issue, pointing me that way is fine, but if not, could you open one and document what you're seeing there (ideally w/ screenshots)?
emily@yuyuko ~> sudo dseditgroup -o create -r 'test group for emily 1' -i 3000 emilytest1
emily@yuyuko ~> sudo dseditgroup -o create -r 'test group for emily 2' -i 360 emilytest2
(Sonoma 14.5)
I fiddled with a bunch of combinations along these lines and a bunch of different IDs but the results were pretty clear; as I mentioned in https://github.com/NixOS/nix/issues/10892#issuecomment-2169253635 the threshold for groups seems to be 500 for whatever reason, but of course picking one that matches the UID we’ll always take up seems the most conservative choice and Apple do use GID 395–400 and 441 as of Sonoma.
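The observation above can be sketched as a simple filter. This assumes `name gid` lines on stdin (the shape printed by `dscl . -list /Groups PrimaryGroupID`), and it treats 500 as the cutoff only because that is what the testing above suggested; the exact boundary (whether GID 500 itself is shown) is an assumption, and the function name `groups_visible_in_ui` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the groups whose GID is at or above THRESHOLD,
# i.e. the ones that were observed showing up in Users & Groups.
# Reads "name gid" lines on stdin, e.g. from: dscl . -list /Groups PrimaryGroupID
groups_visible_in_ui() {
    threshold=${1:-500}   # 500 is the observed cutoff, not a documented one
    awk -v t="$threshold" '$2 + 0 >= t + 0 { print $1 }'
}
```

Fed the two test groups above plus a group at GID 30000, this would list the 3000 and 30000 groups and omit the one at 360.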
I don’t know if I can prove the negative re: the group ID potentially causing problems, but I can’t personally foresee any problems that we wouldn’t already get with UIDs. I’m open to trying to find the time to test stuff in a VM if you have some proposals for ways to test things, though. From reading the old discussion from when we moved the UIDs, I get the impression we were just too preoccupied with those more pressing issues to think about whether there might be any side‐effects of having a GID outside the system range too.
To clarify, I don't mean that I won't PR the change without meeting that standard of proof--I think your comment here reasonably demonstrates the issue and shows that a lower GID addresses it. I just meant that one reason I'm treading cautiously is that I can't just ~transfer confidence from looking at the other installers frontrunning us on this for days/weeks/months with issues/PRs to document the problem+fix and a lack of subsequent reports I could take as supporting evidence.
Oh yeah I totally understand the conservatism here, don’t worry. I just figure if we’re on Apple’s wild ride for the time being anyway we might as well improve the UX, especially if we do end up having to make everyone do a manual migration. I guess if I find the time to test the Sequoia migration I can make a group in the system range before the upgrade and see what happens to it?
I think the DetSys installer has some kind of A/B testing roll‐out stuff. I don’t know how quickly we could get data on potential GID problems with that though.
Reposting this inline since it’s hidden in the commit comments currently:
Maybe we should make the group name _nixbld too for consistency with the users and the other system groups in the range while we’re at it? Probably doesn’t matter that much, but I seem to recall we renamed the users to get them hidden in some way so it might not hurt to follow suit with the group.
Latest force-push is just a rebase to see if the installer jobs will run.
Installer jobs did run. (For context, they were broken because GH has switched over to arm macs. The upshot is that, while the test-generated installers covered x86_64-darwin, they now cover aarch64-darwin instead. That's pretty great for our case here :)
Installer tests in my fork succeeded; individuals can test with (note that test installers aren't generated for all platforms):
sh <(curl -L https://abathur-nix-install-tests.cachix.org/serve/5ri842sqcg061q1lk2p9zki0s0q1li1r/install) --tarball-url-prefix https://abathur-nix-install-tests.cachix.org/serve
@tomberek This is a pr that should fix that issue.
(FWIW: we learned more about Apple’s recommended UID range for role users, and Determinate Systems have adopted an approach based on that temporarily, so this probably needs a bit more research before we can commit fully to the approach. I’ve been somewhat negligent at getting around to doing VM testing but hope to get around to it soon.)
I just set mine at 2000, does it have to be < 500? I know it'll pollute the UI.
@randallb We don't know. If the underlying problem documented in the issue below still exists, you may get booted into recovery mode on macOS updates.
- https://github.com/NixOS/nix/issues/4531
We know moving to the role user range (200-400) fixed that problem at that time, and I am not personally aware of any user reports that explicitly attest to the absence of that problem on update with uids outside of this range.
What if you do what Linux already does: detect which UIDs/GIDs are in use and pick free ones that work?
@tomberek if you are tempted to backport and cut any releases once this is merged, @emilazy noted we should make a quick pass to change the first UID from this change and the migration script PR both up to 351 to keep the same alignment between _nixbldN and UID as before and as on linux installs.
(Emily will probably open that PR. I may also do it for expedience if I am ~around when this merges--but some family stuff will make my schedule erratic this week.)
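The alignment mentioned above is just arithmetic: with a first UID of 351, `_nixbldN` gets UID 350 + N, the same way a first UID of 301 gave 300 + N. A minimal sketch, with a hypothetical function name; the installer's actual logic may be structured differently.

```shell
#!/bin/sh
# Hypothetical helper: UID for build user _nixbldN given the first build UID.
nixbld_uid() {
    first_uid=$1 n=$2
    echo $(( first_uid + n - 1 ))
}
```

For example, `nixbld_uid 351 32` gives the UID of `_nixbld32` for the default 32 build users, so the relocated range is 351-382. Starting at 350 instead would leave every name one off from its UID.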
@abathur I’ve opened https://github.com/NixOS/nix/pull/11372 for the migration script, and https://github.com/abathur/nix/pull/15 targeted at this PR branch. You should be able to rebase/fast‐forward merge it even before this PR is merged, if that’s convenient for you; otherwise I can retarget it at NixOS/nix after this is merged, but it’d be more convenient to have to backport fewer PRs.
This should definitely be backported all the way back to 2.18; 24.05 is going to have 2.18 as the default for the rest of its lifespan so there will still be users installing it for a while.
(Do we need to backport the migration script? I guess few people will be running it out of the tarball.)
Also, can't you do a check for GIDs and UIDs on Darwin, like the one Linux has, and automatically assign free ones?
(Do we need to backport the migration script? I guess few people will be running it out of the tarball.)
I don't think so (at the moment, at least. We have discussed a little whether to run it from nix-nix or the installer, but nothing is implemented there atm).
I like the idea of making the installer be able to do it because we already have an established curl | sh pipeline for that. (But not relevant for this PR, which we should land and backport.)
Just checking if you plan to hit the merge button on https://github.com/abathur/nix/pull/15 into your branch or if I should prepare to split it off into its own PR after?
I also opened https://github.com/DeterminateSystems/nix-installer/issues/1115 for the DetSys installer and will try to get this fixed in nix-darwin before the Sequoia release (ideally automatically doing a migration if possible).
I've been using some logic like this to help in scenarios where someone is trying to install on top of an existing install in an attempt to fix the problem, migrating them before our UID-checker fails.
if is_os_darwin && poly_group_exists "$NIX_BUILD_GROUP_NAME"; then
    export NIX_FIRST_BUILD_UID=351
    @usrShareDir@/scripts/migrate_uids_to_sequoia.sh
fi
I like the idea of making the installer be able to do it because we already have an established curl | sh pipeline for that. (But not relevant for this PR, which we should land and backport.)
I did look at this, but since a reinstall will fail fairly early before build users are otherwise touched, my qualm is about the ~experience of mutating the users on a run we know will fail. (But this isn't a hill I'm eager to die on if anyone's confident that this is incrementally better.)
Just checking if you plan to hit the merge button on abathur#15 into your branch or if I should prepare to split it off into its own PR after?
I think I can squeeze this in today, though I'm not pulling the trigger right this moment since I'm pondering the GID part of this. Since most macOS service users we're aware of are using the same UID+GID, I wonder if using 350 there instead of 351 leaves us a little more exposed than necessary to clashes (probably with some other non-Apple service, at least for the next few years)?
I think if the Apple devrel said 350+ is fine then 350 is probably as fine as anything for GID. If Apple (or someone else) takes up 350 then we’re in danger anyway :)
Right now Linux and Darwin have the property that the GID and UIDs don’t overlap, which I feel like we shouldn’t mess with at the same time as this, just in case. I don’t know if anything is going to explicitly hardcode an assumption about matching UIDs and GIDs, but if it did then we probably wouldn’t want the Nix builder group to be associated with _nixbld1 specifically in case that causes cursed builder‐user‐specific issues.
(We could also do 349, if 350 looks worryingly round for us to take, but I think it should be fine. There’s no real solution to this coordination problem anyway, sadly…)
@abathur @emilazy @cole-h @mkenigs
I'm intending to make a change to turn this throw into a printed error and a continue. This should make it a noisy warning (with a URL to the migration script or relevant issue/doc page?) but still allow usage. It would also be easy to backport.
https://github.com/NixOS/nix/blob/master/src/libstore/unix/user-lock.cc#L77-L78
This pull request has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/2024-08-28-nix-team-meeting-minutes-173/51302/1
@abathur @emilazy @mkenigs @cole-h I'm intending to merge this as-is. Any blockers?
I'd still like to have this be automatically applied, but I can leave that for a future PR.
I believe this is ready, but the backport labels down to 2.18 should be applied first, as otherwise users tracking the Nix version used by 24.05 or other intermediate versions will get broken Sequoia upgrades. Automatic migration would be nice but should be handled another time.