Inconsistent NuGet push order in .NET release publishing

Open filipnavara opened this issue 3 years ago • 39 comments

Describe the bug

When a new .NET servicing release is published, a lot of packages are pushed to the NuGet feed. Among those packages are the workload manifests and the workload packs referenced from those manifests. There is currently no enforced order in which these assets are uploaded, so a manifest on the feed can refer to packages that do not exist yet. This causes temporary failures of the dotnet workload install command.

Example:

Run dotnet workload install macos ios
  dotnet workload install macos ios
  shell: /bin/bash -e {0}
  env:
    DOTNET_ROOT: /Users/teamcity/.dotnet

Skip NuGet package signing validation. NuGet signing validation is not available on Linux or macOS https://aka.ms/workloadskippackagevalidation .
Updated advertising manifest microsoft.net.sdk.tvos.
Updated advertising manifest microsoft.net.sdk.maui.
Updated advertising manifest microsoft.net.sdk.maccatalyst.
Updated advertising manifest microsoft.net.sdk.ios.
Updated advertising manifest microsoft.net.workload.emscripten.
Updated advertising manifest microsoft.net.sdk.android.
Updated advertising manifest microsoft.net.sdk.macos.
Updated advertising manifest microsoft.net.workload.mono.toolchain.
Installing pack Microsoft.NET.Runtime.MonoAOTCompiler.Task version 6.0.2...
Writing workload pack installation record for Microsoft.NET.Runtime.MonoAOTCompiler.Task version 6.0.2...
Installing pack Microsoft.NET.Runtime.MonoTargets.Sdk version 6.0.2...
Writing workload pack installation record for Microsoft.NET.Runtime.MonoTargets.Sdk version 6.0.2...
Installing pack Microsoft.NETCore.App.Runtime.AOT.Cross.ios-arm version 6.0.2...
Workload installation failed. Rolling back installed packs...
Rolling back pack Microsoft.NET.Runtime.MonoAOTCompiler.Task installation...
Uninstalling workload pack Microsoft.NET.Runtime.MonoAOTCompiler.Task version 6.0.2…
Rolling back pack Microsoft.NET.Runtime.MonoTargets.Sdk installation...
Uninstalling workload pack Microsoft.NET.Runtime.MonoTargets.Sdk version 6.0.2…
Rolling back pack Microsoft.NETCore.App.Runtime.AOT.Cross.ios-arm installation...
Workload installation failed: microsoft.netcore.app.runtime.aot.osx-x64.cross.ios-arm::6.0.2 is not found in NuGet feeds https://nuget.pkg.github.com/emclient/index.json;https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-eng/nuget/v3/index.json;https://api.nuget.org/v3/index.json;https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet7/nuget/v3/index.json".
Error: Process completed with exit code 1.

filipnavara avatar Feb 08 '22 18:02 filipnavara

Hi @filipnavara, thanks for raising this issue.

We don't see this as a workloads-specific issue; instead, we see it as an issue with the way dotnet as an organization publishes packages on NuGet. We push a huge number of packages with each servicing release between workloads, various .NET Runtimes, and library releases, and several of these have timing issues similar to the one you describe here. Even if we had some ordering where we uploaded all workload packs prior to uploading the manifests, there's nothing enforcing on NuGet's end that indexing/CDN distribution/etc. keep that same ordering. The best advice I can give you in the general case is to wait a bit and try again.

baronfel avatar Feb 11 '22 19:02 baronfel

The problem for individual NuGets is different since we are in control of the versions (which we update through an explicit Dependabot PR, dotnet/arcade or another mechanism). However, with workloads there is no control unless we enforce the use of a full custom manifest. So the moment the new manifest is pushed to NuGet it breaks our CI pipelines. During the publishing of .NET 6.0.2 the breakage lasted approximately 2 hours. I am raising the issue in an attempt to find a way to minimize the disruption in the future. While I can easily identify what is happening and what caused the errors, others may not.

filipnavara avatar Feb 11 '22 20:02 filipnavara

I agree that workloads are a special case here because of the way advertising manifests are automatically updated during restore.

Maybe we can skip a manifest during that automatic update until it is at least X hours old?
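A minimal sketch of that age gate, assuming the client can learn a publish timestamp for each manifest version (how it would obtain one is not covered here). The function name and the `min_age` default are hypothetical, not part of the SDK:

```python
from datetime import datetime, timedelta

def pick_manifest_version(candidates, now, min_age=timedelta(hours=4)):
    """Return the newest version whose publish time is at least min_age old.

    candidates: list of (version_string, published_datetime) tuples, any order.
    Returns None when no candidate is old enough yet.
    """
    old_enough = [(published, version) for version, published in candidates
                  if now - published >= min_age]
    if not old_enough:
        return None
    # Newest by publish time among the versions that passed the age gate.
    return max(old_enough)[1]
```

With a 4-hour gate, a manifest published 20 minutes ago would be skipped in favor of the previous one until it has aged past the threshold.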

akoeplinger avatar Feb 11 '22 23:02 akoeplinger

FWIW we hit it again when 6.0.200 SDK was being published.

We install the SDK as part of the CI pipeline. So in addition to the order of NuGets being pushed the SDK acquisition process also gets the newer SDK first [before the NuGets are uploaded]. That also doesn't seem quite right.

filipnavara avatar Feb 16 '22 09:02 filipnavara

Hitting this presently with the current rollout for the ios workload.

sudo dotnet workload install ios
...
Workload(s) 'ios' are already installed.
Skipping NuGet package signature verification.
Installing workload manifest microsoft.net.sdk.ios version 16.2.1024…
Installing workload manifest microsoft.net.sdk.maccatalyst version 16.2.1024…
Installing workload manifest microsoft.net.sdk.macos version 13.1.1024…
Installing workload manifest microsoft.net.sdk.tvos version 16.1.1521…
Installing pack Microsoft.iOS.Sdk version 16.2.1024...
Writing workload pack installation record for Microsoft.iOS.Sdk.net7 version 16.2.1024...
Installing pack Microsoft.iOS.Sdk version 16.2.19...
Workload installation failed. Rolling back installed packs...
Rolling back pack Microsoft.iOS.Sdk installation...
Rolling back pack Microsoft.iOS.Sdk installation...
Uninstalling workload pack Microsoft.iOS.Sdk.net7 version 16.2.1024…
Workload installation failed: microsoft.ios.sdk::16.2.19 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

On closer inspection, it appears that Microsoft.NET.Sdk.iOS.Manifest-7.0.100 version 16.2.1024 contains:

"packs": {
    "Microsoft.iOS.Sdk.net7": {
        "kind": "sdk",
        "version": "16.2.1024",
        "alias-to": {
            "any": "Microsoft.iOS.Sdk"
        }
    },
    "Microsoft.iOS.Sdk.net6": {
        "kind": "sdk",
        "version": "16.2.19",
        "alias-to": {
            "any": "Microsoft.iOS.Sdk"
        }
    },

And it appears that Microsoft.iOS.Sdk version 16.2.1024 has been published, but version 16.2.19 has not - yet.

I suggest that there be some validation added such that a manifest can't be published until all of its dependent packages have been published first.

mattjohnsonpint avatar Feb 07 '23 20:02 mattjohnsonpint

Also note that this breaks our CI/CD builds, since the current workloads are installed with each build. We can't build until the workload install succeeds.

mattjohnsonpint avatar Feb 07 '23 20:02 mattjohnsonpint

Note also, even after it appeared on Nuget a while later, it's still not available until fully indexed.

Thus, the validation would need to really make sure it was available on nuget, not just sent there.

mattjohnsonpint avatar Feb 07 '23 20:02 mattjohnsonpint

FYI, it's working again now that the indexing has completed. Hope the validation can be added so we don't encounter it again next time.

mattjohnsonpint avatar Feb 07 '23 20:02 mattjohnsonpint

@baronfel @marcpopMSFT what are our options here? Is this something that can be controlled via nuget publishing today and we need to opt in? Or is this a wider discussion we need to have?

steveisok avatar Feb 07 '23 20:02 steveisok

@rbhanda in case he has any ideas.

We could potentially control the order of publish to nuget for the runtime workloads so that the packs go first and then the manifests. We could also coordinate a delayed manifest publish. We'd still have a problem if someone caught one of the manifests but not the others since there are half a dozen of them and they got into a partial state. Does nuget.org have the ability to publish first but not make them visible so we could do that all at once?

Note that the above issue was the MAUI workload publishing and I don't know if that's managed by Rahul or by the MAUI team.

marcpopMSFT avatar Feb 08 '23 01:02 marcpopMSFT

Yeah, the underlying issue here is that workloads conceptually provide a mechanism to tie multiple packages together as a unit and our nuget publishing pipeline has no concept of that dependency at any level. To really handle this we'd need something like a staged transaction to the nuget database.

lewing avatar Feb 08 '23 01:02 lewing

NuGet can "unlist" packages which hides them from search results etc (and theoretically also from workloads assuming we correctly implement this).

As far as I know nuget.org doesn't support pushing a package in the unlisted state but there's an API to unlist that we could call as soon as we push the manifest package and then relist it later once all the other packages are pushed.

Or ask the NuGet team to add a ?publishUnlisted=true parameter to the push API.

akoeplinger avatar Feb 09 '23 15:02 akoeplinger

CC @JonDouglas @aortiz-msft for the interesting feature request (either batching or publish unlisted).

I'll ping @rbhanda offline and see if publishing, unlisting, and then later listing is a potential option here.

marcpopMSFT avatar Feb 09 '23 21:02 marcpopMSFT

We could potentially control the order of publish to nuget for the runtime workloads

This is definitely something that we can look into, though, as noted, the package index status seems to be the crux of the issue.

leecow avatar Feb 09 '23 23:02 leecow

This would be a NuGet.org ask so tagging @joelverhagen and @clairernovotny

aortiz-msft avatar Feb 11 '23 02:02 aortiz-msft

As @baronfel mentioned availability order of packages on NuGet.org is not guaranteed, even if packages are pushed in some clever (e.g. reverse dependency) order. This is because our asynchronous validation pipeline (malware scanning, signing, and more) can take a different amount of time per package. Imagine there is a package in the middle of the dependency graph that takes a bit longer to malware scan or is owned by another team that runs their push at a slightly different time (this happens more than you'd think). It's sort of a messy problem.

It's not currently feasible to fix the availability order unless the actor pushing the package polls for package availability before continuing with other package pushes. If done naively this would have horrible throughput (avg validation time * total number of packages to push, so many hours for big .NET releases). It would be possible to essentially do a topological sort to identify sets of packages that can be pushed together but this is all a hack/workaround for the feature gap on NuGet.org.
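To illustrate the topological-sort workaround mentioned above, here is a minimal sketch using Python's standard `graphlib` (3.9+). The dependency map and the `push_batches` name are hypothetical; a real release would have to derive the graph from package metadata and poll for availability between batches:

```python
from graphlib import TopologicalSorter

def push_batches(dependencies):
    """Group packages into batches that can be pushed together.

    dependencies: dict mapping a package ID to the set of package IDs it
    depends on. Every package in a batch depends only on packages that
    appear in earlier batches.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on a dependency cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

The number of polling rounds then scales with the depth of the dependency graph rather than with the total number of packages.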

This general feature request is already tracked here https://github.com/NuGet/NuGetGallery/issues/3931. Feel free to add additional comments or upvote. I think the described solution ("staging") is the Proper fix for the problem, but it's a lot of work and needs some exploration with stakeholders to make sure the staging works as needed. I can say it's not on our radar for the next 6 months given our other priorities and team capacity.

We could potentially control the order of publish to nuget for the runtime workloads so that the packs go first and then the manifests. We could also coordinate a delayed manifest publish. We'd still have a problem if someone caught one of the manifests but not the others since there are half a dozen of them and they got into a partial state. Does nuget.org have the ability to publish first but not make them visible so we could do that all at once?

Responding to @marcpopMSFT's https://github.com/dotnet/sdk/issues/23820#issuecomment-1421757609, it is possible to publish packages as unlisted. This can be done by using the "unlist" gesture (nuget.exe, .NET CLI, API endpoint) immediately after pushing, while the package is still in the validating state. This will always be the case since validation takes 2+ minutes and you can unlist immediately after the push request completes. Then, after you have detected that packages should be listed (i.e. the full dependency graph is available), you can use the "relist" gesture (no CLI support, but has API support).
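A minimal sketch of the unlist/relist calls, assuming the nuget.org publish endpoint as described in the NuGet "Push and Delete" API documentation (DELETE on a package version is treated as unlist, POST to the same URL relists). The helper name is hypothetical, and the function only builds the request; sending it and handling errors is left to the caller:

```python
import urllib.request

# Assumption: nuget.org's PackagePublish endpoint; other feeds may differ.
GALLERY = "https://www.nuget.org/api/v2/package"

def listing_request(package_id, version, api_key, listed):
    """Build (but do not send) the request that relists or unlists a version.

    listed=False -> DELETE (nuget.org interprets this as "unlist")
    listed=True  -> POST   (relist)
    """
    method = "POST" if listed else "DELETE"
    url = f"{GALLERY}/{package_id}/{version}"
    return urllib.request.Request(
        url, method=method, headers={"X-NuGet-ApiKey": api_key})
```

A release pipeline could send the DELETE request right after each manifest push completes, then send the POST requests for all manifests together once the packs are confirmed available.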

As far as I know nuget.org doesn't support pushing a package in the unlisted state but there's an API to unlist that we could call as soon as we push the manifest package and then relist it later once all the other packages are pushed.

Responding to @akoeplinger's https://github.com/dotnet/sdk/issues/23820#issuecomment-1424349420, yes that's right. As mentioned to Marc above, you can unlist immediately after push so it can work today. Feel free to open an issue about enhancing the push protocol to include a "listed = false" parameter. This would remove the additional round trip, eliminate any crazy race condition caused by a slow/errored unlist or hyper fast validation, and provide parity with the UI upload flow which currently allows uploading a package as unlisted.

FYI, the best way to detect if a version is available is to use this API endpoint (a HEAD request should be fine): https://learn.microsoft.com/en-us/nuget/api/package-base-address-resource#download-package-content-nupkg. This endpoint has good performance and availability globally and properly caches 200s but not 404s.
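A minimal sketch of that availability check. It assumes nuget.org's flat-container base address; a robust client would discover the base address from the service index and normalize the version string first. The function names are illustrative:

```python
import urllib.error
import urllib.request

# Assumption: nuget.org's package base address (PackageBaseAddress resource).
FLAT_CONTAINER = "https://api.nuget.org/v3-flatcontainer"

def nupkg_url(package_id, version, base=FLAT_CONTAINER):
    """Build the package-content URL; ID and version must be lowercased."""
    lower_id, lower_version = package_id.lower(), version.lower()
    return f"{base}/{lower_id}/{lower_version}/{lower_id}.{lower_version}.nupkg"

def is_available(package_id, version, timeout=10):
    """Return True on a 200 response to a HEAD request, False on a 404."""
    request = urllib.request.Request(nupkg_url(package_id, version), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # throttling and server errors are left to the caller
```

For example, `is_available("Microsoft.iOS.Sdk", "16.2.19")` would issue a HEAD request for `.../microsoft.ios.sdk/16.2.19/microsoft.ios.sdk.16.2.19.nupkg`.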

joelverhagen avatar Feb 13 '23 17:02 joelverhagen

Thanks for the detailed response @joelverhagen. Glad to know there's a 2 minute validation window during which we can unlist.

I met with our release team and we have a low cost proposal for the next release. The plan is to publish all .NET packages excluding the manifest packages, poll for availability (which I understand they already do), and then publish the manifest packages as a separate step afterwards. There is potential delay from the polling and there is still the potential for customer impact since the dozen or so manifests would still potentially light up at different times for different customers but it would reduce that window from 30+ minutes down to a much smaller window.

If that goes well, we'll continue with that option. If there are still issues, we can explore the option of pushing unlisted and then listing all of the manifests together. I'll be meeting with some Maui folks later today and suggesting they follow the same pattern as their publish is a separate process from the core .NET package publish today.
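The low-cost plan above (push everything except the manifests, poll, then push the manifests) might be sketched as follows. This is not the release team's actual tooling: `push`, `is_available` and `is_manifest` are hypothetical caller-supplied callables, and the polling interval and limit are arbitrary:

```python
import time

def publish_release(packages, push, is_available, is_manifest,
                    poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Push non-manifest packages, wait until all are available, then push manifests.

    packages: iterable of (package_id, version) tuples.
    push, is_available, is_manifest: hypothetical callables taking one tuple.
    """
    packages = list(packages)
    manifests = [p for p in packages if is_manifest(p)]
    packs = [p for p in packages if not is_manifest(p)]

    for package in packs:
        push(package)

    # Poll until every pushed pack is reported available.
    pending = set(packs)
    for _ in range(max_polls):
        pending = {p for p in pending if not is_available(p)}
        if not pending:
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"still unavailable: {sorted(pending)}")

    # Manifests go last, so they only ever reference available packs.
    for package in manifests:
        push(package)
```

The manifests themselves can still become visible at different times, so this narrows the failure window rather than closing it.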

marcpopMSFT avatar Feb 13 '23 18:02 marcpopMSFT

Hitting this again presently.

Workload update failed: microsoft.ios.sdk::16.2.29 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

Related to rollout of https://www.nuget.org/packages/Microsoft.NET.Sdk.iOS.Manifest-7.0.100/16.2.1040

mattjohnsonpint avatar Mar 14 '23 16:03 mattjohnsonpint

Should be resolved when indexing is complete.

mattjohnsonpint avatar Mar 14 '23 16:03 mattjohnsonpint

Working again now.

mattjohnsonpint avatar Mar 14 '23 16:03 mattjohnsonpint

@marcpopMSFT this seems to be happening again with today's release, is the proposed mitigation you outlined supposed to happen this release already or in some future release?

akoeplinger avatar Apr 11 '23 14:04 akoeplinger

Thanks for the detailed response @joelverhagen. Glad to know there's a 2 minute validation window during which we can unlist.

I met with our release team and we have a low cost proposal for the next release. The plan is to publish all .NET packages excluding the manifest packages, poll for availability (which I understand they already do), and then publish the manifest packages as a separate step afterwards. There is potential delay from the polling and there is still the potential for customer impact since the dozen or so manifests would still potentially light up at different times for different customers but it would reduce that window from 30+ minutes down to a much smaller window.

If that goes well, we'll continue with that option. If there are still issues, we can explore the option of pushing unlisted and then listing all of the manifests together. I'll be meeting with some Maui folks later today and suggesting they follow the same pattern as their publish is a separate process from the core .NET package publish today.

Working again now. Was this method used this time when publishing 16.2.46?

liejuntao001 avatar Apr 11 '23 17:04 liejuntao001

@akoeplinger @liejuntao001 can you confirm which component was out of order? We're working with release teams but it's mostly a manual process to ensure that the manifests get published last and that may have been missed this release.

marcpopMSFT avatar Apr 12 '23 18:04 marcpopMSFT

Some reports I saw:

Workload installation failed: microsoft.netcore.app.runtime.mono.android-x64::6.0.16 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
Workload installation failed: Workload manifest dependency 'Microsoft.NET.Workload.Emscripten.net6' version '7.0.4' is lower than version '7.0.5' required by manifest 'microsoft.net.workload.mono.toolchain.net6' [/Users/runner/hostedtoolcache/dotnet/sdk-manifests/7.0.100/microsoft.net.workload.mono.toolchain.net6/WorkloadManifest.json]

akoeplinger avatar Apr 12 '23 18:04 akoeplinger

poll for availability (which I understand they already do), and then publish the manifest packages as a separate step afterwards.

@marcpopMSFT can you share some details on the "poll for availability" step? Is there an example of how the .NET release team did it?

This is not implemented in any of the MAUI release pipelines yet.

jonathanpeppers avatar Apr 12 '23 18:04 jonathanpeppers

I saw these from our Azure pipeline builds:

2023-04-11T14:22:03.8990170Z Workload installation failed: microsoft.netcore.app.runtime.mono.android-x64::6.0.16 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

2023-04-11T15:10:42.2530970Z Workload installation failed: microsoft.ios.sdk::16.2.46 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

2023-04-11T15:43:07.5734510Z Workload installation failed: microsoft.ios.sdk::16.2.46 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

later worked at 2023-04-11T17:21:17.8915980Z Successfully installed workload(s) android ios maui wasm-tools.

liejuntao001 avatar Apr 12 '23 19:04 liejuntao001

The fact we have some reports of:

2023-04-11T14:22:03.8990170Z Workload installation failed: microsoft.netcore.app.runtime.mono.android-x64::6.0.16 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

Seems like maybe the .NET release side wasn't working either?

jonathanpeppers avatar Apr 12 '23 19:04 jonathanpeppers

I just checked with our release team. It looks like they had not recorded the outcome of our planned release ordering in their release steps and so had missed delaying the manifest publish until after all of the other packages are published. They plan on updating that process and hopefully this will be improved in May.

To summarize, the plan is to take the manifests out of the normal release process, publish all other packages, and then publish the manifests. This doesn't guarantee that everything will work, as it depends on when the packages become available on nuget.org, but it will greatly reduce the window where things might be broken (the current publish takes >30 minutes and the manifests were publishing near the beginning of that process, creating a large window where all workload commands would fail). We're still asking for a feature from NuGet to link packages together and make them all available at the same time.

marcpopMSFT avatar Apr 12 '23 20:04 marcpopMSFT

@marcpopMSFT this plan didn't quite work for us in practice -- we published the Android manifest last, and yet it was the first available on NuGet.org. It is a much smaller package, so I think it made it through the queue quickly.

So, I'm interested in how we "poll for availability" on NuGet. We could also put a 30 min. delay in the release pipeline, but I'm not sure that is a 100% solution. I believe NuGet was quite busy this last release day, and it took around an hour for things to resolve.

jonathanpeppers avatar Apr 12 '23 20:04 jonathanpeppers

Is there any way an internal validation check could be added to dotnet workload install/update commands such that the previous version of a workload would continue to be used until the new version is fully published? It seems like it could send HEAD requests to nuget (as mentioned before) to validate the manifest before attempting to install any packages. If any package in the latest manifest isn't available, then publishing is in-progress so use the previous version and continue without failure. This would keep end-user CI builds from breaking, regardless of what changes are made to the publishing workflow.
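A minimal sketch of that client-side check, based on the manifest excerpt quoted earlier in this thread. It is not how `dotnet workload` is implemented: the function names are hypothetical, `is_available` is a caller-supplied callable (for example one that sends a HEAD request), and only the `"any"` alias is resolved, whereas real manifests can also contain platform-specific aliases:

```python
def packs_required(manifest):
    """Yield (package_id, version) for each pack in a parsed WorkloadManifest.json.

    Resolves "alias-to" using only the "any" key; platform-specific aliases
    are ignored in this sketch.
    """
    for pack_id, pack in manifest.get("packs", {}).items():
        alias = pack.get("alias-to", {}).get("any", pack_id)
        yield alias, pack["version"]

def choose_manifest(latest, previous, is_available):
    """Use the latest manifest only if every pack it names is available."""
    if all(is_available(pid, ver) for pid, ver in packs_required(latest)):
        return latest
    return previous  # publishing still in progress: keep the old manifest
```

Applied to the iOS manifest above, the check would have kept the previous manifest until Microsoft.iOS.Sdk 16.2.19 became available, instead of failing the install.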

mattjohnsonpint avatar Apr 13 '23 15:04 mattjohnsonpint