sdk icon indicating copy to clipboard operation
sdk copied to clipboard

Regression with package restore: RID Graph not respected when selecting native library from nuget package

Open erikest opened this issue 5 years ago • 6 comments

According to RID Graph, when restoring nuget packages, it should be the case that the when a library for a specific rid is available it should prefer that before walking the fallback graph. Instead, the behavior I'm seeing is that it conflicts with a library in a fallback rid.

I've added support for gRPC with .net core on raspberry pi with a package for that specific rid: linux-arm. My nuget package uses the convention for native libraries runtimes\<rid>\native\<lib>. The grpc core package also uses this convention, but doesn't have the linux-arm rid, just linux. Both packages are required when publishing for arm, but with the native library from mine and the .net assembly from grpc core.

When publishing with -r linux-arm, I see the following:

Encountered conflict between 'CopyLocal:C:\Users\Geocoder\.nuget\packages\grpc.core\2.26.0\runtimes\linux\native\libgrpc_csharp_ext.x86.so' and 'CopyLocal:C:\Users\Geocoder\.nuget\packages\libgrpc_csharp_ext.arm7\1.0.9\runtimes\linux-arm\native\libgrpc_csharp_ext.x86.so'. Choosing 'CopyLocal:C:\Users\Geocoder\.nuget\packages\grpc.core\2.26.0\runtimes\linux\native\libgrpc_csharp_ext.x86.so' arbitrarily as both items are copy-local and have equal file and assembly versions.

Here are steps to reproduce:

create a new project and add the packages

dotnet new console -o grpcTest
dotnet add package Grpc.Core
dotnet add package Grpc.Net.Client
dotnet add package Grpc.Tools
dotnet add package libgrpc_csharp_ext.arm7 --version 1.1.0

publish to arm

dotnet publish -r linux-arm --self-contained

erikest avatar Jan 20 '20 04:01 erikest

Is there a work-around?

bitzeal-johan avatar Nov 05 '20 14:11 bitzeal-johan

@bitzeal-johan yes there is a workaround. You can compile the lib yourself as @erikest outlined in https://github.com/erikest/libgrpc_csharp_ext and then replace the binary in the output folder

dotnet publish -r linux-arm --self-contained
cp libgrpc_csharp_ext.x86.so ./bin/Debug/netcoreapp3.1/linux-arm/publish/

I used this method a while ago (on net3.1 though)

Liriel avatar Dec 03 '20 07:12 Liriel

I just tried this with the .NET 6 SDK. I don't see this error.

This is what I see in terms of the .so file that was mentioned.

root@a1e6b08d8845:/grpcTest# find . | grep libgrpc_csharp_ext.x86.so
./bin/Debug/net6.0/linux-arm/publish/libgrpc_csharp_ext.x86.so
./bin/Debug/net6.0/linux-arm/libgrpc_csharp_ext.x86.so

Perhaps the packages have been updated.

richlander avatar Jul 09 '21 04:07 richlander

@richlander also true for .NET 5

$ find . | grep libgrpc_csharp_ext.x86.so
./bin/Debug/net5.0/linux-arm/publish/libgrpc_csharp_ext.x86.so
./bin/Debug/net5.0/linux-arm/libgrpc_csharp_ext.x86.so

the package libgrpc_csharp_ext.arm7 remains unchanged however

Liriel avatar Jul 12 '21 14:07 Liriel

The issue persists with .NET 8.0 and .NET 9.0 preview, see:

$ dotnet new console -o netvips
$ cd netvips/
$ dotnet add package NetVips.Native.linux-arm64 --version 8.15.2
$ dotnet add package NetVips.Native.linux-musl-arm64 --version 8.15.2
$ dotnet publish -r linux-musl-arm64 -o .
$ echo "3aa9e0d2865e8a65a0d4d7fa4011e1ab2b6e5c313b53d7584363da817db2b45f libvips.so.42" | sha256sum --check --quiet && echo "arm64-musl libvips.so.42 detected"
libvips.so.42: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
$ echo "d9a06f7adce8a07213c68ba0be7046a255d2952893174fed5b1187589e770aa2 libvips.so.42" | sha256sum --check --quiet && echo "arm64-glibc libvips.so.42 detected"
arm64-glibc libvips.so.42 detected
$ readelf -d libvips.so.42 | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libresolv.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

Introducing the linux-glibc-arm64 RID could possibly fix this, as it ensures that this fallback never happens, see: https://github.com/dotnet/runtime/issues/7318

This was originally reported at https://github.com/kleisauke/net-vips/issues/233.

kleisauke avatar Jun 20 '24 12:06 kleisauke

I further debugged this issue using the following command:

$ dotnet publish -r linux-musl-arm64 -o . -v diag | grep "arbitrarily"

Which outputs:

Encountered conflict between 'CopyLocal:/home/kleisauke/.nuget/packages/netvips.native.linux-arm64/8.15.2/runtimes/linux-arm64/native/libvips.so.42' and 'CopyLocal:/home/kleisauke/.nuget/packages/netvips.native.linux-musl-arm64/8.15.2/runtimes/linux-musl-arm64/native/libvips.so.42'. Choosing 'CopyLocal:/home/kleisauke/.nuget/packages/netvips.native.linux-arm64/8.15.2/runtimes/linux-arm64/native/libvips.so.42' arbitrarily as both items are copy-local and have equal file and assembly versions.

This message is printed as part of the conflict resolution logic in the .NET SDK, specifically: https://github.com/dotnet/sdk/blob/941ec62ab28b01b4efcab6bcdaec37b2d319b965/src/Tasks/Common/ConflictResolution/ConflictResolver.cs#L329-L348

The resolver always prefers the NetVips.Native.linux-arm64 package over NetVips.Native.linux-musl-arm64 because, in an ordinal string comparison, linux-arm64 precedes linux-musl-arm64.

The assumption that the assemblies are fully semantically equivalent is questionable, as it does not hold true in this case.

kleisauke avatar Jun 23 '24 12:06 kleisauke

As it stands, we can condition the glibc package like this: Condition="$(RuntimeIdentifier.StartsWith('linux-')) and !$(RuntimeIdentifier.StartsWith('linux-musl-'))" and musl one: Condition="$(RuntimeIdentifier.StartsWith('linux-musl-'))"which gives toolchain proper gestures to pick the right dependency. We have similar conditions in NativeAOT integration.

am11 avatar Oct 27 '24 12:10 am11

@baronfel -- is this a change we can make?

richlander avatar Oct 27 '24 14:10 richlander

Looks like this one wasn't tagged so it was never triaged by the team. We'll discuss this week.

baronfel avatar Oct 27 '24 14:10 baronfel

For triage purposes: the original issue was a conflict with the fallback linux RID, which is inherited by the linux-arm RID. Both provided a shared library named libgrpc_csharp_ext.x86.so, but packaged in different directories (i.e. runtimes/linux/native/ and runtimes/linux-arm/native/) across separate NuGet packages.

That issue was only reproducible with the Grpc.Core package versions prior version 2.34.0 and the libgrpc_csharp_ext.arm7 package. These 32-bit libraries were removed in commit https://github.com/grpc/grpc/commit/335a9eac23b63330897fd22f81b0880c9f5c4b33, and the RIDs were revised to arch-specific ones in commit https://github.com/grpc/grpc/commit/6dbdedd9a0257dfbcdf72891d74e62ffdeb36118.

kleisauke avatar Oct 28 '24 11:10 kleisauke

sdk-issue-4195.zip

Here's a binlog of @kleisauke's comment above. It seems pretty clear to me that we need a proper disjoint RID graph to disambiguate the two assets - they are being written to the project.assets.json in a correct way:

"NetVips.Native.linux-arm64/8.15.2": {
        "type": "package",
        "compile": {
          "lib/netstandard1.0/_._": {}
        },
        "runtime": {
          "lib/netstandard1.0/_._": {}
        },
        "runtimeTargets": {
          "runtimes/linux-arm64/native/libvips.so.42": {
            "assetType": "native",
            "rid": "linux-arm64"
          }
        }
      },
      "NetVips.Native.linux-musl-arm64/8.15.2": {
        "type": "package",
        "compile": {
          "lib/netstandard2.1/_._": {}
        },
        "runtime": {
          "lib/netstandard2.1/_._": {}
        },
        "runtimeTargets": {
          "runtimes/linux-musl-arm64/native/libvips.so.42": {
            "assetType": "native",
            "rid": "linux-musl-arm64"
          }
        }

but because our entire toolchain doesn't recognize that linux-musl isn't necessarily compatible with linux that original sin taints our entire resolution/asset copying process.

baronfel avatar Oct 29 '24 14:10 baronfel

It goes back to a very old discussion https://github.com/dotnet/runtime/issues/3075#issuecomment-377328964. Basically RID is not guaranteeing "binary compatibility". I think in this case, when both linux-musl-{arch} and linux-{arch} assets are present, all we need is to give precedence to linux-musl one. Same goes for linux-bionic (aka Android) etc.

am11 avatar Oct 29 '24 15:10 am11

Triage: @elinor-fung this looks like a potential ordering issue in the RID graph. Would it make sense to fix it there rather than us trying to special casing this in conflict resolution?

marcpopMSFT avatar Oct 29 '24 20:10 marcpopMSFT

I think I'm missing what the potential ordering issue is in the RID graph? In the graph itself, linux-musl-<arch> does get priority over linux-<arch>.

My understanding of this issue is that the appropriate assets for each package are selected as expected (per the RID graph, with linux-musl taking priority of linux). But in this repro with multiple packages provide the same asset, the SDK chooses one alphabetically. Could it use the RID graph as part of the conflict resolution?

elinor-fung avatar Oct 29 '24 22:10 elinor-fung

According to https://github.com/kleisauke/net-vips/issues/233#issuecomment-2180473285, merging NetVips.Native.linux-arm64 and NetVips.Native.linux-musl-arm64 into a single NuGet package may resolve the issue (unverified).

kleisauke avatar Oct 30 '24 10:10 kleisauke

We've been using this method for a while and have had no more issues with the wrong native dependency being loaded. However, it's unfortunate given that repackaging anything other people choose to have split (which is a perfectly valid strategy) is extra work for us, so I'd love to see this fixed in the sdk.

zotanmew avatar Oct 30 '24 10:10 zotanmew

My understanding of this issue is that the appropriate assets for each package are selected as expected (per the RID graph, with linux-musl taking priority of linux). But in this repro with multiple packages provide the same asset, the SDK chooses one alphabetically. Could it use the RID graph as part of the conflict resolution?

Thanks for clarifying @elinor-fung - this does make sense to me when it's put like that.

sdk-issues-4195.zip

I've attached a binlog for future analysis - the main blocker that I see to easily doing this work is that the ReferenceCopyLocalPaths Item[] parameter (which comes from the ReferenceCopyLocalPaths MSBuild Item type) has the following metadata:

Image

Ideally we would annotate files from the packages that are RID-specific with that data, and then we could potentially use that data to de-conflict. So I see two units of work here:

  • flow RID/TFM data for assets of various types into the relevant MSBuild Items. This data does exist in the project.assets.json already, so we hopefully just need to find the correct location in the build to attach this data.
  • use the new RID data to de-conflict, applying a 'best match' algorithm based on the RID being targeted (which would be a new input to conflict resolution) and the RIDs of the conflicting assets.

baronfel avatar Oct 30 '24 15:10 baronfel