Regression with package restore: RID Graph not respected when selecting native library from nuget package
According to RID Graph, when restoring nuget packages, it should be the case that the when a library for a specific rid is available it should prefer that before walking the fallback graph. Instead, the behavior I'm seeing is that it conflicts with a library in a fallback rid.
I've added support for gRPC with .net core on raspberry pi with a package for that specific rid: linux-arm. My nuget package uses the convention for native libraries runtimes\<rid>\native\<lib>. The grpc core package also uses this convention, but doesn't have the linux-arm rid, just linux. Both packages are required when publishing for arm, but with the native library from mine and the .net assembly from grpc core.
When publishing with -r linux-arm, I see the following:
Encountered conflict between 'CopyLocal:C:\Users\Geocoder\.nuget\packages\grpc.core\2.26.0\runtimes\linux\native\libgrpc_csharp_ext.x86.so' and 'CopyLocal:C:\Users\Geocoder\.nuget\packages\libgrpc_csharp_ext.arm7\1.0.9\runtimes\linux-arm\native\libgrpc_csharp_ext.x86.so'. Choosing 'CopyLocal:C:\Users\Geocoder\.nuget\packages\grpc.core\2.26.0\runtimes\linux\native\libgrpc_csharp_ext.x86.so' arbitrarily as both items are copy-local and have equal file and assembly versions.
Here are steps to reproduce:
create a new project and add the packages
dotnet new console -o grpcTest
dotnet add package Grpc.Core
dotnet add package Grpc.Net.Client
dotnet add package Grpc.Tools
dotnet add package libgrpc_csharp_ext.arm7 --version 1.1.0
publish to arm
dotnet publish -r linux-arm --self-contained
Is there a work-around?
@bitzeal-johan yes there is a workaround. You can compile the lib yourself as @erikest outlined in https://github.com/erikest/libgrpc_csharp_ext and then replace the binary in the output folder
dotnet publish -r linux-arm --self-contained
cp libgrpc_csharp_ext.x86.so ./bin/Debug/netcoreapp3.1/linux-arm/publish/
I used this method a while ago (on net3.1 though)
I just tried this with the .NET 6 SDK. I don't see this error.
This is what I see in terms of the .so file that was mentioned.
root@a1e6b08d8845:/grpcTest# find . | grep libgrpc_csharp_ext.x86.so
./bin/Debug/net6.0/linux-arm/publish/libgrpc_csharp_ext.x86.so
./bin/Debug/net6.0/linux-arm/libgrpc_csharp_ext.x86.so
Perhaps the packages have been updated.
@richlander also true for .NET 5
$ find . | grep libgrpc_csharp_ext.x86.so
./bin/Debug/net5.0/linux-arm/publish/libgrpc_csharp_ext.x86.so
./bin/Debug/net5.0/linux-arm/libgrpc_csharp_ext.x86.so
the package libgrpc_csharp_ext.arm7 remains unchanged however
The issue persists with .NET 8.0 and .NET 9.0 preview, see:
$ dotnet new console -o netvips
$ cd netvips/
$ dotnet add package NetVips.Native.linux-arm64 --version 8.15.2
$ dotnet add package NetVips.Native.linux-musl-arm64 --version 8.15.2
$ dotnet publish -r linux-musl-arm64 -o .
$ echo "3aa9e0d2865e8a65a0d4d7fa4011e1ab2b6e5c313b53d7584363da817db2b45f libvips.so.42" | sha256sum --check --quiet && echo "arm64-musl libvips.so.42 detected"
libvips.so.42: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
$ echo "d9a06f7adce8a07213c68ba0be7046a255d2952893174fed5b1187589e770aa2 libvips.so.42" | sha256sum --check --quiet && echo "arm64-glibc libvips.so.42 detected"
arm64-glibc libvips.so.42 detected
$ readelf -d libvips.so.42 | grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libresolv.so.2]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
Introducing the linux-glibc-arm64 RID could possibly fix this, as it ensures that this fallback never happens, see: https://github.com/dotnet/runtime/issues/7318
This was originally reported at https://github.com/kleisauke/net-vips/issues/233.
I further debugged this issue using the following command:
$ dotnet publish -r linux-musl-arm64 -o . -v diag | grep "arbitrarily"
Which outputs:
Encountered conflict between 'CopyLocal:/home/kleisauke/.nuget/packages/netvips.native.linux-arm64/8.15.2/runtimes/linux-arm64/native/libvips.so.42' and 'CopyLocal:/home/kleisauke/.nuget/packages/netvips.native.linux-musl-arm64/8.15.2/runtimes/linux-musl-arm64/native/libvips.so.42'. Choosing 'CopyLocal:/home/kleisauke/.nuget/packages/netvips.native.linux-arm64/8.15.2/runtimes/linux-arm64/native/libvips.so.42' arbitrarily as both items are copy-local and have equal file and assembly versions.
This message is printed as part of the conflict resolution logic in the .NET SDK, specifically: https://github.com/dotnet/sdk/blob/941ec62ab28b01b4efcab6bcdaec37b2d319b965/src/Tasks/Common/ConflictResolution/ConflictResolver.cs#L329-L348
The resolver always prefers the NetVips.Native.linux-arm64 package over NetVips.Native.linux-musl-arm64 because, in an ordinal string comparison, linux-arm64 precedes linux-musl-arm64.
The assumption that the assemblies are fully semantically equivalent is questionable, as it does not hold true in this case.
As it stands, we can condition the glibc package like this: Condition="$(RuntimeIdentifier.StartsWith('linux-')) and !$(RuntimeIdentifier.StartsWith('linux-musl-'))" and musl one: Condition="$(RuntimeIdentifier.StartsWith('linux-musl-'))"which gives toolchain proper gestures to pick the right dependency. We have similar conditions in NativeAOT integration.
@baronfel -- is this a change we can make?
Looks like this one wasn't tagged so it was never triaged by the team. We'll discuss this week.
For triage purposes: the original issue was a conflict with the fallback linux RID, which is inherited by the linux-arm RID. Both provided a shared library named libgrpc_csharp_ext.x86.so, but packaged in different directories (i.e. runtimes/linux/native/ and runtimes/linux-arm/native/) across separate NuGet packages.
That issue was only reproducible with the Grpc.Core package versions prior version 2.34.0 and the libgrpc_csharp_ext.arm7 package. These 32-bit libraries were removed in commit https://github.com/grpc/grpc/commit/335a9eac23b63330897fd22f81b0880c9f5c4b33, and the RIDs were revised to arch-specific ones in commit https://github.com/grpc/grpc/commit/6dbdedd9a0257dfbcdf72891d74e62ffdeb36118.
Here's a binlog of @kleisauke's comment above. It seems pretty clear to me that we need a proper disjoint RID graph to disambiguate the two assets - they are being written to the project.assets.json in a correct way:
"NetVips.Native.linux-arm64/8.15.2": {
"type": "package",
"compile": {
"lib/netstandard1.0/_._": {}
},
"runtime": {
"lib/netstandard1.0/_._": {}
},
"runtimeTargets": {
"runtimes/linux-arm64/native/libvips.so.42": {
"assetType": "native",
"rid": "linux-arm64"
}
}
},
"NetVips.Native.linux-musl-arm64/8.15.2": {
"type": "package",
"compile": {
"lib/netstandard2.1/_._": {}
},
"runtime": {
"lib/netstandard2.1/_._": {}
},
"runtimeTargets": {
"runtimes/linux-musl-arm64/native/libvips.so.42": {
"assetType": "native",
"rid": "linux-musl-arm64"
}
}
but because our entire toolchain doesn't recognize that linux-musl isn't necessarily compatible with linux that original sin taints our entire resolution/asset copying process.
It goes back to a very old discussion https://github.com/dotnet/runtime/issues/3075#issuecomment-377328964. Basically RID is not guaranteeing "binary compatibility". I think in this case, when both linux-musl-{arch} and linux-{arch} assets are present, all we need is to give precedence to linux-musl one. Same goes for linux-bionic (aka Android) etc.
Triage: @elinor-fung this looks like a potential ordering issue in the RID graph. Would it make sense to fix it there rather than us trying to special casing this in conflict resolution?
I think I'm missing what the potential ordering issue is in the RID graph? In the graph itself, linux-musl-<arch> does get priority over linux-<arch>.
My understanding of this issue is that the appropriate assets for each package are selected as expected (per the RID graph, with linux-musl taking priority of linux). But in this repro with multiple packages provide the same asset, the SDK chooses one alphabetically. Could it use the RID graph as part of the conflict resolution?
According to https://github.com/kleisauke/net-vips/issues/233#issuecomment-2180473285, merging NetVips.Native.linux-arm64 and NetVips.Native.linux-musl-arm64 into a single NuGet package may resolve the issue (unverified).
We've been using this method for a while and have had no more issues with the wrong native dependency being loaded. However, it's unfortunate given that repackaging anything other people choose to have split (which is a perfectly valid strategy) is extra work for us, so I'd love to see this fixed in the sdk.
My understanding of this issue is that the appropriate assets for each package are selected as expected (per the RID graph, with linux-musl taking priority of linux). But in this repro with multiple packages provide the same asset, the SDK chooses one alphabetically. Could it use the RID graph as part of the conflict resolution?
Thanks for clarifying @elinor-fung - this does make sense to me when it's put like that.
I've attached a binlog for future analysis - the main blocker that I see to easily doing this work is that the ReferenceCopyLocalPaths Item[] parameter (which comes from the ReferenceCopyLocalPaths MSBuild Item type) has the following metadata:
Ideally we would annotate files from the packages that are RID-specific with that data, and then we could potentially use that data to de-conflict. So I see two units of work here:
- flow RID/TFM data for assets of various types into the relevant MSBuild Items. This data does exist in the project.assets.json already, so we hopefully just need to find the correct location in the build to attach this data.
- use the new RID data to de-conflict, applying a 'best match' algorithm based on the RID being targeted (which would be a new input to conflict resolution) and the RIDs of the conflicting assets.