
Trouble loading CUDA support under dotnet-interactive (C#)

Open tombatron opened this issue 7 months ago • 10 comments

Hi there!

This may be related to #345, so please bear with me.

I'm trying to use TorchSharp with dotnet-interactive in a Jupyter notebook, and I'm encountering the following behavior:

image

Now, I am running my setup through Docker, so I wondered if perhaps I had an issue there, so I made a quick console application to test "connectivity" with my GPU.

image

I'm kind of struggling to get my arms around the issue, what are some next steps I could take?

Cheers!

tombatron avatar Nov 14 '23 22:11 tombatron

I've tried to reproduce this problem with WSL, but I'm running into a very different problem, one that doesn't even get as far as calling is_available().

NiklasGustafsson avatar Nov 16 '23 18:11 NiklasGustafsson

It's worth trying -- and this is a total shot in the dark -- to delete everything *torch* under ~/.nuget/packages/ and then try again. I wonder if there's some sort of package confusion going on when running with .NET Interactive.
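
For anyone following along, here is a minimal sketch of that cleanup. It assumes the global packages folder is at the default ~/.nuget/packages; the function name purge_torch_packages is made up for illustration. It only prints what it would remove unless "delete" is passed as the second argument.

```shell
#!/usr/bin/env bash
# Sketch: remove cached NuGet packages whose directory name contains "torch".
# Assumption: the cache lives in ~/.nuget/packages unless a path is passed in.
# purge_torch_packages is a hypothetical helper, not part of any tool.
purge_torch_packages() {
    local packages_dir="${1:-$HOME/.nuget/packages}"
    local mode="${2:-dry-run}"          # pass "delete" to actually remove
    local dir
    for dir in "$packages_dir"/*torch*; do
        [ -d "$dir" ] || continue       # skip when the glob matches nothing
        if [ "$mode" = "delete" ]; then
            rm -rf -- "$dir"
            echo "removed $dir"
        else
            echo "would remove $dir"
        fi
    done
}

# Usage:
#   purge_torch_packages                            # dry run on the default cache
#   purge_torch_packages ~/.nuget/packages delete   # really delete
```

The dry run is the default so the list can be checked before anything is deleted.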

NiklasGustafsson avatar Nov 17 '23 19:11 NiklasGustafsson

Yeah that didn't seem to have any impact. :\

Here is a directory listing of my .nuget directory on the Jupyter server:

drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 google.protobuf
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 ilgpu
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part2-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part2-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment2
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment3
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part4-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part4-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part5-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part5-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part6
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part7
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 sharpziplib
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp.nativeassets.macos
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp.nativeassets.win32
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 system.memory
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 torchsharp
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 torchsharp-cuda-linux

Here is the error message:

System.TypeInitializationException: The type initializer for 'TorchSharp.torch' threw an exception.
 ---> System.NotSupportedException: The libtorch-cpu-linux-x64 package version 2.1.0.1 is not restored on this system. If using F# Interactive or .NET Interactive you may need to add a reference to this package, e.g. 
    #r "nuget: libtorch-cpu-linux-x64, 2.1.0.1". Trace from LoadNativeBackend:

TorchSharp: LoadNativeBackend: Initialising native backend, useCudaBackend = False

Step 1 - First try regular load of native libtorch binaries.

    Trying to load native component torch_cpu relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Failed to load native component torch_cpu relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Trying to load native component LibTorchSharp relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Failed to load native component LibTorchSharp relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Result from regular native load of LibTorchSharp is False

Step 3 - Alternative load from consolidated directory of native binaries from nuget packages

    torchsharpLoc = /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0
    packagesDir = /home/jovyan/.nuget/packages
    torchsharpHome = /home/jovyan/.nuget/packages/torchsharp/0.101.2
    Trying dynamic load for .NET/F# Interactive by consolidating native libtorch-cpu-linux-x64-* binaries to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/cpu...
    Consolidating native binaries, packagesDir=/home/jovyan/.nuget/packages, packagePattern=libtorch-cpu-linux-x64, packageVersion=2.1.0.1 to target=/home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/cpu...

   at TorchSharp.torch.LoadNativeBackend(Boolean useCudaBackend, StringBuilder& trace)
   at TorchSharp.torch.InitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.InitializeDevice(Device device)
   at TorchSharp.torch..cctor()
   --- End of inner exception stack trace ---
   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.cuda.is_available()
   at Submission#5.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)
   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.cuda.is_available()
   at Submission#5.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

tombatron avatar Nov 18 '23 16:11 tombatron

A co-worker of mine (@wss-rbrennan) may have shed some light on this issue:

"The problem has more to do with nuget itself. TorchSharp used a clever way of putting together the libtorch-cuda-12.1-linux-x64 package because nuget has a max package size of 250 MB. The workaround combines multiple packages at build time in a project, so your project works, but interactive doesn't build the same way, so the reference fails."

Not sure if this is a problem per se, or just something to account for when using TorchSharp from within interactive mode or whatever?

tombatron avatar Nov 29 '23 17:11 tombatron

Thank you for the follow-up, and that's sort of what I was seeing, too. But... it used to work!

The stitching together only happens the first time, i.e. when a build finds that the stitched package is not available in the NuGet cache locally.
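
A rough way to see what is currently in the cache is sketched below. It only reports which libtorch package directories exist and how many shared-object-like files each one holds; it cannot tell whether stitching actually completed. The function name list_libtorch_packages and the `*.so*` file pattern are assumptions for illustration, and the default package prefix is taken from the directory listing earlier in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list cached libtorch packages and count the native-looking files in each.
# Assumptions: default cache path ~/.nuget/packages, and native binaries
# (including any fragments) have ".so" somewhere in their file name.
# list_libtorch_packages is a hypothetical helper, not part of TorchSharp.
list_libtorch_packages() {
    local packages_dir="${1:-$HOME/.nuget/packages}"
    local pattern="${2:-libtorch-cuda-12.1-linux-x64}"
    local dir count
    for dir in "$packages_dir"/"$pattern"*; do
        [ -d "$dir" ] || continue       # skip when the glob matches nothing
        count="$(find "$dir" -type f -name '*.so*' | wc -l | tr -d ' ')"
        echo "$(basename "$dir"): $count native file(s)"
    done
}

# Usage:
#   list_libtorch_packages
#   list_libtorch_packages ~/.nuget/packages libtorch-cpu-linux-x64
```

Comparing the output before and after a dotnet build of a regular project might show whether the build changed anything in the cache.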

NiklasGustafsson avatar Nov 29 '23 17:11 NiklasGustafsson

You think there is some sort of snippet that could be run to ensure proper stitching?

tombatron avatar Nov 29 '23 17:11 tombatron

And it works on Windows, which has the same package stitching problem.

NiklasGustafsson avatar Nov 29 '23 17:11 NiklasGustafsson

You think there is some sort of snippet that could be run to ensure proper stitching?

All I can think of is a dotnet build, but I think you already did that and it worked, so the stitching should already have been done.

NiklasGustafsson avatar Nov 29 '23 17:11 NiklasGustafsson

Or, maybe... clear the ~/.nuget/packages cache, as well as anything under ~/.packagemanagement/nuget. Then, build your console program again, then try the .ipynb file again. Another shot in the dark...
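
Scripted, that sequence might look like the sketch below. The function name reset_caches_and_rebuild is made up for illustration, and the project directory is whatever folder holds the console program. Note that this wipes the entire package cache, so every package is downloaded again on the next restore.

```shell
#!/usr/bin/env bash
# Sketch: wipe both NuGet cache locations mentioned above, then rebuild the
# console project so the packages are restored from scratch.
# reset_caches_and_rebuild is a hypothetical helper name.
reset_caches_and_rebuild() {
    local project_dir="$1"
    # Refuse to delete anything if the project directory is missing.
    [ -d "$project_dir" ] || { echo "no such project directory: $project_dir" >&2; return 1; }
    # ${HOME:?} aborts instead of expanding to an empty path if HOME is unset.
    rm -rf -- "${HOME:?}/.nuget/packages" "${HOME:?}/.packagemanagement/nuget"
    # Run the build in a subshell so the caller's working directory is unchanged.
    ( cd "$project_dir" && dotnet build )
}

# Usage (the path is a placeholder):
#   reset_caches_and_rebuild /path/to/console-app
# Then re-run the .ipynb file.
```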

NiklasGustafsson avatar Nov 29 '23 17:11 NiklasGustafsson

Okay, so after a bunch of finagling, I finally got to where you are: no blow-up when loading the backend, but is_available() returns false. It works fine when I run one of the TorchExamples on CUDA, or on Windows, whether interactively or as a console app.

NiklasGustafsson avatar Nov 29 '23 20:11 NiklasGustafsson