
GPU pre-built install on Windows crashes on training with luz

Open Bernie-K opened this issue 10 months ago • 10 comments

I installed a torch GPU pre-built binary via the script on Windows 10, with kind adjusted to "cu124" (CUDA 12.4). This seems to be the only CUDA pre-built currently supported (see #1272). I need a pre-built binary because the machine has an older CUDA / TensorFlow / Keras installation.

options(timeout = 600) # increasing timeout is recommended since we will be downloading a 2GB file.
# For Windows and Linux: "cpu" and "cu124" are currently supported
# For macOS: "cpu-intel" or "cpu-m1"
kind <- "cu124"
version <- available.packages()["torch","Version"]
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other from which you want to install the other R dependencies.
))
install.packages("torch")

This downloads a ~2.5 GB zip https://torch-cdn.mlverse.org/packages/cu124/0.14.2/bin/windows/contrib/4.4/torch_0.14.2.zip
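After installation, which backend the binary was actually built with can be checked from R. A quick sanity-check sketch (the `backends_cudnn_*` function names are assumed from the torch reference docs):

```r
library(torch)

# Confirm the acceleration backend of the installed binary
cuda_is_available()            # expected TRUE for a cu124 build with a working driver
cuda_runtime_version()         # expected "12.4.0" for kind = "cu124"
backends_cudnn_is_available()  # cuDNN must also load for conv / LSTM layers
```

If the last call returns FALSE while CUDA itself is available, cuDNN-backed layers are the likely crash site.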

sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] torch_0.14.2

loaded via a namespace (and not attached):
 [1] processx_3.8.5    bit_4.5.0.1       compiler_4.4.2    R6_2.6.1          magrittr_2.0.3    cli_3.6.4        
 [7] tools_4.4.2       rstudioapi_0.17.1 Rcpp_1.0.14       bit64_4.6.0-1     coro_1.1.0        callr_3.7.6      
[13] ps_1.8.1          rlang_1.1.5      

cuda_is_available() returns TRUE. I can also create torch tensors with device = 'cuda' and perform operations on them, e.g. matrix multiplication.

However, any model training with luz crashes. This looks like #1275.

The example given there crashes, as does any other luz training (e.g. a conv net on the MNIST dataset):

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10,118,8),torch_rand(10))

res_lstm <- nn_module(
  initialize = function(num_lags = 118){
    self$preprocess <- function(x){
      device <- x$device
      processed_vector <- torch_zeros(c(dim(x)[1],18,8), device = device)
      processed_vector[,1:8,] <- x[,1:8,]
      start_indices <- seq(9, 108, 11)
      
      for (i in 1:10) {
        start_idx <- start_indices[i]
        window <- x[, start_idx:(start_idx + 10),]
        processed_vector[, i + 8, ] <- torch_mean(window, dim = 2)
      }
      
      return(processed_vector)
    }
    
    self$num_lags <- num_lags
    
    self$res <- nn_sequential(
      nn_flatten(),
      nn_dropout(0.2),
      nn_linear(144,184),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(184,46),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1)
    )
    
    self$lstm <- nn_lstm(8,46,batch_first = TRUE)
    self$lstm_connection <- nn_sequential(
      nn_sigmoid(),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1))
  },
  
  forward = function(x){
    res <- self$res(self$preprocess(x))
    lstm <- self$lstm_connection(
      self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
    )
    torch_squeeze(nn_sigmoid()(res + lstm))
  }
)

fitted <- res_lstm %>% 
  setup(
    loss = nn_mse_loss(), 
    optimizer = optim_adam
  ) %>% 
  fit(ds, epochs = 10, dataloader_options = list(batch_size = 5))
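To confirm the crash is GPU-specific, the same pipeline can be pinned to the CPU through luz's `accelerator()` (argument names assumed from the luz reference; `cpu = TRUE` disables CUDA). This sketch reuses `res_lstm` and `ds` from above:

```r
library(luz)
library(torch)

# Same setup as above, but force the CPU accelerator so luz never touches CUDA
fitted_cpu <- res_lstm %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam
  ) %>%
  fit(
    ds, epochs = 10,
    dataloader_options = list(batch_size = 5),
    accelerator = accelerator(cpu = TRUE)  # assumed signature; pins training to CPU
  )
```

If this runs while the default (CUDA) accelerator crashes, the problem is isolated to the GPU code path rather than to luz itself.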

The MWE without luz that @jarroyoe came up with runs:

library(torch)

x <- torch_rand(10,118,8)
y <- torch_rand(10)

res_lstm <- nn_module(
    initialize = function(){
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,118,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

model <- res_lstm()
optimizer <- optim_adam(params = model$parameters)

for (epoch in 1:100) {
  optimizer$zero_grad()
  y_pred <- model(x)
  loss <- torch_mean((y_pred - y)^2)
  cat("Epoch: ", epoch, "   Loss: ", loss$item(), "\n")
  loss$backward()
  optimizer$step()
}

I noticed that this runs on CPU only; any attempt to move the model and data to the GPU ends in a crash. With a pre-built CPU torch binary installed on the same machine, the examples run as expected.
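For reference, the GPU variant that crashes is just the loop above with the module and tensors moved to CUDA. A minimal sketch of what was attempted (the `$cuda()` methods on modules and tensors are assumed from the torch reference docs):

```r
library(torch)

# Move the module and data from the MWE above onto the GPU
model <- res_lstm()$cuda()   # moves all parameters to the CUDA device
x_gpu <- x$cuda()
y_gpu <- y$cuda()
optimizer <- optim_adam(params = model$parameters)

for (epoch in 1:100) {
  optimizer$zero_grad()
  y_pred <- model(x_gpu)     # with the cu124 pre-built, this step crashes the R session
  loss <- torch_mean((y_pred - y_gpu)^2)
  loss$backward()
  optimizer$step()
}
```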

Bernie-K avatar Mar 01 '25 09:03 Bernie-K

I'm currently experiencing the same issue; torch crashes while trying to train a model with CUDA.

sessionInfo()

R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United Arab Emirates.utf8  LC_CTYPE=English_United Arab Emirates.utf8
[3] LC_MONETARY=English_United Arab Emirates.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United Arab Emirates.utf8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] torch_0.14.2

loaded via a namespace (and not attached):
 [1] coro_1.1.0        R6_2.6.1          lubridate_1.9.4   bit_4.6.0         magrittr_2.0.3    glue_1.8.0       
 [7] timechange_0.3.0  bit64_4.6.0-1     generics_0.1.4    ps_1.9.1          cli_3.6.5         processx_3.8.6   
[13] callr_3.7.6       withr_3.0.2       compiler_4.5.1    rstudioapi_0.17.1 tools_4.5.1       Rcpp_1.0.14      
[19] rlang_1.1.6

torch::cuda_is_available()
[1] TRUE
torch::cuda_runtime_version()
[1] ‘12.4.0’

library(torch)
ℹ Additional software needs to be downloaded and installed for torch to work correctly.
trying URL 'https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-2.5.1%2Bcu124.zip'
Content type 'application/zip' length 2460542602 bytes (2346.6 MB)
downloaded 2346.6 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/cran/v0.14.2/latest/lantern-0.14.2+cu124-win64.zip'
Content type 'application/x-zip-compressed' length 2575123 bytes (2.5 MB)
downloaded 2.5 MB

horlar1 avatar Jun 17 '25 22:06 horlar1

This https://github.com/mlverse/torch/issues/1275#issuecomment-2657413489 solves the crash issue for me.

horlar1 avatar Jun 17 '25 23:06 horlar1

I get a crash using the pre-built binary when using nn_conv2d, but not with other layers.

bhvieira avatar Oct 30 '25 12:10 bhvieira

It's probably caused by an incompatible cuDNN version. Can you share more about the environment you are running? It would also be great to get a stack trace, e.g. by running outside the IDE so we capture the process stack trace.

dfalbel avatar Oct 30 '25 13:10 dfalbel

I got this

library(torch)
library(luz)

cat("=== Environment Diagnostics ===\n")
cat("R version:", R.version.string, "\n")
cat("Platform:", R.version$platform, "\n")
cat("Torch version:", as.character(packageVersion("torch")), "\n")
cat("Luz version:", as.character(packageVersion("luz")), "\n")

# Detailed CUDA diagnostics
cat("\n=== CUDA Environment ===\n")
cat("CUDA available:", cuda_is_available(), "\n")
cat("CUDA device count:", cuda_device_count(), "\n")

# Test CUDA tensor operations
cat("\n=== CUDA Tensor Test ===\n")
cuda_tensor <- torch_randn(10, 10, device = "cuda")
cat("✓ CUDA tensor creation successful\n")

# Test CUDA computation
result <- torch_mm(cuda_tensor, cuda_tensor)
cat("✓ CUDA matrix multiplication successful\n")

# Test cuDNN operations (convolution)
conv_layer <- nn_conv2d(3, 16, kernel_size = 3, padding = 1)$cuda()
test_input <- torch_randn(1, 3, 32, 32, device = "cuda")
conv_output <- conv_layer(test_input)
cat("✓ CUDA convolution (cuDNN) successful\n")

Returns:

> library(torch)
> library(luz)
> 
> cat("=== Environment Diagnostics ===\n")
=== Environment Diagnostics ===
> cat("R version:", R.version.string, "\n")
R version: R version 4.5.1 (2025-06-13 ucrt)
> cat("Platform:", R.version$platform, "\n")
Platform: x86_64-w64-mingw32
> cat("Torch version:", as.character(packageVersion("torch")), "\n")
Torch version: 0.16.1
> cat("Luz version:", as.character(packageVersion("luz")), "\n")
Luz version: 0.5.0
>
> # Detailed CUDA diagnostics
> cat("\n=== CUDA Environment ===\n")

=== CUDA Environment ===
> cat("CUDA available:", cuda_is_available(), "\n")
CUDA available: TRUE 
> cat("CUDA device count:", cuda_device_count(), "\n")
CUDA device count: 1
>
> # Test CUDA tensor operations
> cat("\n=== CUDA Tensor Test ===\n")

=== CUDA Tensor Test ===
> cuda_tensor <- torch_randn(10, 10, device = "cuda")
> cat("✓ CUDA tensor creation successful\n")
✓ CUDA tensor creation successful
>
> # Test CUDA computation
> result <- torch_mm(cuda_tensor, cuda_tensor)
> cat("✓ CUDA matrix multiplication successful\n")
✓ CUDA matrix multiplication successful
>
> # Test cuDNN operations (convolution)
> conv_layer <- nn_conv2d(3, 16, kernel_size = 3, padding = 1)$cuda()
> test_input <- torch_randn(1, 3, 32, 32, device = "cuda")
> conv_output <- conv_layer(test_input)
Could not locate cudnn_graph64_9.dll. Please make sure it is in your library path!
Invalid handle. Cannot load symbol cudnnCreate
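Given that error, a quick follow-up check (base R only, assuming the DLLs would ship inside the package directory) is to search the install for cuDNN libraries; cudnn_graph64_9.dll should show up here if it was bundled:

```r
# List any cuDNN DLLs bundled with the torch package install.
# system.file() returns "" if the package is not installed.
pkg_dir <- system.file(package = "torch")
cudnn_dlls <- list.files(pkg_dir, pattern = "^cudnn.*\\.dll$",
                         recursive = TRUE, full.names = TRUE)
print(cudnn_dlls)  # character(0) means no cuDNN DLLs were found in the package
```

An empty result would suggest the DLL has to come from elsewhere on the library path, e.g. a separate cuDNN install.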

bhvieira avatar Oct 30 '25 13:10 bhvieira

Thanks @bhvieira , that's very helpful!

Would you be able to look inside the directory returned by system.file(package="torch") for cuda-related files? Perhaps something like:

fs::dir_ls(system.file(package="torch"), recurse=TRUE, glob="cuda*")

dfalbel avatar Oct 30 '25 13:10 dfalbel

Long list of files

torch/include/ATen/cuda torch/include/ATen/cuda/ApplyGridUtils.cuh torch/include/ATen/cuda/AsmUtils.cuh torch/include/ATen/cuda/ATenCUDAGeneral.h torch/include/ATen/cuda/Atomic.cuh torch/include/ATen/cuda/CachingHostAllocator.h torch/include/ATen/cuda/cub.cuh torch/include/ATen/cuda/cub.h torch/include/ATen/cuda/cub_definitions.cuh torch/include/ATen/cuda/CUDAApplyUtils.cuh torch/include/ATen/cuda/CUDABlas.h torch/include/ATen/cuda/CUDAConfig.h torch/include/ATen/cuda/CUDAContext.h torch/include/ATen/cuda/CUDAContextLight.h torch/include/ATen/cuda/CUDADataType.h torch/include/ATen/cuda/CUDADevice.h torch/include/ATen/cuda/CUDAEvent.h torch/include/ATen/cuda/CUDAGeneratorImpl.h torch/include/ATen/cuda/CUDAGraph.h torch/include/ATen/cuda/CUDAGraphsUtils.cuh torch/include/ATen/cuda/CUDASparse.h torch/include/ATen/cuda/CUDASparseBlas.h torch/include/ATen/cuda/CUDASparseDescriptors.h torch/include/ATen/cuda/CUDATensorMethods.cuh torch/include/ATen/cuda/CUDAUtils.h torch/include/ATen/cuda/detail torch/include/ATen/cuda/detail/CUDAHooks.h torch/include/ATen/cuda/detail/DeviceThreadHandles.h torch/include/ATen/cuda/detail/IndexUtils.cuh torch/include/ATen/cuda/detail/IntegerDivider.cuh torch/include/ATen/cuda/detail/KernelUtils.h torch/include/ATen/cuda/detail/LazyNVRTC.h torch/include/ATen/cuda/detail/OffsetCalculator.cuh torch/include/ATen/cuda/detail/PhiloxCudaStateRaw.cuh torch/include/ATen/cuda/detail/TensorInfo.cuh torch/include/ATen/cuda/detail/UnpackRaw.cuh torch/include/ATen/cuda/DeviceUtils.cuh torch/include/ATen/cuda/EmptyTensor.h torch/include/ATen/cuda/Exceptions.h torch/include/ATen/cuda/jiterator.h torch/include/ATen/cuda/jiterator_impl.h torch/include/ATen/cuda/llvm_jit_strings.h torch/include/ATen/cuda/NumericLimits.cuh torch/include/ATen/cuda/PeerToPeerAccess.h torch/include/ATen/cuda/PhiloxCudaState.h torch/include/ATen/cuda/PhiloxUtils.cuh torch/include/ATen/cuda/PinnedMemoryAllocator.h torch/include/ATen/cuda/ScanUtils.cuh 
torch/include/ATen/cuda/Sleep.h torch/include/ATen/cuda/ThrustAllocator.h torch/include/ATen/cuda/tunable torch/include/ATen/cuda/tunable/GemmCommon.h torch/include/ATen/cuda/tunable/GemmHipblaslt.h torch/include/ATen/cuda/tunable/GemmRocblas.h torch/include/ATen/cuda/tunable/StreamTimer.h torch/include/ATen/cuda/tunable/Tunable.h torch/include/ATen/cuda/tunable/TunableGemm.h torch/include/ATen/cuda/tunable/TunableOp.h torch/include/ATen/native/cuda torch/include/ATen/native/cuda/Activation.h torch/include/ATen/native/cuda/BinaryInternal.h torch/include/ATen/native/cuda/block_reduce.cuh torch/include/ATen/native/cuda/CompositeRandomAccessor.h torch/include/ATen/native/cuda/Copy.h torch/include/ATen/native/cuda/CUDAJitLoops.cuh torch/include/ATen/native/cuda/CUDALoops.cuh torch/include/ATen/native/cuda/CuFFTPlanCache.h torch/include/ATen/native/cuda/CuFFTUtils.h torch/include/ATen/native/cuda/cutlass_utils.cuh torch/include/ATen/native/cuda/DeviceSqrt.cuh torch/include/ATen/native/cuda/Distributions.h torch/include/ATen/native/cuda/DistributionTemplates.h torch/include/ATen/native/cuda/EmbeddingBackwardKernel.cuh torch/include/ATen/native/cuda/ForeachFunctors.cuh torch/include/ATen/native/cuda/ForeachMinMaxFunctors.cuh torch/include/ATen/native/cuda/fused_adamw_amsgrad_impl.cuh torch/include/ATen/native/cuda/fused_adamw_impl.cuh torch/include/ATen/native/cuda/fused_adam_amsgrad_impl.cuh torch/include/ATen/native/cuda/fused_adam_impl.cuh torch/include/ATen/native/cuda/fused_adam_utils.cuh torch/include/ATen/native/cuda/GridSampler.cuh torch/include/ATen/native/cuda/GridSampler.h torch/include/ATen/native/cuda/im2col.cuh torch/include/ATen/native/cuda/IndexKernel.h torch/include/ATen/native/cuda/JitLoops.cuh torch/include/ATen/native/cuda/jit_utils.h torch/include/ATen/native/cuda/KernelUtils.cuh torch/include/ATen/native/cuda/LaunchUtils.h torch/include/ATen/native/cuda/Loops.cuh torch/include/ATen/native/cuda/Math.cuh torch/include/ATen/native/cuda/MemoryAccess.cuh 
torch/include/ATen/native/cuda/MiscUtils.h torch/include/ATen/native/cuda/MultiTensorApply.cuh torch/include/ATen/native/cuda/Normalization.cuh torch/include/ATen/native/cuda/PersistentSoftmax.cuh torch/include/ATen/native/cuda/Pow.cuh torch/include/ATen/native/cuda/Randperm.cuh torch/include/ATen/native/cuda/Reduce.cuh torch/include/ATen/native/cuda/ReduceOps.h torch/include/ATen/native/cuda/reduction_template.cuh torch/include/ATen/native/cuda/Resize.h torch/include/ATen/native/cuda/RowwiseScaledMM.h torch/include/ATen/native/cuda/ScaledGroupMM.h torch/include/ATen/native/cuda/ScanKernels.h torch/include/ATen/native/cuda/ScanUtils.cuh torch/include/ATen/native/cuda/Sort.h torch/include/ATen/native/cuda/Sorting.h torch/include/ATen/native/cuda/SortingCommon.cuh torch/include/ATen/native/cuda/SortingRadixSelect.cuh torch/include/ATen/native/cuda/SortStable.h torch/include/ATen/native/cuda/SortUtils.cuh torch/include/ATen/native/cuda/TensorModeKernel.cuh torch/include/ATen/native/cuda/TensorModeKernel.h torch/include/ATen/native/cuda/TensorTopK.h torch/include/ATen/native/cuda/thread_constants.h torch/include/ATen/native/cuda/UniqueCub.cuh torch/include/ATen/native/cuda/UpSample.cuh torch/include/ATen/native/cuda/vol2col.cuh torch/include/ATen/native/transformers/cuda torch/include/ATen/native/transformers/cuda/flash_attn torch/include/ATen/native/transformers/cuda/flash_attn/flash_api.h torch/include/ATen/native/transformers/cuda/flash_attn/static_switch.h torch/include/ATen/native/transformers/cuda/mem_eff_attention torch/include/ATen/native/transformers/cuda/mem_eff_attention/debug_utils.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/epilogue torch/include/ATen/native/transformers/cuda/mem_eff_attention/epilogue/epilogue_pipelined.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/epilogue/epilogue_rescale_output.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/epilogue/epilogue_thread_apply_logsumexp.h 
torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma_base.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma_multistage.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma_pipelined.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/find_default_mma.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/mma_accum_lambda_iterator.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm/mma_from_smem.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/gemm_kernel_utils.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/default_warp_iterator_from_smem.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/epilogue_predicated_tile_iterator.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/make_residual_last.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/predicated_tile_access_iterator_residual_last.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/predicated_tile_iterator_residual_last.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/transpose_warp_iterator.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/iterators/warp_iterator_from_smem.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/kernels torch/include/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassF.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/kernel_backward.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/kernel_forward.h 
torch/include/ATen/native/transformers/cuda/mem_eff_attention/pytorch_utils.h torch/include/ATen/native/transformers/cuda/mem_eff_attention/transform torch/include/ATen/native/transformers/cuda/mem_eff_attention/transform/tile_smem_loader.h torch/include/ATen/native/transformers/cuda/sdp_utils.h torch/include/ATen/ops/abs_cuda_dispatch.h torch/include/ATen/ops/acosh_cuda_dispatch.h torch/include/ATen/ops/acos_cuda_dispatch.h torch/include/ATen/ops/adaptive_avg_pool2d_cuda_dispatch.h torch/include/ATen/ops/adaptive_avg_pool3d_backward_cuda_dispatch.h torch/include/ATen/ops/adaptive_avg_pool3d_cuda_dispatch.h torch/include/ATen/ops/adaptive_max_pool2d_backward_cuda_dispatch.h torch/include/ATen/ops/adaptive_max_pool2d_cuda_dispatch.h torch/include/ATen/ops/adaptive_max_pool3d_backward_cuda_dispatch.h torch/include/ATen/ops/adaptive_max_pool3d_cuda_dispatch.h torch/include/ATen/ops/addbmm_cuda_dispatch.h torch/include/ATen/ops/addcdiv_cuda_dispatch.h torch/include/ATen/ops/addcmul_cuda_dispatch.h torch/include/ATen/ops/addmm_cuda_dispatch.h torch/include/ATen/ops/addmv_cuda_dispatch.h torch/include/ATen/ops/addr_cuda_dispatch.h torch/include/ATen/ops/add_cuda_dispatch.h torch/include/ATen/ops/all_cuda_dispatch.h torch/include/ATen/ops/amax_cuda_dispatch.h torch/include/ATen/ops/aminmax_cuda_dispatch.h torch/include/ATen/ops/amin_cuda_dispatch.h torch/include/ATen/ops/angle_cuda_dispatch.h torch/include/ATen/ops/any_cuda_dispatch.h torch/include/ATen/ops/arange_cuda_dispatch.h torch/include/ATen/ops/argmax_cuda_dispatch.h torch/include/ATen/ops/argmin_cuda_dispatch.h torch/include/ATen/ops/asinh_cuda_dispatch.h torch/include/ATen/ops/asin_cuda_dispatch.h torch/include/ATen/ops/as_strided_cuda_dispatch.h torch/include/ATen/ops/atan2_cuda_dispatch.h torch/include/ATen/ops/atanh_cuda_dispatch.h torch/include/ATen/ops/atan_cuda_dispatch.h torch/include/ATen/ops/avg_pool2d_backward_cuda_dispatch.h torch/include/ATen/ops/avg_pool2d_cuda_dispatch.h 
torch/include/ATen/ops/avg_pool3d_backward_cuda_dispatch.h torch/include/ATen/ops/avg_pool3d_cuda_dispatch.h torch/include/ATen/ops/baddbmm_cuda_dispatch.h torch/include/ATen/ops/batch_norm_backward_cuda_dispatch.h torch/include/ATen/ops/batch_norm_backward_elemt_cuda_dispatch.h torch/include/ATen/ops/batch_norm_backward_reduce_cuda_dispatch.h torch/include/ATen/ops/batch_norm_elemt_cuda_dispatch.h torch/include/ATen/ops/batch_norm_gather_stats_cuda_dispatch.h torch/include/ATen/ops/batch_norm_gather_stats_with_counts_cuda_dispatch.h torch/include/ATen/ops/batch_norm_stats_cuda_dispatch.h torch/include/ATen/ops/batch_norm_update_stats_cuda_dispatch.h torch/include/ATen/ops/bernoulli_cuda_dispatch.h torch/include/ATen/ops/binary_cross_entropy_backward_cuda_dispatch.h torch/include/ATen/ops/binary_cross_entropy_cuda_dispatch.h torch/include/ATen/ops/bincount_cuda_dispatch.h torch/include/ATen/ops/binomial_cuda_dispatch.h torch/include/ATen/ops/bitwise_and_cuda_dispatch.h torch/include/ATen/ops/bitwise_left_shift_cuda_dispatch.h torch/include/ATen/ops/bitwise_not_cuda_dispatch.h torch/include/ATen/ops/bitwise_or_cuda_dispatch.h torch/include/ATen/ops/bitwise_right_shift_cuda_dispatch.h torch/include/ATen/ops/bitwise_xor_cuda_dispatch.h torch/include/ATen/ops/bmm_cuda_dispatch.h torch/include/ATen/ops/bucketize_cuda_dispatch.h torch/include/ATen/ops/cat_cuda_dispatch.h torch/include/ATen/ops/cauchy_cuda_dispatch.h torch/include/ATen/ops/ceil_cuda_dispatch.h torch/include/ATen/ops/channel_shuffle_cuda_dispatch.h torch/include/ATen/ops/cholesky_cuda_dispatch.h torch/include/ATen/ops/cholesky_inverse_cuda_dispatch.h torch/include/ATen/ops/clamp_cuda_dispatch.h torch/include/ATen/ops/clamp_max_cuda_dispatch.h torch/include/ATen/ops/clamp_min_cuda_dispatch.h torch/include/ATen/ops/col2im_cuda_dispatch.h torch/include/ATen/ops/complex_cuda_dispatch.h torch/include/ATen/ops/conj_physical_cuda_dispatch.h torch/include/ATen/ops/convolution_backward_cuda_dispatch.h 
torch/include/ATen/ops/conv_depthwise3d_cuda_dispatch.h torch/include/ATen/ops/copysign_cuda_dispatch.h torch/include/ATen/ops/cosh_cuda_dispatch.h torch/include/ATen/ops/cos_cuda_dispatch.h torch/include/ATen/ops/count_nonzero_cuda_dispatch.h torch/include/ATen/ops/cudnn_affine_grid_generator_backward_cuda_dispatch.h torch/include/ATen/ops/cudnn_affine_grid_generator_cuda_dispatch.h torch/include/ATen/ops/cudnn_batch_norm_backward_cuda_dispatch.h torch/include/ATen/ops/cudnn_batch_norm_cuda_dispatch.h torch/include/ATen/ops/cudnn_convolution_add_relu_cuda_dispatch.h torch/include/ATen/ops/cudnn_convolution_cuda_dispatch.h torch/include/ATen/ops/cudnn_convolution_relu_cuda_dispatch.h torch/include/ATen/ops/cudnn_convolution_transpose_cuda_dispatch.h torch/include/ATen/ops/cudnn_grid_sampler_backward_cuda_dispatch.h torch/include/ATen/ops/cudnn_grid_sampler_cuda_dispatch.h torch/include/ATen/ops/cumprod_cuda_dispatch.h torch/include/ATen/ops/cumsum_cuda_dispatch.h torch/include/ATen/ops/dequantize_cuda_dispatch.h torch/include/ATen/ops/digamma_cuda_dispatch.h torch/include/ATen/ops/div_cuda_dispatch.h torch/include/ATen/ops/dot_cuda_dispatch.h torch/include/ATen/ops/elu_backward_cuda_dispatch.h torch/include/ATen/ops/elu_cuda_dispatch.h torch/include/ATen/ops/embedding_dense_backward_cuda_dispatch.h torch/include/ATen/ops/embedding_renorm_cuda_dispatch.h torch/include/ATen/ops/empty_cuda_dispatch.h torch/include/ATen/ops/empty_strided_cuda_dispatch.h torch/include/ATen/ops/equal_cuda_dispatch.h torch/include/ATen/ops/eq_cuda_dispatch.h torch/include/ATen/ops/erfc_cuda_dispatch.h torch/include/ATen/ops/erfinv_cuda_dispatch.h torch/include/ATen/ops/erf_cuda_dispatch.h torch/include/ATen/ops/exp2_cuda_dispatch.h torch/include/ATen/ops/expm1_cuda_dispatch.h torch/include/ATen/ops/exponential_cuda_dispatch.h torch/include/ATen/ops/exp_cuda_dispatch.h torch/include/ATen/ops/eye_cuda_dispatch.h 
torch/include/ATen/ops/fake_quantize_per_channel_affine_cachemask_cuda_dispatch.h torch/include/ATen/ops/fake_quantize_per_tensor_affine_cachemask_cuda_dispatch.h torch/include/ATen/ops/fill_cuda_dispatch.h torch/include/ATen/ops/flip_cuda_dispatch.h torch/include/ATen/ops/floor_cuda_dispatch.h torch/include/ATen/ops/floor_divide_cuda_dispatch.h torch/include/ATen/ops/fmax_cuda_dispatch.h torch/include/ATen/ops/fmin_cuda_dispatch.h torch/include/ATen/ops/fmod_cuda_dispatch.h torch/include/ATen/ops/fractional_max_pool2d_backward_cuda_dispatch.h torch/include/ATen/ops/fractional_max_pool2d_cuda_dispatch.h torch/include/ATen/ops/fractional_max_pool3d_backward_cuda_dispatch.h torch/include/ATen/ops/fractional_max_pool3d_cuda_dispatch.h torch/include/ATen/ops/frac_cuda_dispatch.h torch/include/ATen/ops/frexp_cuda_dispatch.h torch/include/ATen/ops/gather_cuda_dispatch.h torch/include/ATen/ops/gcd_cuda_dispatch.h torch/include/ATen/ops/gelu_backward_cuda_dispatch.h torch/include/ATen/ops/gelu_cuda_dispatch.h torch/include/ATen/ops/geometric_cuda_dispatch.h torch/include/ATen/ops/geqrf_cuda_dispatch.h torch/include/ATen/ops/ge_cuda_dispatch.h torch/include/ATen/ops/glu_backward_cuda_dispatch.h torch/include/ATen/ops/glu_backward_jvp_cuda_dispatch.h torch/include/ATen/ops/glu_cuda_dispatch.h torch/include/ATen/ops/glu_jvp_cuda_dispatch.h torch/include/ATen/ops/grid_sampler_2d_backward_cuda_dispatch.h torch/include/ATen/ops/grid_sampler_2d_cuda_dispatch.h torch/include/ATen/ops/grid_sampler_3d_backward_cuda_dispatch.h torch/include/ATen/ops/grid_sampler_3d_cuda_dispatch.h torch/include/ATen/ops/gt_cuda_dispatch.h torch/include/ATen/ops/hardshrink_backward_cuda_dispatch.h torch/include/ATen/ops/hardshrink_cuda_dispatch.h torch/include/ATen/ops/hardsigmoid_backward_cuda_dispatch.h torch/include/ATen/ops/hardsigmoid_cuda_dispatch.h torch/include/ATen/ops/hardswish_backward_cuda_dispatch.h torch/include/ATen/ops/hardswish_cuda_dispatch.h 
torch/include/ATen/ops/hardtanh_backward_cuda_dispatch.h torch/include/ATen/ops/hardtanh_cuda_dispatch.h torch/include/ATen/ops/heaviside_cuda_dispatch.h torch/include/ATen/ops/histc_cuda_dispatch.h torch/include/ATen/ops/huber_loss_backward_cuda_dispatch.h torch/include/ATen/ops/huber_loss_cuda_dispatch.h torch/include/ATen/ops/hypot_cuda_dispatch.h torch/include/ATen/ops/i0_cuda_dispatch.h torch/include/ATen/ops/igammac_cuda_dispatch.h torch/include/ATen/ops/igamma_cuda_dispatch.h torch/include/ATen/ops/im2col_cuda_dispatch.h torch/include/ATen/ops/index_add_cuda_dispatch.h torch/include/ATen/ops/index_copy_cuda_dispatch.h torch/include/ATen/ops/index_cuda_dispatch.h torch/include/ATen/ops/index_fill_cuda_dispatch.h torch/include/ATen/ops/index_reduce_cuda_dispatch.h torch/include/ATen/ops/index_select_cuda_dispatch.h torch/include/ATen/ops/isin_cuda_dispatch.h torch/include/ATen/ops/isnan_cuda_dispatch.h torch/include/ATen/ops/isneginf_cuda_dispatch.h torch/include/ATen/ops/isposinf_cuda_dispatch.h torch/include/ATen/ops/is_set_to_cuda_dispatch.h torch/include/ATen/ops/kthvalue_cuda_dispatch.h torch/include/ATen/ops/lcm_cuda_dispatch.h torch/include/ATen/ops/leaky_relu_backward_cuda_dispatch.h torch/include/ATen/ops/leaky_relu_cuda_dispatch.h torch/include/ATen/ops/lerp_cuda_dispatch.h torch/include/ATen/ops/le_cuda_dispatch.h torch/include/ATen/ops/lgamma_cuda_dispatch.h torch/include/ATen/ops/linalg_cholesky_ex_cuda_dispatch.h torch/include/ATen/ops/linalg_cross_cuda_dispatch.h torch/include/ATen/ops/linalg_eigvals_cuda_dispatch.h torch/include/ATen/ops/linalg_eig_cuda_dispatch.h torch/include/ATen/ops/linalg_householder_product_cuda_dispatch.h torch/include/ATen/ops/linalg_inv_ex_cuda_dispatch.h torch/include/ATen/ops/linalg_ldl_factor_ex_cuda_dispatch.h torch/include/ATen/ops/linalg_ldl_solve_cuda_dispatch.h torch/include/ATen/ops/linalg_lstsq_cuda_dispatch.h torch/include/ATen/ops/linalg_lu_cuda_dispatch.h 
torch/include/ATen/ops/linalg_lu_factor_ex_cuda_dispatch.h torch/include/ATen/ops/linalg_lu_solve_cuda_dispatch.h torch/include/ATen/ops/linalg_matrix_exp_cuda_dispatch.h torch/include/ATen/ops/linalg_qr_cuda_dispatch.h torch/include/ATen/ops/linalg_solve_triangular_cuda_dispatch.h torch/include/ATen/ops/linalg_vector_norm_cuda_dispatch.h torch/include/ATen/ops/linspace_cuda_dispatch.h torch/include/ATen/ops/log10_cuda_dispatch.h torch/include/ATen/ops/log1p_cuda_dispatch.h torch/include/ATen/ops/log2_cuda_dispatch.h torch/include/ATen/ops/logaddexp2_cuda_dispatch.h torch/include/ATen/ops/logaddexp_cuda_dispatch.h torch/include/ATen/ops/logical_and_cuda_dispatch.h torch/include/ATen/ops/logical_not_cuda_dispatch.h torch/include/ATen/ops/logical_or_cuda_dispatch.h torch/include/ATen/ops/logical_xor_cuda_dispatch.h torch/include/ATen/ops/logit_backward_cuda_dispatch.h torch/include/ATen/ops/logit_cuda_dispatch.h torch/include/ATen/ops/logspace_cuda_dispatch.h torch/include/ATen/ops/log_cuda_dispatch.h torch/include/ATen/ops/log_normal_cuda_dispatch.h torch/include/ATen/ops/log_sigmoid_backward_cuda_dispatch.h torch/include/ATen/ops/log_sigmoid_forward_cuda_dispatch.h torch/include/ATen/ops/lshift_cuda_dispatch.h torch/include/ATen/ops/lt_cuda_dispatch.h torch/include/ATen/ops/lu_unpack_cuda_dispatch.h torch/include/ATen/ops/masked_fill_cuda_dispatch.h torch/include/ATen/ops/masked_scatter_cuda_dispatch.h torch/include/ATen/ops/masked_select_cuda_dispatch.h torch/include/ATen/ops/maximum_cuda_dispatch.h torch/include/ATen/ops/max_cuda_dispatch.h torch/include/ATen/ops/max_pool2d_with_indices_backward_cuda_dispatch.h torch/include/ATen/ops/max_pool2d_with_indices_cuda_dispatch.h torch/include/ATen/ops/max_pool3d_with_indices_backward_cuda_dispatch.h torch/include/ATen/ops/max_pool3d_with_indices_cuda_dispatch.h torch/include/ATen/ops/max_unpool2d_cuda_dispatch.h torch/include/ATen/ops/max_unpool3d_cuda_dispatch.h torch/include/ATen/ops/mean_cuda_dispatch.h 
torch/include/ATen/ops/median_cuda_dispatch.h torch/include/ATen/ops/minimum_cuda_dispatch.h torch/include/ATen/ops/min_cuda_dispatch.h torch/include/ATen/ops/miopen_batch_norm_backward_cuda_dispatch.h torch/include/ATen/ops/miopen_batch_norm_cuda_dispatch.h torch/include/ATen/ops/miopen_convolution_add_relu_cuda_dispatch.h torch/include/ATen/ops/miopen_convolution_cuda_dispatch.h torch/include/ATen/ops/miopen_convolution_relu_cuda_dispatch.h torch/include/ATen/ops/miopen_convolution_transpose_cuda_dispatch.h torch/include/ATen/ops/miopen_depthwise_convolution_cuda_dispatch.h torch/include/ATen/ops/miopen_rnn_backward_cuda_dispatch.h torch/include/ATen/ops/miopen_rnn_cuda_dispatch.h torch/include/ATen/ops/mish_backward_cuda_dispatch.h torch/include/ATen/ops/mish_cuda_dispatch.h torch/include/ATen/ops/mm_cuda_dispatch.h torch/include/ATen/ops/mode_cuda_dispatch.h torch/include/ATen/ops/mse_loss_backward_cuda_dispatch.h torch/include/ATen/ops/mse_loss_cuda_dispatch.h torch/include/ATen/ops/multilabel_margin_loss_backward_cuda_dispatch.h torch/include/ATen/ops/multilabel_margin_loss_forward_cuda_dispatch.h torch/include/ATen/ops/multinomial_cuda_dispatch.h torch/include/ATen/ops/multi_margin_loss_backward_cuda_dispatch.h torch/include/ATen/ops/multi_margin_loss_cuda_dispatch.h torch/include/ATen/ops/mul_cuda_dispatch.h torch/include/ATen/ops/mvlgamma_cuda_dispatch.h torch/include/ATen/ops/nanmedian_cuda_dispatch.h torch/include/ATen/ops/nansum_cuda_dispatch.h torch/include/ATen/ops/nan_to_num_cuda_dispatch.h torch/include/ATen/ops/native_batch_norm_backward_cuda_dispatch.h torch/include/ATen/ops/native_batch_norm_cuda_dispatch.h torch/include/ATen/ops/native_dropout_backward_cuda_dispatch.h torch/include/ATen/ops/native_dropout_cuda_dispatch.h torch/include/ATen/ops/native_group_norm_backward_cuda_dispatch.h torch/include/ATen/ops/native_group_norm_cuda_dispatch.h torch/include/ATen/ops/native_layer_norm_backward_cuda_dispatch.h 
torch/include/ATen/ops/native_layer_norm_cuda_dispatch.h torch/include/ATen/ops/neg_cuda_dispatch.h torch/include/ATen/ops/nextafter_cuda_dispatch.h torch/include/ATen/ops/ne_cuda_dispatch.h torch/include/ATen/ops/nll_loss2d_backward_cuda_dispatch.h torch/include/ATen/ops/nll_loss2d_forward_cuda_dispatch.h torch/include/ATen/ops/nll_loss_backward_cuda_dispatch.h torch/include/ATen/ops/nll_loss_forward_cuda_dispatch.h torch/include/ATen/ops/nonzero_cuda_dispatch.h torch/include/ATen/ops/nonzero_static_cuda_dispatch.h torch/include/ATen/ops/normal_cuda_dispatch.h torch/include/ATen/ops/norm_cuda_dispatch.h torch/include/ATen/ops/ormqr_cuda_dispatch.h torch/include/ATen/ops/poisson_cuda_dispatch.h torch/include/ATen/ops/polar_cuda_dispatch.h torch/include/ATen/ops/polygamma_cuda_dispatch.h torch/include/ATen/ops/pow_cuda_dispatch.h torch/include/ATen/ops/prod_cuda_dispatch.h torch/include/ATen/ops/put_cuda_dispatch.h torch/include/ATen/ops/quantize_per_channel_cuda_dispatch.h torch/include/ATen/ops/quantize_per_tensor_cuda_dispatch.h torch/include/ATen/ops/quantize_per_tensor_dynamic_cuda_dispatch.h torch/include/ATen/ops/random_cuda_dispatch.h torch/include/ATen/ops/randperm_cuda_dispatch.h torch/include/ATen/ops/range_cuda_dispatch.h torch/include/ATen/ops/reciprocal_cuda_dispatch.h torch/include/ATen/ops/record_stream_cuda_dispatch.h torch/include/ATen/ops/reflection_pad1d_backward_cuda_dispatch.h torch/include/ATen/ops/reflection_pad1d_cuda_dispatch.h torch/include/ATen/ops/reflection_pad2d_backward_cuda_dispatch.h torch/include/ATen/ops/reflection_pad2d_cuda_dispatch.h torch/include/ATen/ops/reflection_pad3d_backward_cuda_dispatch.h torch/include/ATen/ops/reflection_pad3d_cuda_dispatch.h torch/include/ATen/ops/relu_cuda_dispatch.h torch/include/ATen/ops/remainder_cuda_dispatch.h torch/include/ATen/ops/renorm_cuda_dispatch.h torch/include/ATen/ops/repeat_interleave_cuda_dispatch.h torch/include/ATen/ops/replication_pad1d_backward_cuda_dispatch.h 
torch/include/ATen/ops/replication_pad1d_cuda_dispatch.h torch/include/ATen/ops/replication_pad2d_backward_cuda_dispatch.h torch/include/ATen/ops/replication_pad2d_cuda_dispatch.h torch/include/ATen/ops/replication_pad3d_backward_cuda_dispatch.h torch/include/ATen/ops/replication_pad3d_cuda_dispatch.h torch/include/ATen/ops/resize_cuda_dispatch.h torch/include/ATen/ops/roll_cuda_dispatch.h torch/include/ATen/ops/round_cuda_dispatch.h torch/include/ATen/ops/rrelu_with_noise_cuda_dispatch.h torch/include/ATen/ops/rshift_cuda_dispatch.h torch/include/ATen/ops/rsqrt_cuda_dispatch.h torch/include/ATen/ops/rsub_cuda_dispatch.h torch/include/ATen/ops/scatter_add_cuda_dispatch.h torch/include/ATen/ops/scatter_cuda_dispatch.h torch/include/ATen/ops/scatter_reduce_cuda_dispatch.h torch/include/ATen/ops/searchsorted_cuda_dispatch.h torch/include/ATen/ops/segment_reduce_cuda_dispatch.h torch/include/ATen/ops/set_cuda_dispatch.h torch/include/ATen/ops/sgn_cuda_dispatch.h torch/include/ATen/ops/sigmoid_backward_cuda_dispatch.h torch/include/ATen/ops/sigmoid_cuda_dispatch.h torch/include/ATen/ops/signbit_cuda_dispatch.h torch/include/ATen/ops/sign_cuda_dispatch.h torch/include/ATen/ops/silu_backward_cuda_dispatch.h torch/include/ATen/ops/silu_cuda_dispatch.h torch/include/ATen/ops/sinc_cuda_dispatch.h torch/include/ATen/ops/sinh_cuda_dispatch.h torch/include/ATen/ops/sin_cuda_dispatch.h torch/include/ATen/ops/slow_conv_dilated2d_cuda_dispatch.h torch/include/ATen/ops/slow_conv_dilated3d_cuda_dispatch.h torch/include/ATen/ops/slow_conv_transpose2d_cuda_dispatch.h torch/include/ATen/ops/slow_conv_transpose3d_cuda_dispatch.h torch/include/ATen/ops/smooth_l1_loss_backward_cuda_dispatch.h torch/include/ATen/ops/smooth_l1_loss_cuda_dispatch.h torch/include/ATen/ops/softplus_backward_cuda_dispatch.h torch/include/ATen/ops/softplus_cuda_dispatch.h torch/include/ATen/ops/softshrink_backward_cuda_dispatch.h torch/include/ATen/ops/softshrink_cuda_dispatch.h 
torch/include/ATen/ops/sort_cuda_dispatch.h torch/include/ATen/ops/special_airy_ai_cuda_dispatch.h torch/include/ATen/ops/special_bessel_j0_cuda_dispatch.h torch/include/ATen/ops/special_bessel_j1_cuda_dispatch.h torch/include/ATen/ops/special_bessel_y0_cuda_dispatch.h torch/include/ATen/ops/special_bessel_y1_cuda_dispatch.h torch/include/ATen/ops/special_chebyshev_polynomial_t_cuda_dispatch.h torch/include/ATen/ops/special_chebyshev_polynomial_u_cuda_dispatch.h torch/include/ATen/ops/special_chebyshev_polynomial_v_cuda_dispatch.h torch/include/ATen/ops/special_chebyshev_polynomial_w_cuda_dispatch.h torch/include/ATen/ops/special_entr_cuda_dispatch.h torch/include/ATen/ops/special_erfcx_cuda_dispatch.h torch/include/ATen/ops/special_hermite_polynomial_he_cuda_dispatch.h torch/include/ATen/ops/special_hermite_polynomial_h_cuda_dispatch.h torch/include/ATen/ops/special_i0e_cuda_dispatch.h torch/include/ATen/ops/special_i1e_cuda_dispatch.h torch/include/ATen/ops/special_i1_cuda_dispatch.h torch/include/ATen/ops/special_laguerre_polynomial_l_cuda_dispatch.h torch/include/ATen/ops/special_legendre_polynomial_p_cuda_dispatch.h torch/include/ATen/ops/special_log_ndtr_cuda_dispatch.h torch/include/ATen/ops/special_modified_bessel_i0_cuda_dispatch.h torch/include/ATen/ops/special_modified_bessel_i1_cuda_dispatch.h torch/include/ATen/ops/special_modified_bessel_k0_cuda_dispatch.h torch/include/ATen/ops/special_modified_bessel_k1_cuda_dispatch.h torch/include/ATen/ops/special_ndtri_cuda_dispatch.h torch/include/ATen/ops/special_scaled_modified_bessel_k0_cuda_dispatch.h torch/include/ATen/ops/special_scaled_modified_bessel_k1_cuda_dispatch.h torch/include/ATen/ops/special_shifted_chebyshev_polynomial_t_cuda_dispatch.h torch/include/ATen/ops/special_shifted_chebyshev_polynomial_u_cuda_dispatch.h torch/include/ATen/ops/special_shifted_chebyshev_polynomial_v_cuda_dispatch.h torch/include/ATen/ops/special_shifted_chebyshev_polynomial_w_cuda_dispatch.h 
torch/include/ATen/ops/special_spherical_bessel_j0_cuda_dispatch.h torch/include/ATen/ops/special_xlog1py_cuda_dispatch.h torch/include/ATen/ops/special_zeta_cuda_dispatch.h torch/include/ATen/ops/split_with_sizes_copy_cuda_dispatch.h torch/include/ATen/ops/sqrt_cuda_dispatch.h torch/include/ATen/ops/sspaddmm_cuda_dispatch.h torch/include/ATen/ops/std_cuda_dispatch.h torch/include/ATen/ops/std_mean_cuda_dispatch.h torch/include/ATen/ops/sub_cuda_dispatch.h torch/include/ATen/ops/sum_cuda_dispatch.h torch/include/ATen/ops/take_cuda_dispatch.h torch/include/ATen/ops/tanh_backward_cuda_dispatch.h torch/include/ATen/ops/tanh_cuda_dispatch.h torch/include/ATen/ops/tan_cuda_dispatch.h torch/include/ATen/ops/threshold_backward_cuda_dispatch.h torch/include/ATen/ops/threshold_cuda_dispatch.h torch/include/ATen/ops/topk_cuda_dispatch.h torch/include/ATen/ops/trace_cuda_dispatch.h torch/include/ATen/ops/triangular_solve_cuda_dispatch.h torch/include/ATen/ops/tril_cuda_dispatch.h torch/include/ATen/ops/tril_indices_cuda_dispatch.h torch/include/ATen/ops/triu_cuda_dispatch.h torch/include/ATen/ops/triu_indices_cuda_dispatch.h torch/include/ATen/ops/trunc_cuda_dispatch.h torch/include/ATen/ops/unfold_backward_cuda_dispatch.h torch/include/ATen/ops/unfold_cuda_dispatch.h torch/include/ATen/ops/uniform_cuda_dispatch.h torch/include/ATen/ops/unique_consecutive_cuda_dispatch.h torch/include/ATen/ops/unique_dim_consecutive_cuda_dispatch.h torch/include/ATen/ops/unique_dim_cuda_dispatch.h torch/include/ATen/ops/upsample_bicubic2d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_bicubic2d_cuda_dispatch.h torch/include/ATen/ops/upsample_bilinear2d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_bilinear2d_cuda_dispatch.h torch/include/ATen/ops/upsample_linear1d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_linear1d_cuda_dispatch.h torch/include/ATen/ops/upsample_nearest1d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_nearest1d_cuda_dispatch.h 
torch/include/ATen/ops/upsample_nearest2d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_nearest2d_cuda_dispatch.h torch/include/ATen/ops/upsample_nearest3d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_nearest3d_cuda_dispatch.h torch/include/ATen/ops/upsample_trilinear3d_backward_cuda_dispatch.h torch/include/ATen/ops/upsample_trilinear3d_cuda_dispatch.h torch/include/ATen/ops/var_cuda_dispatch.h torch/include/ATen/ops/var_mean_cuda_dispatch.h torch/include/ATen/ops/vdot_cuda_dispatch.h torch/include/ATen/ops/view_as_complex_cuda_dispatch.h torch/include/ATen/ops/view_as_real_cuda_dispatch.h torch/include/ATen/ops/view_cuda_dispatch.h torch/include/ATen/ops/where_cuda_dispatch.h torch/include/ATen/ops/xlogy_cuda_dispatch.h torch/include/ATen/ops/zero_cuda_dispatch.h torch/include/ATen/ops/_adaptive_avg_pool2d_backward_cuda_dispatch.h torch/include/ATen/ops/_adaptive_avg_pool2d_cuda_dispatch.h torch/include/ATen/ops/_adaptive_avg_pool3d_backward_cuda_dispatch.h torch/include/ATen/ops/_adaptive_avg_pool3d_cuda_dispatch.h torch/include/ATen/ops/_addmm_activation_cuda_dispatch.h torch/include/ATen/ops/_aminmax_cuda_dispatch.h torch/include/ATen/ops/_amp_foreach_non_finite_check_and_unscale_cuda_dispatch.h torch/include/ATen/ops/_amp_update_scale_cuda_dispatch.h torch/include/ATen/ops/_assert_async_cuda_dispatch.h torch/include/ATen/ops/_batch_norm_with_update_cuda_dispatch.h torch/include/ATen/ops/_cdist_backward_cuda_dispatch.h torch/include/ATen/ops/_cdist_forward_cuda_dispatch.h torch/include/ATen/ops/_cholesky_solve_helper_cuda_dispatch.h torch/include/ATen/ops/_chunk_cat_cuda_dispatch.h torch/include/ATen/ops/_compute_linear_combination_cuda_dispatch.h torch/include/ATen/ops/_convert_indices_from_coo_to_csr_cuda_dispatch.h torch/include/ATen/ops/_convert_indices_from_csr_to_coo_cuda_dispatch.h torch/include/ATen/ops/_convert_weight_to_int4pack_cuda_dispatch.h torch/include/ATen/ops/_conv_depthwise2d_cuda_dispatch.h 
torch/include/ATen/ops/_cslt_compress_cuda_dispatch.h torch/include/ATen/ops/_cslt_sparse_mm_cuda_dispatch.h torch/include/ATen/ops/_cslt_sparse_mm_search_cuda_dispatch.h torch/include/ATen/ops/_ctc_loss_backward_cuda_dispatch.h torch/include/ATen/ops/_ctc_loss_cuda_dispatch.h torch/include/ATen/ops/_cudnn_attention_forward_cuda_dispatch.h torch/include/ATen/ops/_cudnn_ctc_loss_cuda_dispatch.h torch/include/ATen/ops/_cudnn_init_dropout_state_cuda_dispatch.h torch/include/ATen/ops/_cudnn_rnn_backward_cuda_dispatch.h torch/include/ATen/ops/_cudnn_rnn_cuda_dispatch.h torch/include/ATen/ops/_cudnn_rnn_flatten_weight_cuda_dispatch.h torch/include/ATen/ops/_cummax_helper_cuda_dispatch.h torch/include/ATen/ops/_cummin_helper_cuda_dispatch.h torch/include/ATen/ops/_dirichlet_grad_cuda_dispatch.h torch/include/ATen/ops/_efficientzerotensor_cuda_dispatch.h torch/include/ATen/ops/_efficient_attention_backward_cuda_dispatch.h torch/include/ATen/ops/_efficient_attention_forward_cuda_dispatch.h torch/include/ATen/ops/_embedding_bag_backward_cuda_dispatch.h torch/include/ATen/ops/_embedding_bag_cuda_dispatch.h torch/include/ATen/ops/_embedding_bag_dense_backward_cuda_dispatch.h torch/include/ATen/ops/_embedding_bag_forward_only_cuda_dispatch.h torch/include/ATen/ops/_embedding_bag_per_sample_weights_backward_cuda_dispatch.h torch/include/ATen/ops/_fake_quantize_learnable_per_channel_affine_backward_cuda_dispatch.h torch/include/ATen/ops/_fake_quantize_learnable_per_channel_affine_cuda_dispatch.h torch/include/ATen/ops/_fake_quantize_learnable_per_tensor_affine_backward_cuda_dispatch.h torch/include/ATen/ops/_fake_quantize_learnable_per_tensor_affine_cuda_dispatch.h torch/include/ATen/ops/_fake_quantize_per_tensor_affine_cachemask_tensor_qparams_cuda_dispatch.h torch/include/ATen/ops/_fft_c2c_cuda_dispatch.h torch/include/ATen/ops/_fft_c2r_cuda_dispatch.h torch/include/ATen/ops/_fft_r2c_cuda_dispatch.h torch/include/ATen/ops/_fill_mem_eff_dropout_mask_cuda_dispatch.h 
torch/include/ATen/ops/_flash_attention_backward_cuda_dispatch.h torch/include/ATen/ops/_flash_attention_forward_cuda_dispatch.h torch/include/ATen/ops/_foreach_abs_cuda_dispatch.h torch/include/ATen/ops/_foreach_acos_cuda_dispatch.h torch/include/ATen/ops/_foreach_addcdiv_cuda_dispatch.h torch/include/ATen/ops/_foreach_addcmul_cuda_dispatch.h torch/include/ATen/ops/_foreach_add_cuda_dispatch.h torch/include/ATen/ops/_foreach_asin_cuda_dispatch.h torch/include/ATen/ops/_foreach_atan_cuda_dispatch.h torch/include/ATen/ops/_foreach_ceil_cuda_dispatch.h torch/include/ATen/ops/_foreach_clamp_max_cuda_dispatch.h torch/include/ATen/ops/_foreach_clamp_min_cuda_dispatch.h torch/include/ATen/ops/_foreach_copy_cuda_dispatch.h torch/include/ATen/ops/_foreach_cosh_cuda_dispatch.h torch/include/ATen/ops/_foreach_cos_cuda_dispatch.h torch/include/ATen/ops/_foreach_div_cuda_dispatch.h torch/include/ATen/ops/_foreach_erfc_cuda_dispatch.h torch/include/ATen/ops/_foreach_erf_cuda_dispatch.h torch/include/ATen/ops/_foreach_expm1_cuda_dispatch.h torch/include/ATen/ops/_foreach_exp_cuda_dispatch.h torch/include/ATen/ops/_foreach_floor_cuda_dispatch.h torch/include/ATen/ops/_foreach_frac_cuda_dispatch.h torch/include/ATen/ops/_foreach_lerp_cuda_dispatch.h torch/include/ATen/ops/_foreach_lgamma_cuda_dispatch.h torch/include/ATen/ops/_foreach_log10_cuda_dispatch.h torch/include/ATen/ops/_foreach_log1p_cuda_dispatch.h torch/include/ATen/ops/_foreach_log2_cuda_dispatch.h torch/include/ATen/ops/_foreach_log_cuda_dispatch.h torch/include/ATen/ops/_foreach_maximum_cuda_dispatch.h torch/include/ATen/ops/_foreach_max_cuda_dispatch.h torch/include/ATen/ops/_foreach_minimum_cuda_dispatch.h torch/include/ATen/ops/_foreach_mul_cuda_dispatch.h torch/include/ATen/ops/_foreach_neg_cuda_dispatch.h torch/include/ATen/ops/_foreach_norm_cuda_dispatch.h torch/include/ATen/ops/_foreach_pow_cuda_dispatch.h torch/include/ATen/ops/_foreach_reciprocal_cuda_dispatch.h 
torch/include/ATen/ops/_foreach_round_cuda_dispatch.h torch/include/ATen/ops/_foreach_rsqrt_cuda_dispatch.h torch/include/ATen/ops/_foreach_sigmoid_cuda_dispatch.h torch/include/ATen/ops/_foreach_sign_cuda_dispatch.h torch/include/ATen/ops/_foreach_sinh_cuda_dispatch.h torch/include/ATen/ops/_foreach_sin_cuda_dispatch.h torch/include/ATen/ops/_foreach_sqrt_cuda_dispatch.h torch/include/ATen/ops/_foreach_sub_cuda_dispatch.h torch/include/ATen/ops/_foreach_tanh_cuda_dispatch.h torch/include/ATen/ops/_foreach_tan_cuda_dispatch.h torch/include/ATen/ops/_foreach_trunc_cuda_dispatch.h torch/include/ATen/ops/_foreach_zero_cuda_dispatch.h torch/include/ATen/ops/_fused_adamw_cuda_dispatch.h torch/include/ATen/ops/_fused_adam_cuda_dispatch.h torch/include/ATen/ops/_fused_dropout_cuda_dispatch.h torch/include/ATen/ops/_fused_moving_avg_obs_fq_helper_cuda_dispatch.h torch/include/ATen/ops/_fused_sdp_choice_cuda_dispatch.h torch/include/ATen/ops/_fused_sgd_cuda_dispatch.h torch/include/ATen/ops/_index_put_impl_cuda_dispatch.h torch/include/ATen/ops/_int_mm_cuda_dispatch.h torch/include/ATen/ops/_jagged_to_padded_dense_forward_cuda_dispatch.h torch/include/ATen/ops/_linalg_det_cuda_dispatch.h torch/include/ATen/ops/_linalg_eigh_cuda_dispatch.h torch/include/ATen/ops/_linalg_eigvals_cuda_dispatch.h torch/include/ATen/ops/_linalg_slogdet_cuda_dispatch.h torch/include/ATen/ops/_linalg_solve_ex_cuda_dispatch.h torch/include/ATen/ops/_linalg_svd_cuda_dispatch.h torch/include/ATen/ops/_local_scalar_dense_cuda_dispatch.h torch/include/ATen/ops/_logcumsumexp_cuda_dispatch.h torch/include/ATen/ops/_log_softmax_backward_data_cuda_dispatch.h torch/include/ATen/ops/_log_softmax_cuda_dispatch.h torch/include/ATen/ops/_make_per_channel_quantized_tensor_cuda_dispatch.h torch/include/ATen/ops/_make_per_tensor_quantized_tensor_cuda_dispatch.h torch/include/ATen/ops/_masked_scale_cuda_dispatch.h torch/include/ATen/ops/_masked_softmax_backward_cuda_dispatch.h 
torch/include/ATen/ops/_masked_softmax_cuda_dispatch.h torch/include/ATen/ops/_mixed_dtypes_linear_cuda_dispatch.h torch/include/ATen/ops/_native_batch_norm_legit_cuda_dispatch.h torch/include/ATen/ops/_native_multi_head_attention_cuda_dispatch.h torch/include/ATen/ops/_nested_compute_contiguous_strides_offsets_cuda_dispatch.h torch/include/ATen/ops/_nested_from_padded_cuda_dispatch.h torch/include/ATen/ops/_nested_tensor_from_mask_cuda_dispatch.h torch/include/ATen/ops/_nested_tensor_from_mask_left_aligned_cuda_dispatch.h torch/include/ATen/ops/_nested_view_from_buffer_cuda_dispatch.h torch/include/ATen/ops/_padded_dense_to_jagged_forward_cuda_dispatch.h torch/include/ATen/ops/_pdist_backward_cuda_dispatch.h torch/include/ATen/ops/_pdist_forward_cuda_dispatch.h torch/include/ATen/ops/_prelu_kernel_backward_cuda_dispatch.h torch/include/ATen/ops/_prelu_kernel_cuda_dispatch.h torch/include/ATen/ops/_reshape_alias_cuda_dispatch.h torch/include/ATen/ops/_sample_dirichlet_cuda_dispatch.h torch/include/ATen/ops/_scaled_dot_product_cudnn_attention_backward_cuda_dispatch.h torch/include/ATen/ops/_scaled_dot_product_cudnn_attention_cuda_dispatch.h torch/include/ATen/ops/_scaled_dot_product_efficient_attention_backward_cuda_dispatch.h torch/include/ATen/ops/_scaled_dot_product_efficient_attention_cuda_dispatch.h torch/include/ATen/ops/_scaled_dot_product_flash_attention_backward_cuda_dispatch.h torch/include/ATen/ops/_scaled_dot_product_flash_attention_cuda_dispatch.h torch/include/ATen/ops/_scaled_grouped_mm_cuda_dispatch.h torch/include/ATen/ops/_scaled_mm_cuda_dispatch.h torch/include/ATen/ops/_segment_reduce_backward_cuda_dispatch.h torch/include/ATen/ops/_slow_conv2d_backward_cuda_dispatch.h torch/include/ATen/ops/_slow_conv2d_forward_cuda_dispatch.h torch/include/ATen/ops/_softmax_backward_data_cuda_dispatch.h torch/include/ATen/ops/_softmax_cuda_dispatch.h torch/include/ATen/ops/_sparse_semi_structured_addmm_cuda_dispatch.h 
torch/include/ATen/ops/_sparse_semi_structured_apply_cuda_dispatch.h torch/include/ATen/ops/_sparse_semi_structured_apply_dense_cuda_dispatch.h torch/include/ATen/ops/_sparse_semi_structured_linear_cuda_dispatch.h torch/include/ATen/ops/_sparse_semi_structured_mm_cuda_dispatch.h torch/include/ATen/ops/_sparse_semi_structured_tile_cuda_dispatch.h torch/include/ATen/ops/_standard_gamma_cuda_dispatch.h torch/include/ATen/ops/_standard_gamma_grad_cuda_dispatch.h torch/include/ATen/ops/_thnn_fused_gru_cell_backward_cuda_dispatch.h torch/include/ATen/ops/_thnn_fused_gru_cell_cuda_dispatch.h torch/include/ATen/ops/_thnn_fused_lstm_cell_backward_impl_cuda_dispatch.h torch/include/ATen/ops/_thnn_fused_lstm_cell_cuda_dispatch.h torch/include/ATen/ops/_to_sparse_bsc_cuda_dispatch.h torch/include/ATen/ops/_to_sparse_bsr_cuda_dispatch.h torch/include/ATen/ops/_to_sparse_csc_cuda_dispatch.h torch/include/ATen/ops/_to_sparse_csr_cuda_dispatch.h torch/include/ATen/ops/_to_sparse_cuda_dispatch.h torch/include/ATen/ops/_to_sparse_semi_structured_cuda_dispatch.h torch/include/ATen/ops/_transformer_encoder_layer_fwd_cuda_dispatch.h torch/include/ATen/ops/_transform_bias_rescale_qkv_cuda_dispatch.h torch/include/ATen/ops/_triton_multi_head_attention_cuda_dispatch.h torch/include/ATen/ops/_triton_scaled_dot_attention_cuda_dispatch.h torch/include/ATen/ops/_unique2_cuda_dispatch.h torch/include/ATen/ops/_unique_cuda_dispatch.h torch/include/ATen/ops/_upsample_bicubic2d_aa_backward_cuda_dispatch.h torch/include/ATen/ops/_upsample_bicubic2d_aa_cuda_dispatch.h torch/include/ATen/ops/_upsample_bilinear2d_aa_backward_cuda_dispatch.h torch/include/ATen/ops/_upsample_bilinear2d_aa_cuda_dispatch.h torch/include/ATen/ops/_upsample_nearest_exact1d_backward_cuda_dispatch.h torch/include/ATen/ops/_upsample_nearest_exact1d_cuda_dispatch.h torch/include/ATen/ops/_upsample_nearest_exact2d_backward_cuda_dispatch.h torch/include/ATen/ops/_upsample_nearest_exact2d_cuda_dispatch.h 
torch/include/ATen/ops/_upsample_nearest_exact3d_backward_cuda_dispatch.h torch/include/ATen/ops/_upsample_nearest_exact3d_cuda_dispatch.h torch/include/ATen/ops/_use_cudnn_ctc_loss_cuda_dispatch.h torch/include/ATen/ops/_validate_compressed_sparse_indices_cuda_dispatch.h torch/include/ATen/ops/_weight_int4pack_mm_cuda_dispatch.h torch/include/ATen/ops/_weight_norm_interface_backward_cuda_dispatch.h torch/include/ATen/ops/_weight_norm_interface_cuda_dispatch.h torch/include/c10/cuda torch/include/c10/cuda/CUDAAlgorithm.h torch/include/c10/cuda/CUDAAllocatorConfig.h torch/include/c10/cuda/CUDACachingAllocator.h torch/include/c10/cuda/CUDADeviceAssertion.h torch/include/c10/cuda/CUDADeviceAssertionHost.h torch/include/c10/cuda/CUDAException.h torch/include/c10/cuda/CUDAFunctions.h torch/include/c10/cuda/CUDAGraphsC10Utils.h torch/include/c10/cuda/CUDAGuard.h torch/include/c10/cuda/CUDAMacros.h torch/include/c10/cuda/CUDAMathCompat.h torch/include/c10/cuda/CUDAMiscFunctions.h torch/include/c10/cuda/CUDAStream.h torch/include/c10/cuda/driver_api.h torch/include/c10/cuda/impl torch/include/c10/cuda/impl/CUDAGuardImpl.h torch/include/c10/cuda/impl/CUDATest.h torch/include/c10/cuda/impl/cuda_cmake_macros.h torch/include/c10/cuda/test torch/include/c10/cuda/test/impl torch/include/torch/csrc/api/include/torch/cuda.h torch/include/torch/csrc/cuda torch/include/torch/csrc/cuda/comm.h torch/include/torch/csrc/cuda/CUDAPluggableAllocator.h torch/include/torch/csrc/cuda/device_set.h torch/include/torch/csrc/cuda/Event.h torch/include/torch/csrc/cuda/GdsFile.h torch/include/torch/csrc/cuda/memory_snapshot.h torch/include/torch/csrc/cuda/Module.h torch/include/torch/csrc/cuda/nccl.h torch/include/torch/csrc/cuda/python_comm.h torch/include/torch/csrc/cuda/python_nccl.h torch/include/torch/csrc/cuda/shared torch/include/torch/csrc/cuda/Stream.h torch/include/torch/csrc/cuda/THCP.h torch/include/torch/csrc/distributed/c10d/cuda torch/include/torch/csrc/distributed/c10d/cuda/cutlass 
torch/include/torch/csrc/distributed/c10d/cuda/cutlass/gemm torch/include/torch/csrc/distributed/c10d/cuda/cutlass/gemm/kernel torch/include/torch/csrc/distributed/c10d/cuda/utils.hpp torch/include/torch/csrc/inductor/aoti_include/cuda.h torch/include/torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h torch/include/torch/csrc/inductor/aoti_runtime/utils_cuda.h torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.h torch/include/torch/csrc/inductor/cpp_wrapper/cuda.h torch/include/torch/csrc/inductor/cpp_wrapper/device_internal/cuda.h torch/include/torch/csrc/jit/codegen/cuda torch/include/torch/csrc/jit/codegen/cuda/interface.h torch/include/torch/csrc/jit/codegen/fuser/cuda torch/include/torch/csrc/jit/codegen/fuser/cuda/fused_kernel.h torch/include/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h torch/include/torch/csrc/jit/cuda torch/include/torch/csrc/jit/cuda/cuda.h torch/include/torch/csrc/jit/tensorexpr/cuda_codegen.h torch/include/torch/csrc/jit/tensorexpr/cuda_random.h torch/include/torch/csrc/utils/cuda_enabled.h torch/lib/c10_cuda.dll torch/lib/c10_cuda.lib torch/lib/cudart64_12.dll torch/lib/torch_cuda.dll torch/lib/torch_cuda.lib torch/share/cmake/Caffe2/public/cuda.cmake

bhvieira avatar Oct 30 '25 14:10 bhvieira

The cuDNN DLLs:

torch/lib/cudnn64_9.dll torch/lib/cudnn_adv64_9.dll torch/lib/cudnn_cnn64_9.dll torch/lib/cudnn_engines_precompiled64_9.dll torch/lib/cudnn_engines_runtime_compiled64_9.dll torch/lib/cudnn_graph64_9.dll torch/lib/cudnn_heuristic64_9.dll torch/lib/cudnn_ops64_9.dll

bhvieira avatar Oct 30 '25 14:10 bhvieira

Interestingly, torch/lib/cudnn_graph64_9.dll appears in the listing, so the DLL ships with the package; the question is why it isn't found at load time. I'll need to log into a Windows machine to debug further with something like https://www.dependencywalker.com, but I don't currently have a working Windows machine.

We need to figure out why the loader isn't searching for the DLLs within torch/lib/... and where it's looking instead.
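One way to narrow this down from R itself is to check whether torch's bundled lib/ directory is on the PATH that the Windows DLL loader searches. This is a hedged sketch, not a confirmed fix: it assumes the DLLs live under `<torch package>/lib` as shown in the file listing above, and uses only base-R functions (`system.file()`, `Sys.getenv()`, `Sys.setenv()`).

```r
# Sketch: check whether torch's bundled lib/ directory is on the DLL search PATH.
# Assumption: the CUDA/cuDNN DLLs live under <torch package>/lib, as in the
# listing above. Run this BEFORE library(torch).
lib_dir <- normalizePath(file.path(system.file(package = "torch"), "lib"),
                         mustWork = FALSE)
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep)[[1]]
path_entries <- normalizePath(path_entries, mustWork = FALSE)

if (!lib_dir %in% path_entries) {
  message("torch/lib is NOT on PATH; prepending it before loading torch")
  Sys.setenv(PATH = paste(lib_dir, Sys.getenv("PATH"),
                          sep = .Platform$path.sep))
}
library(torch)  # load only after PATH is fixed so dependent DLLs can resolve
```

If torch loads cleanly after prepending the directory, that would confirm the failure is a search-path problem rather than a missing or corrupt DLL.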

dfalbel avatar Oct 30 '25 14:10 dfalbel

It definitely has to do with the search path. I ran @bhvieira's diagnostic code in RStudio and the R session crashed. I then ran R.exe in a command-line terminal to see the error causing the abort: as before, cudnn_graph64_9.dll was not found. I haven't updated to CUDA 12.8 / torch 0.16.3; I am still on CUDA 12.4 / torch 0.14.2.

I also had a temp directory with a manual download of the NVIDIA archive. I cd'ed there to compare MD5 checksums of the installed DLLs against the downloaded ones, and tried manually copying the DLLs. Then I started an R session from that directory and ran the test successfully. The point is that cudnn_graph64_9.dll was probably loaded from temp_dir/bin, i.e. the directory where I had started the R session. When I start the R session in any other directory, code execution fails as described.

So the DLL is loaded from the current directory, but it is not found in (and loaded from) torch/lib. Maybe something must explicitly be added to .libPaths() or the system-wide PATH variable.
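If the current-directory behavior above is the cause, a possible per-user workaround (untested sketch; the lib/ layout is assumed from the file listing earlier in the thread) is to prepend torch/lib to PATH in your .Rprofile, so it takes effect before torch is loaded in any session regardless of the working directory:

```r
# Sketch of a .Rprofile workaround: put torch's lib/ first on PATH so the
# Windows loader resolves cudnn_graph64_9.dll and friends from there rather
# than relying on the current working directory.
local({
  lib_dir <- file.path(system.file(package = "torch"), "lib")
  if (dir.exists(lib_dir)) {
    Sys.setenv(PATH = paste(normalizePath(lib_dir), Sys.getenv("PATH"),
                            sep = ";"))  # ";" is the Windows PATH separator
  }
})
```

This only changes the PATH of the R process itself (not the system-global variable), which is usually enough because dependent DLLs are resolved when torch's own DLLs are first loaded.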

Bernie-K avatar Nov 12 '25 16:11 Bernie-K