torch fails on new Mac M3 architecture

Open gilbertocamara opened this issue 1 year ago • 13 comments

Dear @dfalbel, I have bought a new MacBook Air with the M3 chip, which has 8 CPU cores, 10 GPU cores and 16GB of integrated memory. My R torch apps are crashing. I have put together an MWE which works on all other architectures, including a MacBook Air M1 and a Mac mini. The OS is the same (Sonoma 14.5). The MWE follows:

# ==== MWE

# Download the training samples
rds_file <- "https://raw.githubusercontent.com/e-sensing/sitsdata/master/inst/extdata/torch/train_samples.rds?raw=true"
dest_file <- paste0(tempdir(),"/train_samples.rds")
download.file(rds_file,
              destfile = dest_file,
              method = "curl")
train_samples <- readRDS(dest_file)

# Sample labels
labels <- c("Cerrado", "Forest", "Pasture", "Soy_Corn")

# Create numeric labels vector
code_labels <- seq_along(labels)
names(code_labels) <- labels

# Split the data into training and validation data sets
# Create a partition of the input data, stratified by label
frac <- 0.2
train_samples <- dplyr::group_by(train_samples, .data[["label"]])
test_samples <- train_samples |>
    dplyr::slice_sample(prop = frac) |>
    dplyr::ungroup()
    
# Remove the lines used for validation
sel <- !train_samples[["sample_id"]] %in% test_samples[["sample_id"]]
train_samples <- train_samples[sel, ]

# Shuffle the data
train_samples <- train_samples[sample(nrow(train_samples), nrow(train_samples)), ]
test_samples <- test_samples[sample(nrow(test_samples), nrow(test_samples)), ]

# Organize data for model training
train_x <- as.matrix(train_samples[, -2:0])
train_y <- unname(code_labels[train_samples[["label"]]])

# Create the test data
test_x <- as.matrix(test_samples[, -2:0])
test_y <- unname(code_labels[test_samples[["label"]]])

# Set torch seed
torch::torch_manual_seed(sample.int(10^5, 1))

# Avoid a global variable for 'self'
self <- NULL

# function to create a simple sequential NN module
.torch_linear_relu_dropout <- torch::nn_module(
    classname = "torch_linear_batch_norm_relu_dropout",
    initialize = function(input_dim,
                          output_dim,
                          dropout_rate) {
        self$block <- torch::nn_sequential(
            torch::nn_linear(input_dim, output_dim),
            torch::nn_relu(),
            torch::nn_dropout(dropout_rate)
        )
    },
    forward = function(x) {
        self$block(x)
    }
)

# Define the MLP architecture
mlp_model <- torch::nn_module(
    initialize = function(num_pred, layers, dropout_rates, y_dim) {
        tensors <- list()
        # input layer
        tensors[[1]] <- .torch_linear_relu_dropout(
            input_dim = num_pred,
            output_dim = 512,
            dropout_rate = 0.40
        )
        # output layer
        tensors[[length(tensors) + 1]] <-
            torch::nn_linear(layers[length(layers)], y_dim)
        # add softmax tensor
        tensors[[length(tensors) + 1]] <- torch::nn_softmax(dim = 2)
        # create a sequential module that calls the layers in the same
        # order.
        self$model <- torch::nn_sequential(!!!tensors)
    },
    forward = function(x) {
        self$model(x)
    }
)
# Train the model using luz

torch_model <- luz::setup(
    module = mlp_model,
    loss = torch::nn_cross_entropy_loss(),
    metrics = list(luz::luz_metric_accuracy()),
    optimizer = torch::optim_adamw
)
torch_model <- luz::set_hparams(
    torch_model,
    num_pred = ncol(train_x),
    layers = 512,
    dropout_rates = 0.3,
    y_dim = length(code_labels)
)
torch_model <- luz::set_opt_hparams(
    torch_model,
    lr = 0.001,
    eps = 1e-08,
    weight_decay = 1.0e-06
)
torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
    verbose = TRUE
)

The error occurs in the luz::fit function. Inside RStudio, the code gets stuck and then RStudio asks to restart R. When running R from the terminal, the output is:

 *** caught bus error ***
address 0x16daa0000, cause 'invalid alignment'

 *** caught segfault ***
address 0x9, cause 'invalid permissions'
zsh: segmentation fault  R

The sessionInfo() output is as follows:


R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         zeallot_0.1.0    
 [5] rlang_1.1.3       processx_3.8.4    generics_0.1.3    torch_0.12.0.9000
 [9] coro_1.0.4        glue_1.7.0        bit_4.0.5         prettyunits_1.2.0
[13] luz_0.4.0         ps_1.7.6          hms_1.1.3         fansi_1.0.6      
[17] tibble_3.2.1      progress_1.2.3    lifecycle_1.0.4   compiler_4.4.0   
[21] dplyr_1.1.4       fs_1.6.4          Rcpp_1.0.12       pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
[29] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    tools_4.4.0      
[33] bit64_4.0.5     

gilbertocamara avatar May 18 '24 21:05 gilbertocamara

Can you show me the output of torch::install_torch(reinstall = TRUE)? Also, I'm assuming it doesn't fail if you run, e.g., torch_randn(10)?

dfalbel avatar May 20 '24 17:05 dfalbel

Sure!

torch::install_torch(reinstall = TRUE)
trying URL 'https://github.com/mlverse/libtorch-mac-m1/releases/download/LibTorch-for-R/libtorch-v2.0.1.zip'
Content type 'application/octet-stream' length 49631992 bytes (47.3 MB)
==================================================
downloaded 47.3 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.12.0.9000+cpu+arm64-Darwin.zip'
Content type 'application/zip' length 3602457 bytes (3.4 MB)
==================================================
downloaded 3.4 MB

✔ torch dependencies have been installed.
ℹ You must restart your session to use torch correctly.

Running a simple command such as torch_randn(10) works.

torch::torch_randn(10)
torch_tensor
 0.8753
 0.9061
-1.8905
-0.2683
-0.4204
-0.3306
 1.1119
 0.0052
 0.3246
-0.2530
[ CPUFloatType{10} ]

torch can also access MPS on the M3. The following works:

x <- torch::torch_randn(10, 10)$to(device="mps")
y <- torch::torch_randn(10, 10)$to(device="mps")

torch::torch_mm(x, y)

The problems appear in the luz::fit() function. We compiled the lantern library from source and tried to install it as follows.

# compiled lantern from source and configured env variables as follows
devtools::install(build = FALSE)
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /Users/gilberto/torch --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library’
* installing *source* package ‘torch’ ...
** using staged installation
CMAKE_FLAGS: 
** libs
using C++ compiler: ‘Apple clang version 15.0.0 (clang-1500.3.9.4)’
using SDK: ‘MacOSX14.4.sdk’
*** Building lantern!
mkdir -p ../build-lantern
cd ../build-lantern && cmake ../src/lantern -DCMAKE_INSTALL_PREFIX=/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/00LOCK-torch/00new/torch -DCMAKE_INSTALL_MESSAGE="LAZY"  && cmake --build . --target install --config Release
### Lots of output...
-- Build files have been written to: /Users/gilberto/torch/build-lantern

## We then configured the env variables
Sys.setenv(LANTERN_URL="/Users/gilberto/torch/build-lantern")
Sys.setenv(TORCH_URL="/Users/gilberto/torch/build-lantern/libtorch")
## We then tried to install torch after this, but it fails

Either there is a problem with the lantern code when using M3, or we have failed to install correctly after compiling from source.

gilbertocamara avatar May 20 '24 17:05 gilbertocamara

You might want to try setting the env var BUILD_LANTERN=1 then running remotes::install_github("mlverse/torch") to build lantern from source. Although, I don't think lantern is the culprit here, as it's just a relatively thin wrapper around LibTorch. You might also need to build LibTorch from source.
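
A minimal sketch of that suggestion, assuming the remotes package is installed and that cmake and a C++ toolchain are available. The wrapper function name is hypothetical and exists only so the slow install step does not run until it is called explicitly:

```r
# Ask the torch install scripts to compile lantern instead of downloading it.
# The variable is inherited by the R CMD INSTALL process started below.
Sys.setenv(BUILD_LANTERN = "1")

# Hypothetical wrapper (not part of torch or remotes): the build can take a
# long time, so it only runs when called.
install_torch_from_source <- function() {
    remotes::install_github("mlverse/torch")
}

# install_torch_from_source()
```

Restart the R session after the install finishes so the newly built libraries are the ones that get loaded.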

dfalbel avatar May 20 '24 18:05 dfalbel

Also, have you tried installing the pre-built binaries, e.g. with:

kind <- "cpu"
version <- "0.12.0.9000"
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other from which you want to install the other R dependencies.
))
install.packages("torch", type = "binary")

dfalbel avatar May 20 '24 18:05 dfalbel

Thanks! I have tried, but failed.

gilbertocamara avatar May 20 '24 18:05 gilbertocamara

Can you also try disabling MPS in luz, just so we can narrow the problem down a little more?

You can do something like:

torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
    verbose = TRUE,
    accelerator = luz::accelerator(cpu = TRUE)
)

dfalbel avatar May 20 '24 19:05 dfalbel

Works!!! Can we now make luz work on MPS?

gilbertocamara avatar May 20 '24 19:05 gilbertocamara

I think we will need to figure out why torch fails on M3 + MPS for that model. I believe it's possible that you will need to build LibTorch from source to fix this issue.
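
One way to narrow this down outside luz might be a single training step run directly on the MPS device. The following is only a sketch, assuming torch is installed: the function name mps_smoke_test and the random data are hypothetical, and only the layer sizes (512 hidden units, dropout 0.4, 4 classes) mirror the MWE above.

```r
# Hypothetical helper: one forward/backward/optimizer step on a given device.
# Assumption: the torch package is installed; inputs are random, not the
# train_samples data from the MWE.
mps_smoke_test <- function(device = "mps", n = 64, num_pred = 10, n_classes = 4) {
    model <- torch::nn_sequential(
        torch::nn_linear(num_pred, 512),
        torch::nn_relu(),
        torch::nn_dropout(0.4),
        torch::nn_linear(512, n_classes)
    )
    model$to(device = device)
    optimizer <- torch::optim_adamw(model$parameters, lr = 0.001)

    x <- torch::torch_randn(n, num_pred, device = device)
    # Class labels are 1-based in R torch.
    y <- torch::torch_tensor(
        sample.int(n_classes, n, replace = TRUE),
        dtype = torch::torch_long(),
        device = device
    )

    optimizer$zero_grad()
    loss <- torch::nnf_cross_entropy(model(x), y)
    loss$backward()
    optimizer$step()
    loss$item()
}

# mps_smoke_test("cpu")  # baseline
# mps_smoke_test("mps")  # if this crashes, the problem is below luz
```

If the "mps" call crashes while the "cpu" call returns a loss value, that would point at torch or LibTorch on MPS rather than at luz itself.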

dfalbel avatar May 20 '24 20:05 dfalbel

How do I build libtorch and liblantern from source?

gilbertocamara avatar May 20 '24 20:05 gilbertocamara

To build LibTorch from source, you can follow the steps in this workflow file:

https://github.com/mlverse/libtorch-mac-m1/blob/main/.github/workflows/libtorch.yaml

Then copy the libtorch files into src/lantern/build and run devtools::load_all() or devtools::install() with BUILD_LANTERN=1 set.

dfalbel avatar May 20 '24 20:05 dfalbel

Thanks!! I will try

gilbertocamara avatar May 20 '24 20:05 gilbertocamara

Dear @dfalbel, we tried to build torch from source, but it did not work on the Mac M3 chip. Looking at the PyTorch GitHub repository, other developers are having similar problems with the new M3 chip. Please see the following issue:

https://github.com/pytorch/pytorch/issues/125803

gilbertocamara avatar May 21 '24 17:05 gilbertocamara

Hello. I had a similar issue on a Mac M2, which started after I upgraded to macOS Sonoma 14.4.1. I posted on the luz GitHub, but was happy to see some discussion here.

https://github.com/mlverse/luz/issues/143

DenaJGibbon avatar May 25 '24 11:05 DenaJGibbon

This bug has been solved with the latest version of torch (0.15.0) and luz (0.4.0)
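
A small sketch for checking whether an installation is at least at those versions. The helper name meets_fixed_versions is hypothetical; it only compares version strings and does not load torch or luz:

```r
# Hypothetical helper: TRUE when both versions are at least the ones
# reported above as working (torch 0.15.0, luz 0.4.0).
meets_fixed_versions <- function(torch_version, luz_version) {
    package_version(torch_version) >= "0.15.0" &&
        package_version(luz_version) >= "0.4.0"
}

# With both packages installed, the check could be run as:
# meets_fixed_versions(
#     as.character(utils::packageVersion("torch")),
#     as.character(utils::packageVersion("luz"))
# )
```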

gilbertocamara avatar Jun 27 '25 10:06 gilbertocamara