
Nalgebra seems to run an order of magnitude slower than Numpy, with the same backend bindings

Open rossamurphy opened this issue 11 months ago • 4 comments

Hi,

I wanted to try out nalgebra and see how it compares to very vanilla operations in numpy. I think I've implemented everything the 'right' way according to the documentation, but I'm really struggling with the performance here:

My Rust code:

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;

extern crate nalgebra as na;
extern crate nalgebra_lapack;

use std::time::Instant;

fn main() {

    let start = Instant::now();
    let rows = 2000;
    let cols = 2000;

    // Pre-allocate the vector with exact capacity
    let mut data = Vec::with_capacity(rows * cols);

    // Initialize random number generator and distribution
    let uniform = Uniform::new(0.0, 1.0);

    // Generate random numbers in parallel
    data.par_extend(
        (0..rows * cols)
            .into_par_iter()
            .map(|_|
                {
                    let mut local_rng = thread_rng(); // Create a new RNG for each thread
                    uniform.sample(&mut local_rng)
                })
    );

    // Create matrix from pre-generated data
    let a = na::DMatrix::from_vec(rows, cols, data);

    let creation_time = start.elapsed();
    println!("Matrix creation: {:?}", creation_time);
    println!("Matrix dimensions: {} x {}", a.nrows(), a.ncols());

    // Compute C = A * Aᵀ (symmetric result)
    let mult_start = Instant::now();
    let c = &a * &a.transpose();  // Use references to avoid unnecessary copies

    println!("Result dimensions: {} x {}", c.nrows(), c.ncols());
    let mult_time = mult_start.elapsed();
    println!("Matrix multiplication: {:?}", mult_time);

    // Use symmetric eigenvalue computation since C = A * Aᵀ is symmetric
    let eig_start = Instant::now();
    let _eigvals = na::linalg::SymmetricEigen::new(c.clone()).eigenvalues;
    let eig_time = eig_start.elapsed();
    println!("Eigenvalues computation: {:?}", eig_time);

    // C = A * Aᵀ is symmetric positive semidefinite; compute its SVD
    let svd_start = Instant::now();
    let svd = na::linalg::SVD::new(c, true, true);
    let _singular_values = svd.singular_values;
    let svd_time = svd_start.elapsed();
    println!("SVD calculation: {:?}", svd_time);

    println!("Total time: {:?}", start.elapsed());
}

my Cargo.toml:

[package]
name = "rust"
version = "0.1.0"
edition = "2021"

[dependencies]
nalgebra = "0.33.2"
nalgebra-lapack = { version = "0.25.0", default-features = false, features = ["accelerate"] }
rand = "0.8.5"
rayon = "1.10.0"

and my build.rs

fn main() {
    #[cfg(target_os = "macos")]
    {
        println!("cargo:rustc-link-lib=framework=Accelerate");
    }
}

and, for comparison, my Python code, a super simple script using numpy:

import numpy as np
import time

t1 = time.time()
a = np.random.rand(2_000, 2_000)
a.shape
print(a)
t2 = time.time()
b = a.T @ a
t3 = time.time()
# calculate the eigenvalues of b
result = np.linalg.eigvals(b)
# print(result)
t4 = time.time()
t5 = time.time()
result = np.linalg.svd(b)
t6 = time.time()
# cholesky decomp
result = np.linalg.cholesky(b)
t7 = time.time()
# np.show_config()
print(f"Total time: {(t6 - t1)*1e3:.1f} milliseconds")
print(f"Matrix creation: {(t2 - t1)*1e3:.1f} milliseconds")
print(f"Matrix multiplication: {(t3 - t2)*1e3:.1f} milliseconds")
print(f"Eigenval calculation: {(t4 - t3)*1e3:.1f} milliseconds")
print(f"SVD calculation: {(t6 - t5)*1e3:.1f} milliseconds")
print(f"Cholesky decomposition: {(t7 - t6)*1e3:.1f} milliseconds")

I find that when I run both, I get these results:

RUST:

❯ cargo run --release
    Finished release profile [optimized] target(s) in 0.08s
     Running target/release/rust
Matrix creation: 9.157083ms
Matrix dimensions: 2000 x 2000
Result dimensions: 2000 x 2000
Matrix multiplication: 364.901042ms
Eigenvalues computation: 4.586041791s
SVD calculation: 14.505883666s
Total time: 19.466049208s

and

PYTHON:

Total time: 3177.9 milliseconds
Matrix creation: 17.5 milliseconds
Matrix multiplication: 22.2 milliseconds
Eigenval calculation: 1377.0 milliseconds
SVD calculation: 1761.1 milliseconds
Cholesky decomposition: 27.5 milliseconds

You can see that Python is roughly 16x faster at matmul, ~3x faster at calculating eigenvalues, and ~8x faster at computing the SVD.

Please could someone tell me where I'm going wrong here?

Some further info:

  1. I'm running in 'release' mode.
  2. I know my binary is linked to Apple Accelerate because:

     ❯ otool -L target/release/rust
     target/release/rust:
             /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (compatibility version 1.0.0, current version 4.0.0)
             /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
             /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
  3. I've also tried with the default openblas setting, and performance is even worse.

If someone could please help me out / point me to where I'm going wrong, that would be great. I am aware Python is going to be hard to beat here (because this Python code is basically just running C / Fortran optimised for Apple chips), but I would expect, at the very least, that the Rust would be on par with the performance of really simple Python, given that both are targeting the same backend (Apple Accelerate).

Thanks!

rossamurphy avatar Dec 27 '24 13:12 rossamurphy

You're pulling in nalgebra_lapack, but you don't seem to actually be using it for anything.

Ralith avatar Dec 27 '24 21:12 Ralith

Why are you printing results during timing? IO can take up more time than you think. Besides, you should use criterion.rs rather than timing things yourself.
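To illustrate the IO point with the standard library only (no external crates; the workload and run count here are illustrative): collect the measurements first, and print only after the timed region has finished.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn main() {
    let n: u64 = 1_000_000;
    let mut timings: Vec<Duration> = Vec::new();

    // Time several runs; no println! inside the measured region.
    for _ in 0..5 {
        let start = Instant::now();
        // black_box keeps the optimizer from deleting the work.
        let sum: u64 = black_box((0..n).sum());
        black_box(sum);
        timings.push(start.elapsed());
    }

    // All IO happens after the measurements are collected.
    for (i, t) in timings.iter().enumerate() {
        println!("run {i}: {t:?}");
    }
}
```

Criterion does this (plus warm-up and statistics) for you; this sketch just shows the minimum needed to keep IO out of the measured region.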

Da1sypetals avatar Dec 28 '24 14:12 Da1sypetals

@Ralith ah, thanks! I hadn't realised that the nalgebra-lapack crate has its own separate methods for computing the decompositions. I naively thought that by importing it, the main nalgebra crate would have access to the speed-ups.

I've tried again now:

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;

extern crate nalgebra as na;
extern crate nalgebra_lapack as nl;

use std::time::Instant;

fn main() {

    let start = Instant::now();
    let rows = 2000;
    let cols = 2000;

    // Pre-allocate the vector with exact capacity
    let mut data = Vec::with_capacity(rows * cols);

    // Initialize random number generator and distribution
    let uniform = Uniform::new(0.0, 1.0);

    // Generate random numbers in parallel
    data.par_extend(
        (0..rows * cols)
            .into_par_iter()
            .map(|_|
                {
                    let mut local_rng = thread_rng(); // Create a new RNG for each thread
                    uniform.sample(&mut local_rng)
                })
    );

    // Create matrix from pre-generated data
    let a = na::DMatrix::from_vec(rows, cols, data);

    let creation_time = start.elapsed();
    println!("Matrix creation: {:?}", creation_time);
    println!("Matrix dimensions: {} x {}", a.nrows(), a.ncols());

    // Compute C = A * Aᵀ (symmetric result)
    let mult_start = Instant::now();
    let c = &a * &a.transpose();

    println!("Result dimensions: {} x {}", c.nrows(), c.ncols());
    let mult_time = mult_start.elapsed();
    println!("Matrix multiplication: {:?}", mult_time);
    
    // Use symmetric eigenvalue computation since C = A * Aᵀ is symmetric
    let eig_start = Instant::now();
    let _eigvals = na::linalg::SymmetricEigen::new(c.clone()).eigenvalues;
    let eig_time = eig_start.elapsed();
    println!("Eigenvalues computation native rust: {:?}", eig_time);
    
    // The symmetric eigen method from nalgebra-lapack fails here due to non-convergence,
    // which is quite strange ...
    // Calculating in the general case and just returning the real eigenvalues.
    let eig_start = Instant::now();
    let _eigen_vals = nl::Eigen::new(c.clone(), false, false).unwrap().eigenvalues_re;
    let eig_time = eig_start.elapsed();
    println!("Eigenvalues computation nalgebra lapack: {:?}", eig_time);
    
    // C is symmetric positive semidefinite; compute its SVD natively
    let svd_start = Instant::now();
    let _svd = na::SVD::new(c.clone(), true, true).singular_values;
    let svd_time = svd_start.elapsed();
    println!("SVD calculation native rust: {:?}", svd_time);
    
    // The same SVD via nalgebra-lapack
    let svd_start = Instant::now();
    let svd = nl::SVD::new(c.clone());
    let _singular_values = svd.unwrap().singular_values;
    let svd_time = svd_start.elapsed();
    println!("SVD calculation nalgebra lapack: {:?}", svd_time);

    println!("Total time: {:?}", start.elapsed());
}

This gives much better results:

Matrix creation: 9.137ms
Matrix dimensions: 2000 x 2000
Result dimensions: 2000 x 2000
Matrix multiplication: 362.080209ms
Eigenvalues computation native rust: 4.609710542s
Eigenvalues computation nalgebra lapack: 1.521910792s
SVD calculation native rust: 13.960452334s
SVD calculation nalgebra lapack: 1.578839s
Total time: 22.042224s

@Ralith is there any scope for enabling the nalgebra-lapack speedup as a feature of the main nalgebra crate? i.e. if you have an accelerator registered, the operation will be accelerated, and if you don't, it won't. That way, there could exist one standard way of doing the decompositions, rather than two sort of similar ways, depending on the underlying architecture.
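There is a user-side approximation of that idea available today: make nalgebra-lapack an optional dependency behind a Cargo feature, and branch on it with `#[cfg(feature = "lapack")]` in your code. A sketch of the Cargo.toml (the feature name `lapack` is hypothetical):

```toml
# Sketch of a user-defined feature that gates the LAPACK backend.
[features]
lapack = ["dep:nalgebra-lapack"]

[dependencies]
nalgebra = "0.33.2"
nalgebra-lapack = { version = "0.25.0", optional = true, default-features = false, features = ["accelerate"] }
```

Code would then call the nalgebra_lapack decompositions when the feature is enabled and fall back to nalgebra's pure-Rust ones otherwise, giving callers a single entry point.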

@Da1sypetals thanks for the recommendation! Criterion looks very useful and I'll use it in future. In this case, however, I just want to log to stdout in as quick and easy a means as possible. Given that the time difference between the operations (accelerated and not) is on the order of multiple seconds, I think the time incurred logging to stdout is immaterial in this example.

rossamurphy avatar Dec 30 '24 11:12 rossamurphy

Also, FYI, I achieved some pretty good results on this particular benchmark by using tch-rs, a great set of Rust bindings for libtorch: https://github.com/LaurentMazare/tch-rs . Admittedly, if you use it in production the final binary will end up a lot larger (if you decide to statically link torch), but it enabled me to take full advantage of the libtorch bindings for my device.

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;
use std::time::Instant;
use tch::{Device, Tensor};

fn main() {
    let device = Device::Mps;
    println!("Device: {:?}", device);
    let start = Instant::now();
    let rows = 2000;
    let cols = 2000;

    // Pre-allocate the vector with exact capacity
    let mut data = Vec::with_capacity(rows * cols);

    // Initialize random number generator and distribution
    let uniform = Uniform::new(0.0f32, 1.0f32); // Changed to f32

    // Generate random numbers in parallel
    data.par_extend((0..rows * cols).into_par_iter().map(|_| {
        let mut local_rng = thread_rng(); // Create a new RNG for each thread
        uniform.sample(&mut local_rng)
    }));

    let t = Tensor::from_slice2(&data.chunks(cols).collect::<Vec<_>>()).to_device(device);
    println!("Time taken to create matrix: {:?}", start.elapsed());

    let start = Instant::now();
    let transpose_t = t.transpose(0, 1);
    println!("Time taken to transpose matrix: {:?}", start.elapsed());

    let start = Instant::now();
    let psd_t = t.matmul(&transpose_t);
    println!("Time taken to matmul matrix: {:?}", start.elapsed());

    let start = Instant::now();
    let _eig = psd_t.linalg_eigvals();
    println!("Time taken to calculate eigenvalues: {:?}", start.elapsed());

    // let start = Instant::now();
    // let _eig = psd_t.cholesky(true);
    // println!(
    //     "Time taken to calculate cholesky decomp: {:?}",
    //     start.elapsed()
    // );

    let start = Instant::now();
    let _svd = t.svd(true, true);
    println!("Time taken to calculate SVD: {:?}", start.elapsed());
    // t.print();
}

Results:

Device: Mps
Time taken to create matrix: 277.979625ms
Time taken to transpose matrix: 541.417µs
Time taken to matmul matrix: 21.853542ms
Time taken to calculate eigenvalues: 891.593208ms
Time taken to calculate SVD: 592.646416ms

Doing it in torch also has the handy nicety that you can prototype stuff in torch in Python, and then transfer it pretty trivially to tch-rs in production (as it's all just torch at the end of the day).

rossamurphy avatar Dec 30 '24 11:12 rossamurphy

numpy uses BLAS, which is basically a really optimized low-level API for linear algebra that uses multithreading to speed up matrix operations. On the other hand, nalgebra's pure-Rust kernels run on a single thread by default, which might explain the time difference.
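If the single-threaded matmul is the bottleneck, one workaround worth trying (this assumes nalgebra 0.33 uses matrixmultiply 0.3 internally; check your Cargo.lock) is to enable matrixmultiply's threading feature, which Cargo's feature unification applies to nalgebra's copy of the crate as well:

```toml
[dependencies]
nalgebra = "0.33.2"
# Depending directly on matrixmultiply with its "threading" feature turns on
# its multithreaded matmul kernels for every crate in the build that uses the
# same matrixmultiply version, including nalgebra. See the matrixmultiply
# docs for how the thread count is configured.
matrixmultiply = { version = "0.3", features = ["threading"] }
```

This only affects matrix multiplication; the eigenvalue and SVD routines in nalgebra remain single-threaded, which is where the LAPACK-backed paths discussed above come in.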

Cupcake6 avatar Aug 02 '25 12:08 Cupcake6