Doc: Show stats comparing to numpy

Open JonathanWoollett-Light opened this issue 6 months ago • 7 comments

What type of report is this?

Improvement

Please describe the issue.

It would be good if a GitHub Action ran a test that generated plots comparing performance to numpy; these could then be pushed to a GitHub Pages site and viewed there.

If you have a suggestion on how it should be, add it below.

An example:

Image

This shows at which density sparse becomes more memory-efficient for different numbers of dimensions.

I generated it with:

import sparse
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm


def test_boolean():
    # Generate density values (100 points between 0.0 and 1.0)
    densities = np.linspace(0.00, 1.00, num=100)
    dims = range(1, 5)
    size = 2000

    sparse_mem: list[list[int]] = []
    numpy_mem: list[list[int]] = []
    for dim in dims:  # Dimensions 1-4
        print(f"dim: {dim}")

        # Side length chosen so the total element count stays near `size` for every dim
        dim_size = int(float(size) ** (1 / float(dim)))
        sparse_mem_dim: list[int] = []
        numpy_mem_dim: list[int] = []
        for density in tqdm(densities):

            # Sparse array memory
            sparse_arr = sparse.random([dim_size for _ in range(dim)], density=density)
            sparse_mem_dim.append(sparse_arr.nbytes)

            # Dense array memory
            dense_arr = np.empty([dim_size for _ in range(dim)])
            numpy_mem_dim.append(dense_arr.nbytes)
        sparse_mem.append(sparse_mem_dim)
        numpy_mem.append(numpy_mem_dim)

    # Plotting
    plt.figure(figsize=(10, 6))
    for i, d in enumerate(dims):
        plt.plot(densities, sparse_mem[i], "o", alpha=0.5, label=f"Sparse {d}D")
        plt.plot(densities, numpy_mem[i], "o", alpha=0.5, label=f"Numpy {d}D")
    plt.xlabel("Density")
    plt.ylabel("Memory Usage (bytes)")
    plt.title("Memory Usage vs Density for nD Arrays")
    plt.legend()
    plt.grid(True)
    plt.savefig("memory_usage.png")
    plt.close()
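
For context, a back-of-envelope estimate of where these curves should cross, assuming COO stores one float64 value plus one 64-bit coordinate per dimension for each nonzero (an assumption about the layout, not verified against the library):

# Rough crossover-density estimate for COO vs. dense float64 storage.
# Assumes 8 bytes per stored value and 8 bytes per coordinate per
# dimension per nonzero; the actual coordinate dtype may be smaller.
VALUE_BYTES = 8
INDEX_BYTES = 8

for ndim in range(1, 5):
    bytes_per_nnz = VALUE_BYTES + ndim * INDEX_BYTES
    # Dense spends VALUE_BYTES on every element, so the break-even
    # density is VALUE_BYTES / bytes_per_nnz.
    crossover = VALUE_BYTES / bytes_per_nnz
    print(f"{ndim}D: sparse smaller below ~{crossover:.0%} density")

Under that assumption the break-even densities come out at roughly 50%, 33%, 25%, and 20% for 1D through 4D, which matches the pattern in the plot.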

JonathanWoollett-Light · Jun 28 '25 11:06

Normally, I'd be willing to accept this feature request. However, it turns out to be tricky for two reasons:

  • Timing measurements aren't stable on a CI system; CPU cycle counts are usually much more stable (see the sketch below).
  • ReadTheDocs has an upper limit on build times, and we usually avoid high-load generation steps as we're already close to that limit.
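
As a rough illustration, a quick sketch (not project tooling): taking the best of several wall-clock repeats dampens the noise somewhat, but it does not remove it the way a hardware cycle counter would.

import timeit

import sparse

x = sparse.random((1000, 1000), density=0.01)
y = x.todense()

# Best-of-N timing: the minimum is less sensitive to scheduler noise
# than the mean, but results still vary between CI runs.
sparse_best = min(timeit.repeat(lambda: x @ x, number=3, repeat=5))
dense_best = min(timeit.repeat(lambda: y @ y, number=3, repeat=5))
print(f"sparse matmul best: {sparse_best:.4f}s  dense matmul best: {dense_best:.4f}s")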

I'd be happy to accept a PR adding this static image to the docs though. Let me know if that's something you're interested in.

hameerabbasi · Jun 28 '25 11:06

I'd be interested (as a prospective sparse user) in also seeing time (and memory) comparisons to scipy.sparse for cases where the functionality overlaps. In my preliminary testing, sparse is about 10x slower than scipy for most operations (comparing GCXS to csr), but I'm not sure whether that indicates something wrong with my install or whether it's typical. A plot in the docs showing what I should expect would be helpful.
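
A rough sketch of this kind of comparison (not my actual benchmark code), assuming GCXS supports the @ operator the same way COO does:

import time

import scipy.sparse
import sparse

coo = sparse.random((5000, 5000), density=0.001, random_state=42)
gcxs = sparse.GCXS.from_coo(coo)                     # pydata/sparse format
csr = scipy.sparse.csr_array(coo.to_scipy_sparse())  # scipy equivalent

def best_of(fn, repeats=5):
    # Take the best of a few runs to reduce one-off noise.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

print("GCXS @ GCXS:", best_of(lambda: gcxs @ gcxs))
print("CSR  @ CSR :", best_of(lambda: csr @ csr))

The numbers will of course vary with machine, density, and the build of each library.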

tgbrooks · Sep 07 '25 01:09

I continued this a little, as I do think it's worth committing some of this, but I've encountered some suspicious results (making me think I've made a mistake) when timing matmul, e.g.:

Image

The code is here: https://github.com/JonathanWoollett-Light/testing_pydata_sparse. I would appreciate it if anyone could check whether it's right or suggest a fix.

The memory results seem the same as before:

Image

JonathanWoollett-Light · Sep 25 '25 16:09

Hi @JonathanWoollett-Light, thanks for opening this conversation. I looked at your tests, and the legends for the timings are off:

https://github.com/JonathanWoollett-Light/testing_pydata_sparse/blob/e24d477ccd45668d8c7d3041626c6ab42eb9ce72/src/testing_pydata_sparse/init.py#L97

But the return statement returns the values in this order:

return [sparse_arr_mem, numpy_arr_mem, np_matmul_elapsed, sparse_matmul_elapsed]

and

https://github.com/JonathanWoollett-Light/testing_pydata_sparse/blob/e24d477ccd45668d8c7d3041626c6ab42eb9ce72/src/testing_pydata_sparse/init.py#L115

So the timing labels end up basically reversed.
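
One way to make this class of bug harder to reintroduce (just a suggestion, not code from the repo) is to return named fields, so the plotting code picks each series by name rather than by position:

from typing import NamedTuple

import matplotlib.pyplot as plt

class BenchResult(NamedTuple):
    # Hypothetical field names mirroring the values in the return statement above.
    sparse_mem: int
    numpy_mem: int
    numpy_matmul_s: float
    sparse_matmul_s: float

def plot_timings(densities, results: list[BenchResult]) -> None:
    # Selecting fields by name means reordering the return statement can
    # no longer silently swap the legend entries.
    plt.plot(densities, [r.sparse_matmul_s for r in results], "o", label="Sparse matmul")
    plt.plot(densities, [r.numpy_matmul_s for r in results], "o", label="Numpy matmul")
    plt.xlabel("Density")
    plt.ylabel("Time (s)")
    plt.legend()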

Numpy is expected to be faster; there are ongoing performance enhancements for pydata/sparse (see the discussion). @hameerabbasi would be a better judge.

The memory graph shows pretty great results: pydata/sparse uses less memory below roughly 20% density, which is the expected behavior. We can definitely put the graphs in the docs for a quick comparison. Please also include 1-D and 2-D scipy.sparse comparisons here.

prady0t · Nov 06 '25 05:11

@prady0t Fixed the timing and added scipy.sparse (2-D only, since a 1-D comparison would just be 2-D with one length-1 dimension).
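
A detail worth noting for the scipy memory comparison: as far as I know, scipy's CSR containers don't expose a single total-size attribute the way sparse arrays expose .nbytes, so the total has to be summed from the component arrays. A simplified sketch (not the exact script):

import scipy.sparse
import sparse

coo = sparse.random((1000, 1000), density=0.05)
csr = coo.to_scipy_sparse().tocsr()

# pydata/sparse reports its total memory directly.
print("sparse COO nbytes:", coo.nbytes)

# For scipy CSR, sum the underlying arrays: values, column indices,
# and row pointers.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print("scipy CSR bytes:  ", csr_bytes)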

Image Image

JonathanWoollett-Light · Nov 10 '25 01:11

Looks good. Thanks for the effort! Can you also make a PR adding these images to the docs?

prady0t · Nov 10 '25 04:11

In addition, the script to generate these plots would also be welcome in the same PR.

hameerabbasi · Nov 10 '25 09:11