Doc: Show stats comparing to numpy
What type of report is this?
Improvement
Please describe the issue.
It would be good if a GitHub Action ran a test that generated plots comparing performance to numpy; these could then be pushed to a GitHub Pages site and viewed there.
If you have a suggestion on how it should be, add it below.
Here's an example: a plot showing the density at which sparse becomes more memory-efficient than numpy for different numbers of dimensions. I generated it with:
```python
import matplotlib.pyplot as plt
import numpy as np
import sparse
from tqdm import tqdm


def test_boolean():
    # Density values: 100 points between 0.00 and 1.00
    densities = np.linspace(0.00, 1.00, num=100)
    dims = range(1, 5)
    size = 2000

    sparse_mem: list[list[int]] = []
    numpy_mem: list[list[int]] = []
    for dim in dims:  # Dimensions 1-4
        print(f"dim: {dim}")
        # Pick a per-dimension size so the total element count stays roughly constant
        dim_size = int(float(size) ** (1 / float(dim)))
        sparse_mem_dim: list[int] = []
        numpy_mem_dim: list[int] = []
        for density in tqdm(densities):
            # Sparse array memory
            sparse_arr = sparse.random([dim_size] * dim, density=density)
            sparse_mem_dim.append(sparse_arr.nbytes)
            # Dense array memory
            dense_arr = np.empty([dim_size] * dim)
            numpy_mem_dim.append(dense_arr.nbytes)
        sparse_mem.append(sparse_mem_dim)
        numpy_mem.append(numpy_mem_dim)

    # Plotting
    plt.figure(figsize=(10, 6))
    for i, d in enumerate(dims):
        plt.plot(densities, sparse_mem[i], "o", alpha=0.5, label=f"Sparse {d}D")
        plt.plot(densities, numpy_mem[i], "o", alpha=0.5, label=f"Numpy {d}D")
    plt.xlabel("Density")
    plt.ylabel("Memory Usage (bytes)")
    plt.title("Memory Usage vs Density for nD Arrays")
    plt.legend()
    plt.grid(True)
    plt.savefig("memory_usage.png")
    plt.close()
```
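For reference, the crossover density in a plot like this can be estimated analytically. A minimal sketch, assuming COO storage with one float64 value plus one int64 coordinate per dimension for each nonzero (the pydata/sparse defaults); the helper name `coo_crossover_density` is mine, not part of any library:

```python
def coo_crossover_density(ndim: int, value_bytes: int = 8, index_bytes: int = 8) -> float:
    # COO stores roughly nnz * (value_bytes + ndim * index_bytes) bytes,
    # while a dense array stores size * value_bytes bytes.
    # With nnz = density * size, equating the two and solving for density gives:
    return value_bytes / (value_bytes + ndim * index_bytes)


for ndim in range(1, 5):
    print(f"{ndim}D crossover density \u2248 {coo_crossover_density(ndim):.3f}")
```

Under these assumptions the crossover falls at density 1/(1 + ndim): 0.5 for 1-D down to 0.2 for 4-D, which is consistent with the shape of the measured plot.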
Normally, I'd be willing to accept this feature request. However, it turns out to be tricky for two reasons:
- Measuring wall-clock time isn't stable on a CI system; counting CPU cycles is usually much more stable.
- ReadTheDocs has an upper limit on build times, and we usually avoid high-load generation steps as we're already close to that limit.
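One rough mitigation for CI timing noise, short of counting CPU cycles, is to take the minimum over several `timeit` repeats rather than a single wall-clock measurement. A minimal sketch (the workload here is a stand-in, not sparse itself, and `stable_time` is a hypothetical helper name):

```python
import timeit


def stable_time(stmt: str, setup: str = "pass", repeat: int = 9, number: int = 100) -> float:
    # timeit disables garbage collection inside the timed region by default;
    # taking the minimum over repeats filters out scheduler/CI interference spikes.
    times = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(times) / number  # best-case per-call time in seconds


per_call = stable_time("sum(range(1000))")
print(f"{per_call * 1e6:.2f} microseconds per call")
```

Even so, absolute numbers from shared CI runners drift between runs, which is why cycle counters (or a dedicated benchmark machine) are preferred for tracked benchmarks.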
I'd be happy to accept a PR adding this static image to the docs though. Let me know if that's something you're interested in.
I'd be interested (as a prospective sparse user) to also see time (and memory) comparisons to scipy.sparse for cases where the functionality overlaps. In my preliminary testing, sparse is about 10x slower than scipy for most operations (comparing GCXS to CSR), but I'm not sure whether that indicates something wrong with my install or is typical. A plot in the docs showing what I should expect would be helpful.
I continued this a little, as I do think it's worth committing some of this, but I've encountered some suspicious results (making me think I've made a mistake) in timing matmul, e.g.
The code is here: https://github.com/JonathanWoollett-Light/testing_pydata_sparse — I would appreciate it if anyone could check whether it's right or suggest a fix.
The memory results seem the same as before.
Hi @JonathanWoollett-Light, thanks for opening this conversation. I looked at your tests, and the legends for the timings are off:
https://github.com/JonathanWoollett-Light/testing_pydata_sparse/blob/e24d477ccd45668d8c7d3041626c6ab42eb9ce72/src/testing_pydata_sparse/init.py#L97
But the return statement returns them in this order:
return [sparse_arr_mem, numpy_arr_mem, np_matmul_elapsed, sparse_matmul_elapsed]
and
https://github.com/JonathanWoollett-Light/testing_pydata_sparse/blob/e24d477ccd45668d8c7d3041626c6ab42eb9ce72/src/testing_pydata_sparse/init.py#L115
So they are basically reversed.
Numpy is expected to be faster; there are ongoing performance enhancements for pydata/sparse (see the discussion). @hameerabbasi would be a better judge.
The memory graph shows pretty great results: pydata/sparse uses less memory below roughly 20% density, which again is expected behavior. We can definitely put the graphs in the docs for a quick comparison. Please also include 1-D and 2-D scipy.sparse comparisons here.
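For the scipy.sparse side of that comparison, note that `csr_matrix` has no single `nbytes` attribute; its memory use is the sum of its three component arrays. A minimal sketch (the helper name `csr_nbytes` is mine):

```python
import scipy.sparse as sp


def csr_nbytes(m) -> int:
    # CSR stores three arrays: the nonzero values, their column indices,
    # and row pointers of length nrows + 1.
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


m = sp.random(100, 100, density=0.1, format="csr")
print(f"CSR memory: {csr_nbytes(m)} bytes for {m.nnz} nonzeros")
```

This makes the scipy measurement directly comparable to `sparse_arr.nbytes` in the existing script.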
@prady0t Fixed the timing and added scipy.sparse (only 2-D, since 1-D would just be 2-D with a length-1 dimension).
Looks good, thanks for the effort! Can you also make a PR adding these images to the docs?
In addition, the script to generate these plots would also be welcome in the same PR.