sparse_dot
sdm.dot_product_mkl ends with segmentation fault for sparse-sparse multiplication
Hi, thanks for your package providing a Python interface to MKL. It's quite useful. However, we encountered unexpected behavior when performing sparse-sparse matrix multiplication: it sometimes ends in a segmentation fault.
A minimal code snippet to reproduce the bug follows. Please first download the following two matrices (the bug only appears for certain matrices):
A.npz: https://drive.google.com/file/d/1NRT8SchOS3XefZokbFOpqJw6CIygTEQ-
B.npz: https://drive.google.com/file/d/1aFDa2BbNQRGmmlAceIjK4JoogVQfKJY_/
import scipy.sparse as sparse
import sparse_dot_mkl as sdm
sdm.mkl_set_num_threads_local(1)
A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)
print("A.shape:", A.shape, "B.shape:", B.shape)
C = sdm.dot_product_mkl(A.T, B.T) # segmentation fault
# C = sdm.dot_product_mkl(B, A).T # works normally
print(C.shape)
If the first call (A.T times B.T) is used, it causes a segmentation fault. If the second, commented-out call (B times A, then transposed) is used instead, the segmentation fault does not happen. We also tried converting A to COO format first, or calling print(A) before the multiplication, and in both cases the segmentation fault does not happen either. This is unsatisfying because we could not find the exact cause, so we are asking for your help. Thank you in advance. (We tried this on multiple machines, and for this example it always happens.)
I can replicate this 100% with the code and files provided, and the segfault happens internal to mkl_sparse_spmm.
Copying the loaded objects once before passing them into the multiplication fixes this 100% of the time for me. Does this problem always happen after deserializing data from files?
import scipy.sparse as sparse
import sparse_dot_mkl as sdm
sdm.mkl_set_num_threads_local(1)
sdm.set_debug_mode(True)
A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)
A = sparse.csr_matrix(A, copy=True)
B = sparse.csr_matrix(B, copy=True)
C = sdm.dot_product_mkl(A.T, B.T)
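To compare what the copy actually changes, it may help to print a few properties of the underlying arrays before and after copying. This is only a diagnostic sketch: `describe_buffers` is a hypothetical helper, not part of sparse_dot_mkl, and the small random matrix is a stand-in for the real A.npz.

```python
import numpy as np
import scipy.sparse as sparse


def describe_buffers(name, m):
    """Return one summary line per underlying array of a CSR/CSC matrix."""
    lines = []
    for attr in ("data", "indices", "indptr"):
        arr = getattr(m, attr)
        lines.append(
            f"{name}.{attr}: dtype={arr.dtype}, owndata={arr.flags.owndata}, "
            f"c_contiguous={arr.flags.c_contiguous}, aligned={arr.flags.aligned}"
        )
    return lines


# Small random stand-in (an assumption for illustration); swap in
# sparse.load_npz("./A.npz") to inspect the matrix from this issue.
A = sparse.random(50, 400, density=0.02, format="csr", random_state=0)
A_copy = sparse.csr_matrix(A, copy=True)

for line in describe_buffers("A", A) + describe_buffers("A_copy", A_copy):
    print(line)

# The copy should hold its own buffers rather than views of the original.
print("copy shares memory:", np.shares_memory(A.indices, A_copy.indices))
```

If the loaded matrix and its copy differ in dtype, ownership, or alignment, that difference would be the first thing to look at.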
Thanks for your quick response. No, I don't think the problem originates from reading the files.
import scipy.sparse as sparse
import sparse_dot_mkl as sdm
sdm.mkl_set_num_threads_local(1)
A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)
print("A.shape:", A.shape, "B.shape:", B.shape)
A[0, 0] = 0.001
sparse.save_npz("A_new.npz", A)
A = sparse.load_npz("./A_new.npz")
C = sdm.dot_product_mkl(A.T, B.T) # no segmentation fault with the modified, re-saved A
# C = sdm.dot_product_mkl(B, A).T # works normally
print(C.shape)
If I modify the matrix and save it again, there is no segfault for A_new.T * B.T.
In our use case we don't save the matrices to files, and the segfault still happens. (We only dumped the matrices to see what was going on.)
I'll see what I can do. Unfortunately, even though I can replicate it 100% of the time, it's not occurring when I run it with valgrind.
Copying the indices (A.indices = A.indices.copy()) is enough to suppress the problem, which does not really help figure out the root cause at all.
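For anyone who wants to try the same thing on all three arrays rather than just the indices, here is a minimal sketch. `refresh_buffers` is a hypothetical helper name, not a sparse_dot_mkl function, and it only masks the symptom rather than addressing whatever is going wrong inside MKL.

```python
import numpy as np
import scipy.sparse as sparse


def refresh_buffers(m):
    """Replace data, indices and indptr of a CSR/CSC matrix with fresh copies, in place."""
    m.data = m.data.copy()
    m.indices = m.indices.copy()
    m.indptr = m.indptr.copy()
    return m


# Small random stand-in for the matrices in this issue (an assumption for illustration).
A = sparse.random(30, 200, density=0.05, format="csr", random_state=0)
old_indices = A.indices
dense_before = A.toarray()

refresh_buffers(A)

# The matrix now points at new arrays but represents the same values.
print("indices replaced:", not np.shares_memory(old_indices, A.indices))
print("values unchanged:", np.array_equal(A.toarray(), dense_before))
```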
Thanks a lot. We will tentatively use the copy trick to work around this issue. Hopefully you will find the root cause one day. Thanks.
Sadly, in our use case the segfault still happens for other matrices even when A.indices = A.indices.copy() is used.
With gdb I can see that it's segfaulting in the same place in mkl_sparse_d_do_sp2m_i4_avx2, mkl_sparse_d_do_sp2m_i4_avx2, mkl_sparse_d_do_sp2m_i8_avx2, and mkl_sparse_d_do_sp2m_i8_avx2. It's not a copy/own array issue with the indices because they're copied to cast them for the i8 routines.
Helgrind suggests that there's a race condition in mkl_sparse_s_convert_csr_i4_avx2 in the MKL worker thread and a numpy memmove from PyArray_NewCopy in the python thread, which might be what's segfaulting, but I can't actually get it to happen when the debugger is running. I don't know why numpy would be instantiating a copy and deallocating an array while MKL is working.
C = sdm.dot_product_mkl(A.T.tocsr(), B.T.tocsr())
I suspect that converting the CSC to a CSR in python ahead of time would fix this.
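A rough sketch of what that conversion does, assuming A and B are stored as CSR (so A.T and B.T come out as CSC). The random matrices are stand-ins for the real files, and the fallback to scipy's own product when sparse_dot_mkl cannot be loaded is only there so the snippet runs anywhere; it says nothing about whether the segfault is actually fixed.

```python
import numpy as np
import scipy.sparse as sparse

# Small random stand-ins with the same orientation as the matrices in this issue.
A = sparse.random(20, 300, density=0.05, format="csr", random_state=0)
B = sparse.random(10, 20, density=0.3, format="csr", random_state=1)

# Transposing a CSR matrix yields a CSC matrix.
At, Bt = A.T, B.T
print("formats after transpose:", At.format, Bt.format)

# Convert to CSR in Python before handing the operands to MKL.
At_csr, Bt_csr = At.tocsr(), Bt.tocsr()
print("formats after tocsr():", At_csr.format, Bt_csr.format)

try:
    import sparse_dot_mkl as sdm
    C = sdm.dot_product_mkl(At_csr, Bt_csr)
except (ImportError, OSError):
    # Assumption: fall back to scipy when MKL is not available.
    C = At_csr @ Bt_csr

# Cross-check against the ordering reported to work: (B @ A).T
reference = (B @ A).T
print("matches reference:", np.allclose(C.toarray(), reference.toarray()))
```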