arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Optimize `dataframe.corr` by using correlation matrix's symmetry

Open stress-tess opened this issue 2 years ago • 0 comments

My first implementation of DataFrame.corr doesn't take advantage of the symmetric nature of the correlation matrix, it just applies corr between every combination of columns using nested loops. I'll demonstrate this in python

Set up for both examples
x = pd.Series(np.arange(10, 20))
y = pd.Series(np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48]))
z = pd.Series(np.array([i**2 for i in range(10)]))
data = [x, y, z]
cols = ['x', 'y', 'z']

naive implementation:

corr_matrix = {}

for d1, c in zip(data, cols):
    corr_arr = np.arange(len(data), dtype=np.float64)
    for i, d2 in enumerate(data):
        corr_arr[i] = d1.corr(d2)
    corr_matrix[c] = corr_arr

pd.DataFrame(corr_matrix)
#           x         y         z
# 0  1.000000  0.758640  0.962691
# 1  0.758640  1.000000  0.813931
# 2  0.962691  0.813931  1.000000

For every element of data, we do a full loop over data. So our complexity $$n + n + \dots + n + n = \sum_i^n n = n^2 = O(n^2)$$

This duplicates work since the matrix is symmetric (corr(d1,d2) == corr(d2,d1))


A more efficient method could use a chapel matrix, so we can set [i][j] and [j][i] at the same time

Slightly optimized implentation:

corr_matrix = np.zeros((len(data), len(data)))

for i in range(len(data)):
    for j in range(i, len(data)):
        corr_matrix[i,j] = data[i].corr(data[j])
        corr_matrix[j,i] = corr_matrix[i,j]

pd.DataFrame(corr_matrix, columns=cols)
#           x         y         z
# 0  1.000000  0.758640  0.962691
# 1  0.758640  1.000000  0.813931
# 2  0.962691  0.813931  1.000000

On iteration i we only have to loop over n-i elements, so the complexity becomes

$$n + (n-1) + \dots + 2 + 1 = \sum_i^n i = \frac{n(n+1)}{2} = O(n^2)$$

.... SO we're still at $n^2$ complexity lol. We don't get an asymptotic speedup from this but it will still save us some work.

I figure ethan and i will take point on this issue but i'm curious if anyone has a better algorithm @Ethan-DeBandi99 @reuster986 @mhmerrill @hokiegeek2

stress-tess avatar Aug 05 '22 18:08 stress-tess