Optimize `dataframe.corr` by using correlation matrix's symmetry
My first implementation of `DataFrame.corr` doesn't take advantage of the symmetric nature of the correlation matrix; it just applies `corr` between every combination of columns using nested loops. I'll demonstrate this in Python.
Set up for both examples:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(10, 20))
y = pd.Series(np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48]))
z = pd.Series(np.array([i**2 for i in range(10)]))
data = [x, y, z]
cols = ['x', 'y', 'z']
```
Naive implementation:

```python
corr_matrix = {}
for d1, c in zip(data, cols):
    corr_arr = np.arange(len(data), dtype=np.float64)
    for i, d2 in enumerate(data):
        corr_arr[i] = d1.corr(d2)
    corr_matrix[c] = corr_arr

pd.DataFrame(corr_matrix)
#           x         y         z
# 0  1.000000  0.758640  0.962691
# 1  0.758640  1.000000  0.813931
# 2  0.962691  0.813931  1.000000
```
For every element of `data`, we do a full loop over `data`, so the complexity is

$$n + n + \dots + n + n = \sum_{i=1}^{n} n = n^2 = O(n^2)$$

This duplicates work, since the matrix is symmetric: `corr(d1, d2) == corr(d2, d1)`.
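As a quick sanity check of that symmetry (just using `x` and `y` from the setup above, nothing arkouda-specific):

```python
# corr is symmetric in its arguments, so both calls give ~0.758640
print(x.corr(y))
print(y.corr(x))
```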
A more efficient method could use a Chapel matrix, so we can set `[i][j]` and `[j][i]` at the same time.

Slightly optimized implementation:
```python
corr_matrix = np.zeros((len(data), len(data)))
for i in range(len(data)):
    for j in range(i, len(data)):
        corr_matrix[i, j] = data[i].corr(data[j])
        corr_matrix[j, i] = corr_matrix[i, j]

pd.DataFrame(corr_matrix, columns=cols)
#           x         y         z
# 0  1.000000  0.758640  0.962691
# 1  0.758640  1.000000  0.813931
# 2  0.962691  0.813931  1.000000
```
On iteration `i` we only have to loop over `n - i` elements, so the complexity becomes

$$n + (n-1) + \dots + 2 + 1 = \sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2)$$
So we're still at $O(n^2)$ complexity. We don't get an asymptotic speedup from this, but it still cuts the number of `corr` calls roughly in half.
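To make the savings concrete, here's a rough sketch (plain Python, not arkouda code; the function names are just for illustration) counting how many `corr` calls each approach makes for `n` columns:

```python
# Number of pairwise corr calls for n columns
def naive_calls(n):
    return n * n             # every ordered (i, j) pair

def symmetric_calls(n):
    return n * (n + 1) // 2  # only pairs with j >= i (upper triangle + diagonal)

for n in (3, 10, 100):
    print(n, naive_calls(n), symmetric_calls(n))
# 3 9 6
# 10 100 55
# 100 10000 5050
```

As `n` grows, that's about a 2x reduction in `corr` calls.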
I figure Ethan and I will take point on this issue, but I'm curious if anyone has a better algorithm. @Ethan-DeBandi99 @reuster986 @mhmerrill @hokiegeek2