Improve and add Spearman
This PR addresses the existing issue #167.
I will revise the code and make it ready to be integrated.
I'm opening this PR so you can follow the progress.
The tests are not passing yet, and I'm still checking whether the Cython function really works.
Hey @NicolasHug,
I am unsure whether the rankings are calculated correctly.
I think the rankings should be calculated over the columns of yr, not over its rows.
Or am I wrong?
What do you mean by the rows of yr? yr is a dictionary, so it doesn't have rows or columns.
yr is equal to ur or ir (from the trainset object), depending on the user_based parameter.
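For reference, the selection looks roughly like this (a sketch of my understanding, not the exact code; it only assumes the usual trainset attributes):

def select_yr(trainset, user_based=True):
    """Sketch: pick n_x and yr the way the KNN algorithms do."""
    if user_based:
        return trainset.n_users, trainset.ir  # similarities between users
    else:
        return trainset.n_items, trainset.ur  # similarities between items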
I split the comments a bit, so that it stays clear ^^
I refer to the dict yr from test_similarities.py:
n_x = 8
yr_global = {
    0: [(0, 3), (1, 3), (2, 3), (5, 1), (6, 1.5), (7, 3)],  # noqa
    1: [(0, 4), (1, 4), (2, 4), ],  # noqa
    2: [(2, 5), (3, 2), (4, 3)],  # noqa
    3: [(1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)],  # noqa
    4: [(1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)],  # noqa
}
The current code would return this result:
sim = sims.spearman(n_x, yr, min_support=0)
[[ 1. 1. 1. 0. 0. 0. 0. 0. ]
[ 1. 1. -0.595 0. 0. -0.999 -0.902 0.746]
[ 1. -0.595 1. 1. 0. 0.789 0.412 -0.143]
[ 0. 0. 1. 1. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 1. 0. 0. 0. ]
[ 0. -0.999 0.789 0. 0. 1. 0.885 -0.721]
[ 0. -0.902 0.412 0. 0. 0.885 1. -0.961]
[ 0. 0.746 -0.143 0. 0. -0.721 -0.961 1. ]]
I rounded it to three decimal places.
But in fact the result should be:
[[ 1. 0.335 -0.057 -0.645 -0.645 -0.516 -0.516 -0.057]
[ 0.335 1. -0.821 -0.866 -0.866 0.154 0.154 0.359]
[-0.057 -0.821 1. 0.74 0.74 -0.5 -0.5 -0.816]
[-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
[-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
[-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
[-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
[-0.057 0.359 -0.816 -0.444 -0.444 0.579 0.579 1. ]]
The second matrix was calculated with the formula used in the code.
The formula corresponds to the one given here: https://mathcracker.com/spearman-correlation-calculator.php
(Sorry, I just couldn't find a better source.)
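Written out (notation mine: R_{x,i} is the rank of the rating in row i of column x, with ranks computed within the column, and n is the number of rows), it is just the Pearson correlation applied to the ranks:

$$\rho_{x,y} = \frac{n \sum_{i} R_{x,i} R_{y,i} - \left(\sum_{i} R_{x,i}\right)\left(\sum_{i} R_{y,i}\right)}{\sqrt{n \sum_{i} R_{x,i}^2 - \left(\sum_{i} R_{x,i}\right)^2}\;\sqrt{n \sum_{i} R_{y,i}^2 - \left(\sum_{i} R_{y,i}\right)^2}}$$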
@gautamramk calculated the ranks like this:
....
rows = np.zeros(n_x, np.double)
for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        rows[xi] = ri
    ranks = rankdata(rows)
....
But the ranks should actually be determined from the columns of yr, because these represent e.g. the users and their choices.
Therefore I think the current code is not correct and does not compute the Spearman correlation.
I have written the following program to show how I think the Spearman correlation should be computed for min_support == 0.
This code was also used to generate the second matrix.
import numpy as np
from scipy.stats import rankdata

# yr_global as a matrix representation (rows are the y's, columns are the x's)
matrix = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
                   [4., 4., 4., 0., 0., 0., 0., 0.],
                   [0., 0., 5., 2., 3., 0., 0., 0.],
                   [0., 1., 4., 2., 3., 3., 3.5, 2.],
                   [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

n = len(matrix)        # number of y's (rows)
dim = len(matrix[0])   # number of x's (columns)
result = np.zeros((dim, dim))

for x in range(0, dim):
    rank_x = rankdata(matrix[:, x])
    for y in range(x, dim):
        rank_y = rankdata(matrix[:, y])
        prod = np.dot(rank_x, rank_y)
        sum_rx = sum(rank_x)
        sum_ry = sum(rank_y)
        sum_rxs = sum(rank_x ** 2)
        sum_rys = sum(rank_y ** 2)
        # numerator and denominator of the Spearman (rank Pearson) formula
        counter = n * prod - (sum_rx * sum_ry)
        denom_l = np.sqrt(n * sum_rxs - sum_rx ** 2)
        denom_r = np.sqrt(n * sum_rys - sum_ry ** 2)
        result[x, y] = round(counter / (denom_r * denom_l), 3)
        result[y, x] = result[x, y]

print(result)
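As a cross-check (a sketch, assuming scipy is available): scipy.stats.spearmanr applied column-wise to the same matrix reproduces the second matrix above.

import numpy as np
from scipy.stats import spearmanr

# same matrix representation of yr_global as above
matrix = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
                   [4., 4., 4., 0., 0., 0., 0., 0.],
                   [0., 0., 5., 2., 3., 0., 0., 0.],
                   [0., 1., 4., 2., 3., 3., 3.5, 2.],
                   [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

# with a 2-D input, spearmanr treats each column as one variable (axis=0),
# so this returns the 8 x 8 correlation matrix between the columns
rho = spearmanr(matrix)[0]
print(np.round(rho, 3))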
So I wonder whether I'm totally wrong.
If so, where is my mistake?
If not, I would completely rework the spearman method and implement it along the lines of the example code.
Many thanks, Marc
Hey @NicolasHug, I think I fixed the bug by introducing a ranking matrix.
The calculation in the previous version does not seem to be correct: the ranks are computed incorrectly.
The ranks must not be computed over the rows of yr but over its columns.
For example, take the matrix representation of yr:

[[3., 3., 3., 0., 0., 1., 1.5, 3.],
 [4., 4., 4., 0., 0., 0., 0., 0.],
 [0., 0., 5., 2., 3., 0., 0., 0.],
 [0., 1., 4., 2., 3., 3., 3.5, 2.],
 [0., 5., 1., 0., 0., 2., 2.5, 2.5]]

The old version computed the ranks along the rows; for the first row [3., 3., 3., 0., 0., 1., 1.5, 3.] it would give:

[6.5, 6.5, 6.5, 1.5, 1.5, 3., 4., 6.5]

The new version instead computes the ranks along the columns; for the first column [3., 4., 0., 0., 0.] it gives:

[4., 5., 2., 2., 2.]
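Both rank vectors can be reproduced directly with rankdata:

import numpy as np
from scipy.stats import rankdata

first_row = np.array([3., 3., 3., 0., 0., 1., 1.5, 3.])  # what the old version ranked
first_col = np.array([3., 4., 0., 0., 0.])                # what the new version ranks

print(rankdata(first_row))  # [6.5 6.5 6.5 1.5 1.5 3.  4.  6.5]
print(rankdata(first_col))  # [4. 5. 2. 2. 2.]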
The existing tests now pass without changes.
I also adjusted the documentation.
Excuse the wall of text above ^^.
But now it should be correct.
Hey @MaFeg100, sorry for the slow reply.
I haven't looked at it in great detail, but I think you are probably correct. Here is what I understand, let me know if you agree.
Spearman is like the Pearson correlation, but instead of using the raw ratings we use their ranks.
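(A quick toy example of that equivalence, nothing Surprise-specific:)

import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

a = np.array([3., 1., 4., 2., 5.])
b = np.array([2., 2., 5., 1., 3.])

# Spearman correlation == Pearson correlation of the ranks
print(spearmanr(a, b)[0])                     # ~0.718
print(pearsonr(rankdata(a), rankdata(b))[0])  # same value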
Considering a rating matrix U x I (users are rows, items are columns), a user-user Spearman similarity would first compute the ranks of the ratings in a row-wise fashion (I think that's what you mean by "the columns of yr", though I'm not comfortable speaking about rows or columns of yr, because it's not really a matrix), and then apply a Pearson sim.
Note however that this is a simplified view: in reality we want to compute the rankings between the common ratings only, not over the whole rows. (Maybe this actually has no impact? I haven't thought about it much.)
For now, maybe the most important thing is to run a quick benchmark (more thorough benchmarks can be done later on) to make sure the computation runs in a decent time compared to the other sims. I wouldn't want you to waste your time fixing this if in the end we can't merge the PR because the spearman sim is too slow :/
Then of course, we should fix the bugs if there are any.
Hope this doesn't add confusion ^^, thanks for looking into this anyway.
Hey @NicolasHug,
that's right. Spearman computes the ranks within the user or item vectors; that's what I meant by "columns".
My fix converts yr directly into ranks and then proceeds exactly like Pearson.
The old version computed the ranks in the wrong direction.
The current version also takes the common elements into account; exactly as with Pearson, this is handled by the freq matrix.
So the only difference is that yr must first be transformed into a representation that contains the ranks instead of the ratings (see the sketch below).
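A condensed sketch of that transformation (the PR's actual code is in Cython, and the helper name here is mine):

import numpy as np
from scipy.stats import rankdata

def to_rank_matrix(n_x, n_y, yr):
    """Densify yr into an (n_y, n_x) matrix, then replace each column by its ranks."""
    matrix = np.zeros((n_y, n_x))
    for y, y_ratings in yr.items():
        for x_i, r_i in y_ratings:
            matrix[y, x_i] = r_i                       # missing ratings stay 0
    for x_i in range(n_x):
        matrix[:, x_i] = rankdata(matrix[:, x_i])      # ranks per x, over all y
    return matrix

From here on, the accumulation of prods, freq, sqi, sqj, si and sj is the same as for pearson, only fed with these ranks instead of the raw ratings.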
Below are cross-validation results so Spearman can be compared with the other similarity measures.
Of interest are the computation time and the error.
Example 1: Spearman: item-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0370 1.0417 1.0520 1.0359 1.0446 1.0422 0.0058
MAE (testset) 0.8324 0.8359 0.8423 0.8289 0.8366 0.8352 0.0045
Fit time 1.99 2.00 1.99 2.04 2.06 2.02 0.03
Test time 2.72 2.81 2.88 2.77 2.83 2.80 0.05
Example 2: Cosine: item-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0214 1.0236 1.0327 1.0243 1.0297 1.0263 0.0042
MAE (testset) 0.8098 0.8109 0.8158 0.8072 0.8133 0.8114 0.0030
Fit time 1.22 1.27 1.24 1.27 1.18 1.24 0.03
Test time 2.83 2.77 2.82 2.85 2.80 2.81 0.03
Example 3: Pearson: item-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0428 1.0423 1.0380 1.0405 1.0457 1.0419 0.0025
MAE (testset) 0.8354 0.8369 0.8291 0.8330 0.8394 0.8347 0.0035
Fit time 1.79 1.80 1.78 1.79 1.78 1.79 0.01
Test time 2.73 2.80 2.75 2.81 2.88 2.80 0.05
First conclusion:
Examples 1 to 3 show that, for the item-based approach, Spearman is close to Pearson in terms of RMSE and MAE.
You can also see that Spearman takes longer than Cosine and Pearson. This is because yr is first transformed into a "rank representation".
Nevertheless, the time is not dramatically worse than with Cosine or Pearson.
Example 4: Spearman: user-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0096 1.0119 1.0051 1.0242 1.0108 1.0123 0.0064
MAE (testset) 0.8018 0.8034 0.7943 0.8143 0.8038 0.8035 0.0064
Fit time 1.12 1.18 1.21 1.16 1.75 1.29 0.23
Test time 2.43 2.46 2.45 3.11 2.65 2.62 0.26
Example 5: Cosine: user-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0095 1.0166 1.0234 1.0145 1.0188 1.0165 0.0046
MAE (testset) 0.7981 0.8035 0.8125 0.8025 0.8024 0.8038 0.0047
Fit time 0.70 0.78 0.71 0.68 0.68 0.71 0.04
Test time 2.65 2.57 2.48 2.43 2.52 2.53 0.07
Example 6: Pearson: user-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0054 1.0088 1.0122 1.0150 1.0165 1.0116 0.0041
MAE (testset) 0.8002 0.8006 0.8023 0.8045 0.8068 0.8029 0.0025
Fit time 0.96 0.99 0.98 0.97 0.99 0.98 0.01
Test time 2.45 2.45 2.57 2.52 2.53 2.50 0.04
Second conclusion
Here too, Spearman is close to Pearson in terms of RMSE and MAE.
But again, Spearman needs a bit more time in comparison.
Conclusion of the quick benchmark
I think Spearman may well be worth considering.
The method differs only slightly from the Pearson correlation in terms of implementation effort.
It also produces similar RMSE and MAE values for both the item-based and the user-based approach.
In addition, the user-based approach even runs faster than the item-based approach.
Nevertheless, Spearman lags behind the other measures in computation time. This can be explained by the fact that the ranks are not directly available, so yr must first be converted into a "rank representation".
It may make sense to work on how the ranks are determined, since that is where most of the extra time is spent.
I hope my contribution is understandable^^
Ok, thanks a lot for the benchmark. I agree that the computation time is quite reasonable in comparison.
I'll try to review it in more detail soon.
Instead of converting yr, would it be possible to avoid that by also passing xr? That is, instead of passing only ur or only ir, we could maybe just pass both?
I've considered that too.
I think you could avoid the time spent building the matrix by additionally passing xr.
Nevertheless, I don't think it would save much more time: if xr were passed, the code would end up similar to @gautamramk's old version.
To check this, I compared the two versions.
As an example:
Old Spearman; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman_old',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
This gives:
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0110 1.0130 1.0068 1.0111 1.0064 1.0097 0.0026
MAE (testset) 0.8019 0.8037 0.8007 0.8047 0.7969 0.8016 0.0027
Fit time 1.32 1.49 1.42 1.31 1.39 1.39 0.07
Test time 2.77 2.71 2.78 2.70 2.73 2.74 0.03
New Spearman; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
This gives:
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0156 1.0066 1.0147 1.0116 1.0132 1.0123 0.0032
MAE (testset) 0.8059 0.7994 0.8069 0.8023 0.8049 0.8039 0.0027
Fit time 1.40 1.47 1.26 1.27 1.21 1.32 0.10
Test time 2.62 2.74 2.94 2.80 2.68 2.75 0.11
You can see that the old and the new version take comparable time.
The errors are also similar. However, the new version passes the old tests, whereas the old version does not.
Furthermore, I don't think the additional xr parameter would change much timewise: most of the extra time is spent computing the ranks.
So I looked at the time spent in the individual code sections of the pre-processing phase.
(1) Old Spearman preprocessing
The ranks are also computed in this section.
....
start = time.time()
for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        rows[xi] = ri
    ranks = rankdata(rows)
    for xi, _ in y_ratings:
        for xj, _ in y_ratings:
            prods[xi, xj] += ranks[xi] * ranks[xj]
            freq[xi, xj] += 1
            sqi[xi, xj] += ranks[xi]**2
            sqj[xi, xj] += ranks[xj]**2
            si[xi, xj] += ranks[xi]
            sj[xi, xj] += ranks[xj]
end = time.time()
t1 = end - start
print(t1)
....
This section takes about 0.9549877643585205 s.
(2) New Spearman preprocessing: building the matrix
....
start = time.time()
# turn yr into a matrix
for y, y_ratings in iteritems(yr):
    for x_i, r_i in y_ratings:
        matrix[y, x_i] = r_i
end = time.time()
t1 = end - start
print(t1)
....
This section takes about 0.030748605728149414 s.
(3) New Spearman preprocessing: building the rank matrix
....
start = time.time()
# turn the rating matrix into a matrix that contains the ranks of the elements in yr
for x_i in range(n_x):
    matrix[:, x_i] = rankdata(matrix[:, x_i])
end = time.time()
t2 = end - start
print(t2)
....
This section takes about 0.13120055198669434 s.
(4) New Spearman preprocessing: the main loop
....
start = time.time()
for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        # use the ranking matrix to get the elements row by row
        ranks[xi] = matrix[y, xi]
    for xi, _ in y_ratings:
        for xj, _ in y_ratings:
            prods[xi, xj] += ranks[xi] * ranks[xj]
            freq[xi, xj] += 1
            sqi[xi, xj] += ranks[xi]**2
            sqj[xi, xj] += ranks[xj]**2
            si[xi, xj] += ranks[xi]
            sj[xi, xj] += ranks[xj]
end = time.time()
t3 = end - start
print(t3)
....
This section takes about 0.5745222568511963 s.
So I don't think that introducing xr would change much about the fact that the ranks have to be computed.
You can also see that the additional conversion of yr doesn't take much time (see (2)). In addition, no more changes than necessary would have to be made.
I hope this short analysis is helpful.
I would be happy to get a review of the code.
Thanks in advance! ^^
Hey.
Thanks for the review.
I will look into it in more detail tomorrow or next week.
It's interesting to see how the ranking behaves; I'm sure I'll come up with something about that.
I don't think the ties are a problem: they are taken into account when the rankings are created, because rankdata assigns averaged ranks to tied values (see the quick illustration below).
But I will take a closer look at the concerns again.
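A quick illustration of how rankdata handles ties (tied values share the average rank):

from scipy.stats import rankdata

print(rankdata([1, 2, 2, 3]))  # [1.  2.5 2.5 4. ]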
I have also taken a closer look at the Spearman rank correlation.
I used the Webscope R3 dataset from Yahoo. It contains 311,704 ratings from 15,400 users for 1,000 songs.
The dataset can be requested here: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=3
For the comparison of the Spearman rank correlation I used 5-fold cross-validation.
Here I look at the fit time in comparison to the other similarity measures.
(figure: fit time of Spearman compared to the other similarity measures)
Here you can compare the computation time against the size of the neighbourhood.
(figure: computation time vs. neighbourhood size)
My code works according to the formula written out earlier (the Pearson correlation applied to the ranks), which I put together myself.
After some initial considerations, I think it makes no difference how the ranks are calculated.
Since the ranks are normalized, it makes no difference whether they are computed over the entire sample or only over a sub-sample.
But I think this relationship should be examined more closely. Are you interested in a formal proof (or in a proof to the contrary)?
If the assumption holds, xr can actually be used instead of yr.
However, if it turns out to be wrong, it must be investigated whether the additional effort of computing the common items and their corresponding ranks is actually faster than the current solution.