Improve and add Spearman
This PR addresses the existing issue #167.
I will revise the code and make it ready to be integrated.
I'm opening this PR so you can follow the progress.
The tests are not passing yet, and I'm still checking whether the Cython function really works.
Hey @NicolasHug,
I am unsure whether the rankings are calculated correctly.
I think the rankings should be calculated over the columns of yr, not over its rows.
Or am I wrong?
What do you mean by the rows of yr? yr is a dictionary, so it doesn't have rows or columns.
yr is equal to ur or ir (from the trainset object), depending on the user_based parameter.
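For reference, the selection looks roughly like this (a sketch of my understanding, not the exact code; it only assumes the usual trainset attributes):

def select_yr(trainset, user_based=True):
    """Sketch: pick n_x and yr the way the KNN algorithms do."""
    if user_based:
        return trainset.n_users, trainset.ir  # similarities between users
    else:
        return trainset.n_items, trainset.ur  # similarities between items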
I split the comments a bit, so that it stays clear ^^
I refer to the dict yr from test_similarities.py:
n_x = 8
yr_global = {
    0: [(0, 3), (1, 3), (2, 3), (5, 1), (6, 1.5), (7, 3)],  # noqa
    1: [(0, 4), (1, 4), (2, 4), ],  # noqa
    2: [(2, 5), (3, 2), (4, 3)],  # noqa
    3: [(1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)],  # noqa
    4: [(1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)],  # noqa
}
The current code would return this result:
sim = sims.spearman(n_x, yr, min_support=0)
[[ 1. 1. 1. 0. 0. 0. 0. 0. ]
[ 1. 1. -0.595 0. 0. -0.999 -0.902 0.746]
[ 1. -0.595 1. 1. 0. 0.789 0.412 -0.143]
[ 0. 0. 1. 1. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 1. 0. 0. 0. ]
[ 0. -0.999 0.789 0. 0. 1. 0.885 -0.721]
[ 0. -0.902 0.412 0. 0. 0.885 1. -0.961]
[ 0. 0.746 -0.143 0. 0. -0.721 -0.961 1. ]]
I rounded it to three decimal places.
But in fact the result should be:
[[ 1. 0.335 -0.057 -0.645 -0.645 -0.516 -0.516 -0.057]
[ 0.335 1. -0.821 -0.866 -0.866 0.154 0.154 0.359]
[-0.057 -0.821 1. 0.74 0.74 -0.5 -0.5 -0.816]
[-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
[-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
[-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
[-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
[-0.057 0.359 -0.816 -0.444 -0.444 0.579 0.579 1. ]]
The second matrix was calculated with the formula used in the code.
The formula corresponds to the one given here: https://mathcracker.com/spearman-correlation-calculator.php
(Sorry, I just couldn't find a better source.)
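Written out (notation mine: R_{x,i} is the rank of the rating in row i of column x, with ranks computed within the column, and n is the number of rows), it is just the Pearson correlation applied to the ranks:

$$\rho_{x,y} = \frac{n \sum_{i} R_{x,i} R_{y,i} - \left(\sum_{i} R_{x,i}\right)\left(\sum_{i} R_{y,i}\right)}{\sqrt{n \sum_{i} R_{x,i}^2 - \left(\sum_{i} R_{x,i}\right)^2}\;\sqrt{n \sum_{i} R_{y,i}^2 - \left(\sum_{i} R_{y,i}\right)^2}}$$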
@gautamramk calculated the ranks like this:
....
rows = np.zeros(n_x, np.double)
for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        rows[xi] = ri
    ranks = rankdata(rows)
....
But the ranks should actually be determined from the columns of yr, because these represent e.g. the users and their choices.
Therefore I think the current code is not correct and does not compute the Spearman correlation.
I have written the following program to show how I think the Spearman correlation should be computed for min_support == 0.
This code was also used to generate the second matrix.
import numpy as np
from scipy.stats import rankdata

# yr_global as a matrix representation (rows are the y's, columns are the x's)
matrix = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
                   [4., 4., 4., 0., 0., 0., 0., 0.],
                   [0., 0., 5., 2., 3., 0., 0., 0.],
                   [0., 1., 4., 2., 3., 3., 3.5, 2.],
                   [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

n = len(matrix)        # number of y's (rows)
dim = len(matrix[0])   # number of x's (columns)
result = np.zeros((dim, dim))

for x in range(0, dim):
    rank_x = rankdata(matrix[:, x])
    for y in range(x, dim):
        rank_y = rankdata(matrix[:, y])
        prod = np.dot(rank_x, rank_y)
        sum_rx = sum(rank_x)
        sum_ry = sum(rank_y)
        sum_rxs = sum(rank_x ** 2)
        sum_rys = sum(rank_y ** 2)
        # numerator and denominator of the Spearman (rank Pearson) formula
        counter = n * prod - (sum_rx * sum_ry)
        denom_l = np.sqrt(n * sum_rxs - sum_rx ** 2)
        denom_r = np.sqrt(n * sum_rys - sum_ry ** 2)
        result[x, y] = round(counter / (denom_r * denom_l), 3)
        result[y, x] = result[x, y]

print(result)
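As a cross-check (a sketch, assuming scipy is available): scipy.stats.spearmanr applied column-wise to the same matrix reproduces the second matrix above.

import numpy as np
from scipy.stats import spearmanr

# same matrix representation of yr_global as above
matrix = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
                   [4., 4., 4., 0., 0., 0., 0., 0.],
                   [0., 0., 5., 2., 3., 0., 0., 0.],
                   [0., 1., 4., 2., 3., 3., 3.5, 2.],
                   [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

# with a 2-D input, spearmanr treats each column as one variable (axis=0),
# so this returns the 8 x 8 correlation matrix between the columns
rho = spearmanr(matrix)[0]
print(np.round(rho, 3))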
So I wonder whether I'm totally wrong.
If so, where is my mistake?
If not, I would completely rework the spearman method and implement it along the lines of the example code.
Many thanks, Marc
Hey @NicolasHug, I think I fixed the bug by introducing a ranking matrix.
The calculation in the previous version does not seem to be correct: the ranks are computed incorrectly.
The ranks must not be computed over the rows of yr but over its columns.
For example, take the matrix representation of yr:

[[3., 3., 3., 0., 0., 1., 1.5, 3.],
 [4., 4., 4., 0., 0., 0., 0., 0.],
 [0., 0., 5., 2., 3., 0., 0., 0.],
 [0., 1., 4., 2., 3., 3., 3.5, 2.],
 [0., 5., 1., 0., 0., 2., 2.5, 2.5]]

The old version computed the ranks along the rows; for the first row [3., 3., 3., 0., 0., 1., 1.5, 3.] it would give:

[6.5, 6.5, 6.5, 1.5, 1.5, 3., 4., 6.5]

The new version instead computes the ranks along the columns; for the first column [3., 4., 0., 0., 0.] it gives:

[4., 5., 2., 2., 2.]
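Both rank vectors can be reproduced directly with rankdata:

import numpy as np
from scipy.stats import rankdata

first_row = np.array([3., 3., 3., 0., 0., 1., 1.5, 3.])  # what the old version ranked
first_col = np.array([3., 4., 0., 0., 0.])                # what the new version ranks

print(rankdata(first_row))  # [6.5 6.5 6.5 1.5 1.5 3.  4.  6.5]
print(rankdata(first_col))  # [4. 5. 2. 2. 2.]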
The existing tests now pass without changes.
I also adjusted the documentation.
Excuse the wall of text above ^^.
But now it should be correct.
Hey @MaFeg100, sorry for the slow reply.
I haven't looked at it in great detail, but I think you are probably correct. Here is what I understand, let me know if you agree.
Spearman is like the Pearson correlation, but instead of using the raw ratings we use their ranks.
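(A quick toy example of that equivalence, nothing Surprise-specific:)

import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

a = np.array([3., 1., 4., 2., 5.])
b = np.array([2., 2., 5., 1., 3.])

# Spearman correlation == Pearson correlation of the ranks
print(spearmanr(a, b)[0])                     # ~0.718
print(pearsonr(rankdata(a), rankdata(b))[0])  # same value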
Considering a rating matrix U x I (users are rows, items are columns), a user-user Spearman similarity would first compute the ranks of the ratings in a row-wise fashion (I think that's what you mean by "the columns of yr", though I'm not comfortable speaking about rows or columns of yr, because it's not really a matrix), and then apply a Pearson sim.
Note however that this is a simplified view: in reality we want to compute the rankings between the common ratings only, not over the whole rows. (Maybe this actually has no impact? I haven't thought about it much.)
For now, maybe the most important thing is to run a quick benchmark (more thorough benchmarks can be done later on) to make sure the computation runs in a decent time compared to the other sims. I wouldn't want you to waste your time fixing this if in the end we can't merge the PR because the spearman sim is too slow :/
Then of course, we should fix the bugs if there are any.
Hope this doesn't add confusion ^^, thanks for looking into this anyway.
Hey @NicolasHug,
that's right. Spearman computes the ranks within the user or item vectors; that's what I meant by "columns".
My fix converts yr directly into ranks and then proceeds exactly like Pearson.
The old version computed the ranks in the wrong direction.
The current version also takes the common elements into account; exactly as with Pearson, this is handled by the freq matrix.
So the only difference is that yr must first be transformed into a representation that contains the ranks instead of the ratings (see the sketch below).
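A condensed sketch of that transformation (the PR's actual code is in Cython, and the helper name here is mine):

import numpy as np
from scipy.stats import rankdata

def to_rank_matrix(n_x, n_y, yr):
    """Densify yr into an (n_y, n_x) matrix, then replace each column by its ranks."""
    matrix = np.zeros((n_y, n_x))
    for y, y_ratings in yr.items():
        for x_i, r_i in y_ratings:
            matrix[y, x_i] = r_i                       # missing ratings stay 0
    for x_i in range(n_x):
        matrix[:, x_i] = rankdata(matrix[:, x_i])      # ranks per x, over all y
    return matrix

From here on, the accumulation of prods, freq, sqi, sqj, si and sj is the same as for pearson, only fed with these ranks instead of the raw ratings.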
Below are cross-validation results so Spearman can be compared with the other similarity measures.
Of interest are the computation time and the error.
Example 1: Spearman: item-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0370 1.0417 1.0520 1.0359 1.0446 1.0422 0.0058
MAE (testset) 0.8324 0.8359 0.8423 0.8289 0.8366 0.8352 0.0045
Fit time 1.99 2.00 1.99 2.04 2.06 2.02 0.03
Test time 2.72 2.81 2.88 2.77 2.83 2.80 0.05
Example 2: Cosine: item-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0214 1.0236 1.0327 1.0243 1.0297 1.0263 0.0042
MAE (testset) 0.8098 0.8109 0.8158 0.8072 0.8133 0.8114 0.0030
Fit time 1.22 1.27 1.24 1.27 1.18 1.24 0.03
Test time 2.83 2.77 2.82 2.85 2.80 2.81 0.03
Example 3: Pearson: item-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0428 1.0423 1.0380 1.0405 1.0457 1.0419 0.0025
MAE (testset) 0.8354 0.8369 0.8291 0.8330 0.8394 0.8347 0.0035
Fit time 1.79 1.80 1.78 1.79 1.78 1.79 0.01
Test time 2.73 2.80 2.75 2.81 2.88 2.80 0.05
First conclusion:
Examples 1 to 3 show that, for the item-based approach, Spearman is close to Pearson in terms of RMSE and MAE.
You can also see that Spearman takes longer than Cosine and Pearson. This is because yr is first transformed into a "rank representation".
Nevertheless, the time is not dramatically worse than with Cosine or Pearson.
Example 4: Spearman: user-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0096 1.0119 1.0051 1.0242 1.0108 1.0123 0.0064
MAE (testset) 0.8018 0.8034 0.7943 0.8143 0.8038 0.8035 0.0064
Fit time 1.12 1.18 1.21 1.16 1.75 1.29 0.23
Test time 2.43 2.46 2.45 3.11 2.65 2.62 0.26
Example 5: Cosine: user-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0095 1.0166 1.0234 1.0145 1.0188 1.0165 0.0046
MAE (testset) 0.7981 0.8035 0.8125 0.8025 0.8024 0.8038 0.0047
Fit time 0.70 0.78 0.71 0.68 0.68 0.71 0.04
Test time 2.65 2.57 2.48 2.43 2.52 2.53 0.07
Example 6: Pearson: user-based KNNBasic; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0054 1.0088 1.0122 1.0150 1.0165 1.0116 0.0041
MAE (testset) 0.8002 0.8006 0.8023 0.8045 0.8068 0.8029 0.0025
Fit time 0.96 0.99 0.98 0.97 0.99 0.98 0.01
Test time 2.45 2.45 2.57 2.52 2.53 2.50 0.04
Second conclusion
Here too, Spearman is close to Pearson in terms of RMSE and MAE.
But again, Spearman needs a bit more time in comparison.
Conclusion of the quick benchmark
I think Spearman may well be worth considering.
The method differs only slightly from the Pearson correlation in terms of implementation effort.
It also produces similar RMSE and MAE values for both the item-based and the user-based approach.
In addition, the user-based approach even runs faster than the item-based approach.
Nevertheless, Spearman lags behind the other measures in computation time. This can be explained by the fact that the ranks are not directly available, so yr must first be converted into a "rank representation".
It may make sense to work on how the ranks are determined, since that is where most of the extra time is spent.
I hope my contribution is understandable^^
Ok, thanks a lot for the benchmark. I agree that the computation time is quite reasonable in comparison.
I'll try to review it in more detail soon.
Instead of converting yr, would it be possible to avoid that by also passing xr? That is, instead of passing only ur or only ir, we could maybe just pass both?
I've considered that too.
I think you could avoid the time spent building the matrix by additionally passing xr.
Nevertheless, I don't think it would save much more time: if xr were passed, the code would end up similar to @gautamramk's old version.
To check this, I compared the two versions.
As an example:
Old Spearman; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman_old',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
This gives:
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0110 1.0130 1.0068 1.0111 1.0064 1.0097 0.0026
MAE (testset) 0.8019 0.8037 0.8007 0.8047 0.7969 0.8016 0.0027
Fit time 1.32 1.49 1.42 1.31 1.39 1.39 0.07
Test time 2.77 2.71 2.78 2.70 2.73 2.74 0.03
New Spearman; MovieLens100k; 5-fold cross validation
data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
This gives:
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.0156 1.0066 1.0147 1.0116 1.0132 1.0123 0.0032
MAE (testset) 0.8059 0.7994 0.8069 0.8023 0.8049 0.8039 0.0027
Fit time 1.40 1.47 1.26 1.27 1.21 1.32 0.10
Test time 2.62 2.74 2.94 2.80 2.68 2.75 0.11
You can see that the old and the new version take comparable time.
The errors are also similar. However, the new version passes the old tests, whereas the old version does not.
Furthermore, I don't think the additional xr parameter would change much timewise: most of the extra time is spent computing the ranks.
So I looked at the time spent in the individual code sections of the pre-processing phase.
(1) Old Spearman preprocessing
The ranks are also computed in this section.
....
start = time.time()
for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        rows[xi] = ri
    ranks = rankdata(rows)
    for xi, _ in y_ratings:
        for xj, _ in y_ratings:
            prods[xi, xj] += ranks[xi] * ranks[xj]
            freq[xi, xj] += 1
            sqi[xi, xj] += ranks[xi]**2
            sqj[xi, xj] += ranks[xj]**2
            si[xi, xj] += ranks[xi]
            sj[xi, xj] += ranks[xj]
end = time.time()
t1 = end - start
print(t1)
....
This section takes about 0.9549877643585205 s.
(2) New Spearman preprocessing: building the matrix
....
start = time.time()
# turn yr into a matrix
for y, y_ratings in iteritems(yr):
    for x_i, r_i in y_ratings:
        matrix[y, x_i] = r_i
end = time.time()
t1 = end - start
print(t1)
....
This section takes about 0.030748605728149414 s.
(3) New Spearman preprocessing: building the rank matrix
....
start = time.time()
# turn the rating matrix into a matrix that contains the ranks of the elements in yr
for x_i in range(n_x):
    matrix[:, x_i] = rankdata(matrix[:, x_i])
end = time.time()
t2 = end - start
print(t2)
....
This section takes about 0.13120055198669434 s.
(4) New Spearman preprocessing: the main loop
....
start = time.time()
for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        # use the ranking matrix to get the elements row by row
        ranks[xi] = matrix[y, xi]
    for xi, _ in y_ratings:
        for xj, _ in y_ratings:
            prods[xi, xj] += ranks[xi] * ranks[xj]
            freq[xi, xj] += 1
            sqi[xi, xj] += ranks[xi]**2
            sqj[xi, xj] += ranks[xj]**2
            si[xi, xj] += ranks[xi]
            sj[xi, xj] += ranks[xj]
end = time.time()
t3 = end - start
print(t3)
....
This section takes about 0.5745222568511963 s.
So I don't think that introducing xr would change much about the fact that the ranks have to be computed.
You can also see that the additional conversion of yr doesn't take much time (see (2)). In addition, no more changes than necessary would have to be made.
I hope this short analysis is helpful.
I would be happy to get a review of the code.
Thanks in advance! ^^
Hey.
Thanks for the review.
I will look into it in more detail tomorrow or next week.
It's interesting to see how the ranking behaves; I'm sure I'll come up with something about that.
I don't think the ties are a problem: they are taken into account when the rankings are created, because rankdata assigns averaged ranks to tied values (see the quick illustration below).
But I will take a closer look at the concerns again.
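A quick illustration of how rankdata handles ties (tied values share the average rank):

from scipy.stats import rankdata

print(rankdata([1, 2, 2, 3]))  # [1.  2.5 2.5 4. ]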
I have also taken a closer look at the Spearman rank correlation.
I used the Webscope R3 dataset from Yahoo. It contains 311,704 ratings from 15,400 users for 1,000 songs.
The dataset can be requested here: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=3
For the comparison of the Spearman rank correlation I used 5-fold cross-validation.
Here I look at the fit time in comparison to the other similarity measures.
(figure: fit time of Spearman compared to the other similarity measures)
Here you can compare the computation time against the size of the neighbourhood.
(figure: computation time vs. neighbourhood size)
My code works according to the formula written out earlier (the Pearson correlation applied to the ranks), which I put together myself.
After some initial considerations, I think it makes no difference how the ranks are calculated.
Since the ranks are normalized, it makes no difference whether they are computed over the entire sample or only over a sub-sample.
But I think this relationship should be examined more closely. Are you interested in a formal proof (or in a proof to the contrary)?
If the assumption holds, xr can actually be used instead of yr.
However, if it turns out to be wrong, it must be investigated whether the additional effort of computing the common items and their corresponding ranks is actually faster than the current solution.