Improve a Shogun algorithm
Entrance task for the GSoC project "the usual suspects", see here. Also a good entrance task for any other GSoC project.
Many of Shogun's algorithms have problems, especially the more basic ones, and we might not even know about it. This entrance task is to:
- Pick a simple ML algorithm (see below)
- Write a script that benchmarks Shogun against scikit-learn or MLPack on a few simple cases. This should address multiple aspects: correctness, speed, memory consumption, robustness, and ease of use.
- If Shogun is significantly worse than a competing implementation:
- Identify bottlenecks in the code. These can be statistical in nature, implementation problems, bugs, etc.
- Fix them. We can help you here.
- Give the code a clean-up (we like easy-to-read code, even if it might not seem like that). Things to consider:
- Avoid using LAPACK/Eigen directly; use the `linalg` interface instead (it can be extended through this).
- Get rid of old pointer-based feature vector representations and use `SGVector` and friends instead.
- ...
- If you are already touching the code, why not give the interface documentation a bit of love: write it if it does not exist, fix typos, make it clearer.
- unit test your changes
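A rough sketch of what such a benchmark harness could look like. Plain NumPy stand-ins are used for the two solvers here; in practice you would slot in the actual Shogun and scikit-learn fit calls:

```python
import time
import numpy as np

def benchmark(name, fit, X, y, n_runs=3):
    """Time a fit function and report its residual sum of squares."""
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        w = fit(X, y)
        times.append(time.perf_counter() - t0)
    rss = float(np.sum((y - X @ w) ** 2))
    print(f"{name}: best {min(times):.4f}s, RSS {rss:.4f}")
    return min(times), rss

# Two stand-in least-squares solvers; replace with Shogun / sklearn calls.
def lstsq_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def normal_eq_fit(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))
y = X @ rng.standard_normal(50) + 0.1 * rng.standard_normal(2000)
t1, rss1 = benchmark("lstsq", lstsq_fit, X, y)
t2, rss2 = benchmark("normal equations", normal_eq_fit, X, y)
```

Comparing best-of-several runs keeps interpreter startup and warm-up noise out of the numbers; correctness is checked by comparing the RSS of both solvers on the same data.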
Some candidates to start with:
- KMeans (see #2987)
- KNN
- LARS
- Linear regression
- KRR
- Ridge regression
- GMM models
- KDE
- HMM (more messy)
@karlnapf can you please direct me to a specific but very basic issue/bug that I can start with? I found #2987 a bit involved.
Yes, you might want to get rid of the LAPACK calls in KRR. These can be replaced with Eigen3 calls. Unit testing is also missing here.
In addition, if you feel brave, you might want to add a linear solve to our `linalg` library. @lambday can give hints here.
@karlnapf I made this notebook to compare Least Squares Linear Regression in sklearn and in Shogun.
As demonstrated in the notebook, Shogun seems to have no problem with speed, but there is a problem with accuracy, especially when the data is shifted away from zero (e.g. the third example).
The reason is that the solution provided by Shogun's LeastSquareRegression is a linear method with bias 0. As the train_machine method shows, the bias is not taken into consideration and has to be set manually through set_bias().
This can be fixed easily by adding the bias as an additional feature whose value is one for every observation. I searched the CFeatures class but didn't find a way to add a new feature to an existing set of features. One implementation would be to get the feature matrix from the input CFeatures, copy it into a larger array together with the bias column, and transform it into a CFeatures object again, but I think that implementation is a little bit dirty.
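The workaround described above, appending a constant-one column so the bias is learned as an ordinary weight, can be sketched in plain NumPy (synthetic data, no Shogun API involved):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_bias = 4.0  # data shifted away from zero, as in the third example
y = X @ true_w + true_bias + 0.01 * rng.standard_normal(200)

# Append a column of ones so the last weight acts as the bias term.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.linalg.lstsq(X_aug, y, rcond=None)[0]
weights, bias = w_aug[:-1], w_aug[-1]
```

Without the ones column, a bias-0 fit on this data would be systematically off by the shift; with it, both the weights and the bias are recovered.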
Nice one, feel free to send a patch with corrections
- In linear regression, the bias term is the data mean. You can compute and add that explicitly. This is a great catch and should be fixed. Definitely, Shogun should take care of that by default.
- You could get rid of the ugly LAPACK call `cblas_daxpy`. Either replace it with a `linalg` solve (for that you would have to add one first, see the Readme), or use Eigen3 for the solve.
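The first bullet's suggestion, recovering the bias from the data means instead of augmenting the feature matrix, looks roughly like this in NumPy (a sketch on synthetic data, not Shogun code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4)) + 3.0   # features not centred at zero
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + 7.0 + 0.01 * rng.standard_normal(500)

# Fit on centred data, then recover the bias from the means.
x_mean, y_mean = X.mean(axis=0), y.mean()
w = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)[0]
bias = y_mean - x_mean @ w
```

This avoids allocating an augmented matrix: the solve runs on centred data and the bias falls out of the means afterwards.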
Finally, could you test this on a larger dataset in higher dimensions? It can be synthetic, as this is just for speed. Both N and D should be in [small, medium, large], where
- small: a few hundred
- medium: 2000
- large: >10000
Make sure the Shogun implementation only ever has one matrix in memory at a time (i.e. 10000x10000). You can profile the memory usage as well.
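One rough way to watch peak memory from the Python side is the standard-library tracemalloc, which tracks NumPy allocations; for Shogun's C++ internals a native profiler such as valgrind/massif would be more faithful. A sketch:

```python
import tracemalloc
import numpy as np

n = 1000
tracemalloc.start()
A = np.random.standard_normal((n, n))   # one n x n matrix, ~8*n*n bytes
G = A.T @ A                             # a second matrix live at the same time
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / (8 * n * n):.1f} matrices' worth of memory")
```

If an implementation keeps only one matrix alive at a time, the reported peak should stay close to a single matrix's size; here it is at least two, since both A and G are live.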
BTW the docs state that the bias is set to 0. You can double-check them as well, and clean up if there are problems. But even if the docs mention the 0 bias, we should still change the default behaviour.
@karlnapf I got it. I will push the last edits for the pull request and start working on this issue.
Can you also check whether the algorithm works when a preprocessor is attached to the features? There are, for example, zero-mean preprocessors. BTW, thinking about the bias: we should have a boolean flag in the constructor that allows turning the bias off (e.g. when the mean was already removed), with the default being that it is on.
@youssef-emad Could we move the discussion into its own issue? You can just open one if you have updates...
@karlnapf I don't have updates yet. I got a little bit busy with school but I'll start working on it by tomorrow.
@karlnapf, I am not that experienced in ML. Can you give me something I could start with?
@Anjan1729 there is a long list above. Pick any of the ones you like. Easy algorithms preferred, what about LDA?
Hi @karlnapf
I created an IPython notebook which compares Shogun's PCA with other toolkits such as scikit-learn and matplotlib, plus my naive Python implementation of PCA. Shogun's PCA does equally well in terms of speed, but the matrix returned by Shogun's get_transformation_matrix() method is scaled differently compared to the other standard toolkits. I am studying the Shogun PCA source code to identify any bottlenecks and will soon come up with an update. Please give suggestions, if any.
Hi @abhinavagarwalla Great! Can you open a separate issue for this so we can discuss there? We should keep this thread clean of discussions.
Hello @karlnapf
I want to contribute to improving algorithms, but I like to work in Python. Can you advise me where to start?
@shark-S shogun is a C++ library that has a Python interface (as well). If you want to work on Shogun, you should get familiar with C++.
Thanks for the reply. I also have knowledge of C++, but can I make models or test algorithms in Python, or do I have to do it in C++?
@shark-S you can test in Python, but write it only in C++.
@karlnapf I made this notebook as a comparison between Shogun and sklearn on multiple regression models (Linear Ridge Regression, Lasso, LARS, KRR). I used 4 datasets of different sizes and dimensions:
- Dataset 1: size 400, dimensions 13
- Dataset 2: size 700, dimensions 1
- Dataset 3: size 17000, dimensions 20
- Dataset 4: size 20000, dimensions 20
Observations: 1- Linear Ridge Regression and KRR: Shogun is faster, with approximately the same RSS.
2- Lasso: Shogun is faster, but it seems like something is wrong, as no weight was set to zero and the resulting weights were close to what Ridge Regression achieved.
3- LARS and Lasso: the notebook kernel crashed on the third dataset. I think this happened because the dataset needed a very large number of iterations to converge; sklearn handles this by setting a default maximum number of iterations of 1000.
As I understand from the docs, to get a Lasso regression model we can use LARS with the parameter lasso=True, but the resulting weights are the same. Did I get it wrong?
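For reference, lasso should drive some weights to exactly zero. Here is a tiny NumPy coordinate-descent lasso (soft-thresholding); this is not Shogun's implementation, just an illustration of the sparsity one should expect from a working lasso solver:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Proximal operator of the L1 penalty: shrinks toward exactly zero."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_w = np.array([2.0, -3.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.standard_normal(100)
w = lasso_cd(X, y, alpha=0.2)
```

With the last three true weights being zero and a moderate alpha, the fitted w should have exact zeros in those positions; if a solver's "lasso" output looks like ridge weights (nothing exactly zero), something is off.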
@youssef-emad Great work! We would really like to have a more maintainable and reproducible way of doing benchmarking, so it'd be great if you could port this idea to this framework: https://github.com/zoq/benchmarks
In its current form this is just a snapshot of each library... it'll be hard to re-run with newer releases... and you'll also see that there are many other libraries we should include in the comparison.
@vigsterkr yeah sure , I just have an assignment to deliver in a few hours so I'll finish it and check this out :smiley:
Thanks, very nice work. @vigsterkr is right, these notebooks can only be a first step to see what is going on. Eventually we want the scripts to do this in the benchmark platform -- so that we can re-run. But of course the notebooks are a nice way of exploring.
A few words on the benchmarks:
- Aim for benchmarks that take at least a few seconds to run. Otherwise we only observe the noise of the Python interpreter being fired up; we want actual runtime. One or two short ones are ok to test this overhead, but in general what counts is performance on longer runs.
- N is large enough -- but could be larger (see the first point). What about a larger D?
- Easy problems usually run faster than hard ones. Try to create harder regression problems -- higher dimensions, more noise. Using N=20000, D=3 and linear functions does not represent an interesting problem.
- Make sure the regulariser is set to the same value in both Shogun and sklearn.
- If you find bugs (like a crash or a wrong result), always report them in an issue and give a way to reproduce the bug.
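A sketch of a generator for harder synthetic regression problems along those lines; all names and parameters here are hypothetical and would need adapting to the benchmark framework:

```python
import numpy as np

def make_hard_regression(n, d, noise=1.0, nonlinear=True, seed=0):
    """Synthetic regression with high dimension, noise, and optional nonlinearity."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w = rng.standard_normal(d)
    y = X @ w
    if nonlinear:
        # Mild nonlinearity so that linear fits are not trivially perfect.
        y += 0.5 * np.sin(3.0 * X[:, 0])
    y += noise * rng.standard_normal(n)
    return X, y

X, y = make_hard_regression(n=20000, d=500, noise=2.0)
```

Scaling `d` and `noise` moves the problem away from the "N=20000, D=3, linear" regime that does not stress the solvers.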
Good luck with the assignment :)
@karlnapf @vigsterkr I added 4 new benchmarks and made a pull request at the benchmarks repo.
I also made a notebook to compare ease of use and accuracy between Shogun and sklearn for 4 different classifiers (Naive Bayes, KNN, QDA, Logistic Regression) and to check large datasets with larger dimensions. It seems Shogun has a problem with high dimensions and a large number of classes.
Note: sorry for the delay, I was stuck with some academic commitments.
Great that this finally happened! For the benchmarks, I guess we need to compare against something, e.g. sklearn. As far as I can see here, the comparison so far only happens in your notebook?
In the notebook, it would be great if you could label the things you print; otherwise it takes me some time to parse.
Comments:
- Use larger datasets -- timings at microsecond scale have very little meaning, as Python takes longer to start up than the actual computations.
- You should always make sure that the options are set in such a way that the algorithms do exactly the same thing.
Looks like you identified problems in Shogun's results. These should definitely be investigated -- very alarming. I suggest you put up benchmark scripts for sklearn as well. Also, let's isolate one of the algorithms and try to understand why Shogun's results are so different. Very nice catch btw :)! But I don't really agree with your conclusion... we'd better find out what's going on there.
@karlnapf I'll add the benchmarks for sklearn and I'll try to investigate what's going on.
Start with one of the algos maybe. KNN or so
@karlnapf I found out what was going on. I made a horrible mistake while transforming data to Shogun's format. I fixed this embarrassing mistake and added a visual comparison of decision boundaries for datasets with 2 features. Check the updated notebook.
New observations: 1- KNN and Naive Bayes: accuracy and decision boundaries are identical in both Shogun and sklearn on all datasets. 2- QDA and Logistic Regression: accuracy and decision boundaries are approximately identical on the first 2 datasets, but on the third and fourth datasets (more classes and more dimensions) Shogun's accuracy seems to be lower. I'll try to investigate that.
I also checked the options and ensured that the same options are applied to both Shogun and sklearn. Final note: sorry again for that mistake.