Improve a Shogun algorithm
Entrance task for the GSoC project "the usual suspects", see here. Also a good entrance task for any other GSoC project.
Many of Shogun's algorithms have problems, especially the more basic ones, and we might not even know about it. This entrance task is to:
- Pick a simple ML algorithm (see below)
- Write a script that benchmarks Shogun against scikit-learn or MLPack on a few simple cases. This should address multiple aspects: correctness, speed, memory consumption, robustness, and ease of use.
- If Shogun is significantly worse than a competing implementation:
- Identify bottlenecks in the code. These can be statistical in nature, implementation problems, bugs, etc.
- Fix them. We can help you here.
- Give the code a clean-up (we like easy-to-read code, even if it might not seem like that). Things to consider:
- Avoid using LAPACK/Eigen directly; use the `linalg` interface instead (it can be extended through this).
- Get rid of old pointer-based feature vector representations and use `SGVector` and friends instead.
- ...
- If you are already touching the code, why not give the interface documentation a bit of love: write it if it does not exist, fix typos, make it clearer.
- unit test your changes
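A rough sketch of what such a benchmark harness could look like. Plain NumPy stand-ins are used for the two solvers here; in practice you would slot in the actual Shogun and scikit-learn fit calls:

```python
import time
import numpy as np

def benchmark(name, fit, X, y, n_runs=3):
    """Time a fit function and report its residual sum of squares."""
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        w = fit(X, y)
        times.append(time.perf_counter() - t0)
    rss = float(np.sum((y - X @ w) ** 2))
    print(f"{name}: best {min(times):.4f}s, RSS {rss:.4f}")
    return min(times), rss

# Two stand-in least-squares solvers; replace with Shogun / sklearn calls.
def lstsq_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def normal_eq_fit(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))
y = X @ rng.standard_normal(50) + 0.1 * rng.standard_normal(2000)
t1, rss1 = benchmark("lstsq", lstsq_fit, X, y)
t2, rss2 = benchmark("normal equations", normal_eq_fit, X, y)
```

Comparing best-of-several runs keeps interpreter startup and warm-up noise out of the numbers; correctness is checked by comparing the RSS of both solvers on the same data.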
Some candidates to start with:
- KMeans (see #2987)
- KNN
- LARS
- Linear regression
- KRR
- Ridge regression
- GMM models
- KDE
- HMM (more messy)
@karlnapf can you please direct me to a specific but very basic issue/bug that I can start with? I found #2987 a bit involved.
Yes, you might want to get rid of the LAPACK calls in KRR. These can be replaced with Eigen3 calls. Unit testing is also missing here.
In addition, if you feel brave, you might want to add a linear solve to our `linalg` library. @lambday can give hints here.
@karlnapf I made this notebook to compare Least Squares Linear Regression in sklearn and in Shogun.
As demonstrated in the notebook, Shogun seems to have no problem with speed, but there is a problem with accuracy, especially when the data is shifted away from zero (e.g. the third example).
The reason is that the solution provided by Shogun's LeastSquareRegression is a linear method with bias 0. As the train_machine method shows, the bias is not taken into consideration and has to be set manually through set_bias().
This can be fixed easily by adding the bias as an additional feature whose value is one for every observation. I searched the CFeatures class but didn't find a way to add a new feature to an existing set of features. One implementation would be to get the feature matrix from the input CFeatures, copy it into a larger array together with the bias column, and transform it into a CFeatures object again, but I think that implementation is a little bit dirty.
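The workaround described above, appending a constant-one column so the bias is learned as an ordinary weight, can be sketched in plain NumPy (synthetic data, no Shogun API involved):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_bias = 4.0  # data shifted away from zero, as in the third example
y = X @ true_w + true_bias + 0.01 * rng.standard_normal(200)

# Append a column of ones so the last weight acts as the bias term.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.linalg.lstsq(X_aug, y, rcond=None)[0]
weights, bias = w_aug[:-1], w_aug[-1]
```

Without the ones column, a bias-0 fit on this data would be systematically off by the shift; with it, both the weights and the bias are recovered.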
Nice one, feel free to send a patch with corrections
- In linear regression, the bias term is the data mean. You can compute and add that explicitly. This is a great catch and should be fixed. Definitely, Shogun should take care of that by default.
- You could get rid of the ugly LAPACK call `cblas_daxpy`. Either replace it with a `linalg` solve (for that you would have to add one first, see the Readme), or use Eigen3 for the solve.
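The first bullet's suggestion, recovering the bias from the data means instead of augmenting the feature matrix, looks roughly like this in NumPy (a sketch on synthetic data, not Shogun code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4)) + 3.0   # features not centred at zero
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + 7.0 + 0.01 * rng.standard_normal(500)

# Fit on centred data, then recover the bias from the means.
x_mean, y_mean = X.mean(axis=0), y.mean()
w = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)[0]
bias = y_mean - x_mean @ w
```

This avoids allocating an augmented matrix: the solve runs on centred data and the bias falls out of the means afterwards.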
Finally, could you test this on a larger dataset in higher dimensions? It can be synthetic, as this is just for speed. Both N and D should be in [small, medium, large], where
- small: a few hundred
- medium: 2000
- large: >10000
Make sure the Shogun implementation only ever has one matrix in memory at a time (i.e. 10000x10000). You can profile the memory usage as well.
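One rough way to watch peak memory from the Python side is the standard-library tracemalloc, which tracks NumPy allocations; for Shogun's C++ internals a native profiler such as valgrind/massif would be more faithful. A sketch:

```python
import tracemalloc
import numpy as np

n = 1000
tracemalloc.start()
A = np.random.standard_normal((n, n))   # one n x n matrix, ~8*n*n bytes
G = A.T @ A                             # a second matrix live at the same time
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / (8 * n * n):.1f} matrices' worth of memory")
```

If an implementation keeps only one matrix alive at a time, the reported peak should stay close to a single matrix's size; here it is at least two, since both A and G are live.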
BTW the docs state that the bias is set to 0. You can double-check them as well, and clean up if there are problems. But even if the docs mention the 0 bias, we should still change the default behaviour.
@karlnapf I got it. I will push the last edits for the pull request and start working on this issue.
Can you also check whether the algorithm works when a preprocessor is attached to the features? There are, for example, zero-mean preprocessors. BTW, thinking about the bias: we should have a boolean flag in the constructor that allows turning the bias off (e.g. when the mean was already removed), with the default being that it is on.
@youssef-emad Could we move the discussion into its own issue? You can just open one if you have updates...
@karlnapf I don't have updates yet. I got a little bit busy with school but I'll start working on it by tomorrow.
@karlnapf, I am not that experienced in ML. Can you give me something I could start with?
@Anjan1729 there is a long list above. Pick any of the ones you like. Easy algorithms preferred, what about LDA?
Hi @karlnapf
I created an IPython notebook which compares Shogun's PCA with other toolkits such as scikit-learn and matplotlib, plus my naive Python implementation of PCA. Shogun's PCA does equally well in terms of speed, but the matrix returned by Shogun's get_transformation_matrix() method is scaled differently compared to the other standard toolkits. I am studying the Shogun PCA source code to identify any bottlenecks and will soon come up with an update. Please give suggestions, if any.
Hi @abhinavagarwalla Great! Can you open a separate issue for this so we can discuss there? We should keep this thread clean of discussions.
Hello @karlnapf
I want to contribute to improving algorithms, but I like to work in Python. Can you advise me where to start?
@shark-S shogun is a C++ library that has a Python interface (as well). If you want to work on Shogun, you should get familiar with C++.
Thanks for the reply. I also have knowledge of C++, but can I make models or test algorithms in Python, or do I have to do it in C++?
@shark-S you can test in Python, but write it only in C++.
@karlnapf I made this notebook as a comparison between Shogun and sklearn on multiple regression models (Linear Ridge Regression, Lasso, LARS, KRR). I used 4 datasets of different sizes and dimensions:
- Dataset 1: size 400, dimensions 13
- Dataset 2: size 700, dimensions 1
- Dataset 3: size 17000, dimensions 20
- Dataset 4: size 20000, dimensions 20
Observations: 1- Linear Ridge Regression and KRR: Shogun is faster, with approximately the same RSS.
2- Lasso: Shogun is faster, but it seems like something is wrong, as no weight was set to zero and the resulting weights were close to what Ridge Regression achieved.
3- LARS and Lasso: the notebook kernel crashed on the third dataset. I think this happened because the dataset needed a very large number of iterations to converge; sklearn handles this by setting a default maximum number of iterations of 1000.
As I understand from the docs, to get a Lasso regression model we can use LARS with the parameter lasso=True, but the resulting weights are the same. Did I get it wrong?
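For reference, lasso should drive some weights to exactly zero. Here is a tiny NumPy coordinate-descent lasso (soft-thresholding); this is not Shogun's implementation, just an illustration of the sparsity one should expect from a working lasso solver:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Proximal operator of the L1 penalty: shrinks toward exactly zero."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_w = np.array([2.0, -3.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.standard_normal(100)
w = lasso_cd(X, y, alpha=0.2)
```

With the last three true weights being zero and a moderate alpha, the fitted w should have exact zeros in those positions; if a solver's "lasso" output looks like ridge weights (nothing exactly zero), something is off.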
@youssef-emad Great work! We would really like to have a more maintainable and reproducible way of doing benchmarking, so it'd be great if you could port this idea to this framework: https://github.com/zoq/benchmarks
In its current form this is just a snapshot of each library... it'll be hard to re-run with newer releases... and you'll also see that there are many other libraries we should include in the comparison.
@vigsterkr yeah sure , I just have an assignment to deliver in a few hours so I'll finish it and check this out :smiley:
Thanks, very nice work. @vigsterkr is right, these notebooks can only be a first step to see what is going on. Eventually we want the scripts to do this in the benchmark platform -- so that we can re-run. But of course the notebooks are a nice way of exploring.
A few words on the benchmarks:
- Aim for benchmarks that take at least a few seconds to run. Otherwise we only observe the noise of the Python interpreter being fired up; we want actual runtime. One or two short ones are ok to test this overhead, but in general what counts is performance on longer runs.
- N is large enough -- but could be larger (see the first point). What about a larger D?
- Easy problems usually run faster than hard ones. Try to create harder regression problems -- higher dimensions, more noise. Using N=20000, D=3 and linear functions does not represent an interesting problem.
- Make sure the regulariser is set to the same value in both Shogun and sklearn.
- If you find bugs (like a crash or a wrong result), always report them in an issue and give a way to reproduce the bug.
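A sketch of a generator for harder synthetic regression problems along those lines; all names and parameters here are hypothetical and would need adapting to the benchmark framework:

```python
import numpy as np

def make_hard_regression(n, d, noise=1.0, nonlinear=True, seed=0):
    """Synthetic regression with high dimension, noise, and optional nonlinearity."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w = rng.standard_normal(d)
    y = X @ w
    if nonlinear:
        # Mild nonlinearity so that linear fits are not trivially perfect.
        y += 0.5 * np.sin(3.0 * X[:, 0])
    y += noise * rng.standard_normal(n)
    return X, y

X, y = make_hard_regression(n=20000, d=500, noise=2.0)
```

Scaling `d` and `noise` moves the problem away from the "N=20000, D=3, linear" regime that does not stress the solvers.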
Good luck with the assignment :)
@karlnapf @vigsterkr I added 4 new benchmarks and made a pull request at the benchmarks repo.
I also made a notebook to compare ease of use and accuracy between Shogun and sklearn for 4 different classifiers (Naive Bayes, KNN, QDA, Logistic Regression) and to check large datasets with larger dimensions. It seems Shogun has a problem with high dimensions and a large number of classes.
Note: sorry for the delay, I was stuck with some academic commitments.
Great that this finally happened! For the benchmarks, I guess we need to compare against something, e.g. sklearn. As far as I can see here, the comparison so far only happens in your notebook?
In the notebook, it would be great if you could label the things you print; otherwise it takes me some time to parse.
Comments:
- Use larger datasets -- timings at microsecond scale have very little meaning, as Python takes longer to start up than the actual computations.
- You should always make sure that the options are set in such a way that the algorithms do exactly the same thing.
Looks like you identified problems in Shogun's results. These should definitely be investigated -- very alarming. I suggest you put up benchmark scripts for sklearn as well. Also, let's isolate one of the algorithms and try to understand why Shogun's results are so different. Very nice catch btw :)! But I don't really agree with your conclusion... we'd better find out what's going on there.
@karlnapf I'll add the benchmarks for sklearn and I'll try to investigate what's going on.
Start with one of the algos maybe. KNN or so
@karlnapf I found out what was going on. I made a horrible mistake while transforming data to Shogun's format. I fixed this embarrassing mistake and added a visual comparison of decision boundaries for datasets with 2 features. Check the updated notebook.
New observations: 1- KNN and Naive Bayes: accuracy and decision boundaries are identical in both Shogun and sklearn on all datasets. 2- QDA and Logistic Regression: accuracy and decision boundaries are approximately identical on the first 2 datasets, but on the third and fourth datasets (more classes and more dimensions) Shogun's accuracy seems to be lower. I'll try to investigate that.
I also checked the options and ensured that the same options are applied to both Shogun and sklearn. Final note: sorry again for that mistake.