QUIPP-pipeline
Correlated rank similarity metric
This PR adds support for Jenning's and Sebastian's correlated rank similarity metric.
Changes:
- Adds three new methods to the `RankingSimilarity` class of `rbo.py`. These implement the correlated rank metric, its extrapolated version and the LP solver.
- Modifies `feature_importance.py` to calculate the metric and also adds a more complete RBO calculation (all types of RBO apart from uneven extrapolation) when comparing orig vs. rlds, orig vs. rand and orig vs. lower.
- Adds calculation of the correlation matrix in `feature_importance.py`.
- Adds `pulp` to the required libraries, which will require a rebuild of the Docker image.
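For context, a minimal sketch of finite-depth rank-biased overlap in the style of the standard RBO definition (Webber et al.), which is the family of metrics `RankingSimilarity` computes; the function name, signature and default `p` here are illustrative and are not taken from `rbo.py`:

```python
def rbo(S, T, p=0.9):
    """Finite-depth rank-biased overlap of two rankings S and T.

    At each depth d, the agreement is the size of the overlap of the
    two top-d prefixes divided by d; agreements are combined with
    geometrically decaying weights p**(d-1) and normalised by (1 - p).
    Illustrative sketch only; not the project's implementation.
    """
    k = max(len(S), len(T))
    total = 0.0
    for d in range(1, k + 1):
        overlap = len(set(S[:d]) & set(T[:d]))
        total += (p ** (d - 1)) * (overlap / d)
    return (1 - p) * total
```

For two identical length-k lists this evaluates to `1 - p**k`, which is why the extrapolated variants mentioned above exist: they project the truncated sum to infinite depth so that identical rankings score exactly 1.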
WIP:
- The solver is currently too slow when tested on the Framingham dataset and I am trying to figure out why. It is possible that the LP problem has too many variables.
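Since `pulp` is new to the project, here is a minimal, self-contained example of the solver API the new methods would rely on; the toy problem below is purely illustrative and is not the correlated rank LP. One thing worth checking for the slowdown: if the formulation creates a variable per pair of rank positions, the variable count grows quadratically with list length.

```python
import pulp

# Toy LP: maximise x + 2y subject to x + y <= 4, x <= 3, x, y >= 0.
prob = pulp.LpProblem("toy", pulp.LpMaximize)
x = pulp.LpVariable("x", lowBound=0)
y = pulp.LpVariable("y", lowBound=0)
prob += x + 2 * y          # objective is the first expression added
prob += x + y <= 4         # subsequent expressions are constraints
prob += x <= 3
prob.solve(pulp.PULP_CBC_CMD(msg=False))  # bundled CBC solver, quiet mode

# Optimum is x = 0, y = 4, objective 8.
best = pulp.value(prob.objective)
```

Instrumenting the real problem with `len(prob.variables())` and `len(prob.constraints)` before calling `solve` would quickly confirm or rule out the problem-size hypothesis.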
@OscartGiles This PR adds pulp to the required libraries list. Can you check if I have made all the necessary changes in the code for that to work? And is it easy to update the environment we use in Azure to support this?
@gmingas - Just fixing the tests now but looks good. To get it onto the VMs I can either redeploy the VM(s) or we can just install it manually for now. It should be added automatically next time a VM is deployed.
Thanks! Yes, I did add it manually to do the weekend runs. No need to redeploy now, we can wait until the next time they are deployed.
The Test pipeline run now fails because it tries to run the household_poverty stuff but doesn't have the data. Can we either grab the data in the makefile or tell it not to run the household poverty stuff when it runs the pipeline?
We can grab the data in the makefile using the Kaggle API, but that would require adding an authentication token to the repo. I don't know if it is possible to do this in a secure way; a quick search suggests probably not, but maybe someone has done this before?
The other option is to remove the household cleaning code from the makefile and run it manually whenever we need it after the data are added manually too.
We could set it as an environment variable (we can save it as a secret on GitHub for use in the CI pipeline). But then we also need to make sure it is an environment variable on all our VMs and make it clear in the README that you need a Kaggle API token. It would save you from manually downloading the data on the VMs, though.
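One way to make the pipeline degrade gracefully in CI: a small helper (hypothetical, not in the repo) that checks for credentials the way the Kaggle CLI does, i.e. the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables or the `~/.kaggle/kaggle.json` file, so the household_poverty step can be skipped when no token is available instead of failing:

```python
import os

def kaggle_creds_available():
    """Return True if Kaggle API credentials appear to be configured.

    The Kaggle CLI accepts credentials either as the KAGGLE_USERNAME and
    KAGGLE_KEY environment variables (which GitHub secrets can populate
    in CI) or as ~/.kaggle/kaggle.json. Hypothetical helper for sketch
    purposes; the makefile could call this to decide whether to run the
    household_poverty cleaning step.
    """
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    return os.path.exists(os.path.expanduser("~/.kaggle/kaggle.json"))
```

With this, the CI pipeline only needs the two secrets set, and local users keep the usual `kaggle.json` workflow.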
For ref https://github.com/Kaggle/kaggle-api#api-credentials