QUIPP-pipeline

Correlated rank similarity metric

Open gmingas opened this issue 4 years ago • 6 comments

This PR adds support for Jenning's and Sebastian's correlated rank similarity metric.

Changes:

  • Adds three new methods to the RankingSimilarity class of rbo.py. These implement the correlated rank metric, its extrapolated version and the LP solver.
  • Modifies feature_importance.py to calculate the metric and adds a more complete RBO calculation (all types of RBO apart from uneven extrapolation) when comparing orig vs. rlds, orig vs. rand and orig vs. lower.
  • Adds calculation of the correlation matrix in feature_importance.py.
  • Adds pulp to the required libraries, which will require a rebuild of the Docker image.
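
For readers unfamiliar with RBO, here is a minimal sketch of the truncated rank-biased overlap sum for two rankings without ties. It is not the code in rbo.py: the function name `rbo_truncated` and the default `p=0.9` are assumptions for illustration, and it leaves out the extrapolated variants and the correlated rank metric added in this PR.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated RBO sum: (1 - p) * sum over d = 1..k of p**(d-1) * A_d.

    A_d is the size of the overlap between the two top-d prefixes,
    divided by d. k is the length of the shorter ranking.
    Assumes neither ranking contains duplicate items.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    k = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    overlap = 0      # items shared by the two prefixes so far
    total = 0.0
    for d in range(1, k + 1):
        a, b = ranking_a[d - 1], ranking_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if it already appeared in the other prefix.
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        total += p ** (d - 1) * overlap / d
    return (1 - p) * total
```

Smaller values of `p` put more weight on agreement at the top of the two rankings. The truncated sum never reaches 1, even for identical lists, which is one reason extrapolated versions exist.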

WIP:

  • The solver is currently too slow when tested on the framingham dataset, and I am trying to figure out why. It is possible that the LP problem has too many variables.

gmingas avatar Apr 06 '21 09:04 gmingas

@OscartGiles This PR adds pulp to the required libraries list. Can you check if I have made all the necessary changes in the code for that to work? And is it easy to update the environment we use in Azure to support this?

gmingas avatar Apr 08 '21 17:04 gmingas

@gmingas - Just fixing the tests now but looks good. To get pulp onto the VMs I can either redeploy the VM(s) or we can just install it manually for now. It should be added the next time a VM is deployed.

OscartGiles avatar Apr 12 '21 08:04 OscartGiles

Thanks! Yes, I did add it manually to do the weekend runs. No need to redeploy now, we can wait until the next time they are deployed.

gmingas avatar Apr 12 '21 08:04 gmingas

The Test pipeline run now fails because it tries to run the household_poverty stuff but doesn't have the data. Can we either grab the data in the makefile or tell it not to run the household poverty stuff when it runs the pipeline?
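
One way to "tell it not to run" could be a small guard that the pipeline calls before the household poverty step. This is only a sketch: the directory `datasets/household_poverty/raw` and the function name `should_run_household_poverty` are hypothetical, since the real data location is not stated in this thread.

```python
from pathlib import Path

# Hypothetical location of the manually downloaded raw data; the path the
# pipeline actually uses is not given in this thread.
RAW_DATA_DIR = Path("datasets/household_poverty/raw")


def should_run_household_poverty(data_dir=RAW_DATA_DIR):
    """Return True only if the raw data directory exists and is non-empty."""
    data_dir = Path(data_dir)
    return data_dir.is_dir() and any(data_dir.iterdir())


if __name__ == "__main__":
    if should_run_household_poverty():
        print("household_poverty data found, running the cleaning step")
    else:
        print("household_poverty data missing, skipping the cleaning step")
```

With a guard like this, the test pipeline would skip the step on CI, where the data are absent, while the VMs that hold the data would still run it.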

OscartGiles avatar Apr 12 '21 09:04 OscartGiles

We can grab the data in the makefile using the Kaggle API, but that would require adding an authentication token to the repo. I don't know if it is possible to do this in a secure way; probably not, judging from a quick search, but maybe someone has done this before?

The other option is to remove the household cleaning code from the makefile and run it manually whenever we need it, once the data have also been added manually.

gmingas avatar Apr 12 '21 09:04 gmingas

We could set it as an environment variable (can save it as a secret on GitHub for use in the CI pipeline). But then we also need to make sure it is an environment variable on all our VMs and make it clear in the README that you need a Kaggle API token. It would save you manually downloading it on the VMs, though.

For ref https://github.com/Kaggle/kaggle-api#api-credentials
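
As a rough sketch of the environment-variable approach, the pipeline could check for the credentials before attempting a download. The variable names `KAGGLE_USERNAME` and `KAGGLE_KEY` are the ones described in the Kaggle API README linked above; the helper name `missing_kaggle_credentials` is made up for illustration.

```python
import os

# Environment variables the Kaggle API README documents for credentials.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_kaggle_credentials()
    if missing:
        # Fail early with a clear message instead of a failed download later.
        raise SystemExit("Missing Kaggle credentials: " + ", ".join(missing))
    print("Kaggle credentials found in the environment")
```

On CI, the two values would come from repository secrets exposed as environment variables. On the VMs, they would need to be exported before the makefile runs.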

OscartGiles avatar Apr 12 '21 09:04 OscartGiles