
Originality update/improvement

cpurta opened this issue · 7 comments

Based on the suggested improvements in the PDF, I now pull the tournament_data so that the user's submission and all other submissions can be broken down into validation, test, and live data sets. We can then perform the independence tests on all three data sets and determine whether the user's submission is "original".

The update procedure is as follows:

  • Once the user submission is pulled to have its originality calculated, we gather its validation, test, and live probabilities into their own data sets.
  • We then go through the same procedure of pulling all other submissions and comparing the user's submission against each of them, while also gathering each other submission's probabilities into validation, test, and live sets to compare against the user's data sets.
  • We calculate a new independence score on the validation, test, and live data sets; if any one of them is below a certain threshold, the submission fails originality.
  • We repeat the above for all other submissions, and if the submission never fails originality we can conclude that it is indeed "original" (see the sketch after this list).
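A minimal sketch of this flow, assuming hypothetical names (an id and data_type column in the tournament data, a probability column in submissions, an independence_score function, and an ORIGINALITY_THRESHOLD constant) that are not taken from the actual repository:

```python
import pandas as pd

# Hypothetical threshold; the real value would live in the project config.
ORIGINALITY_THRESHOLD = 0.1


def split_by_data_type(tournament_data, submission):
    """Break a submission's probabilities into validation/test/live sets
    by joining on the tournament data's data_type column."""
    merged = tournament_data[["id", "data_type"]].merge(submission, on="id")
    return {
        data_type: group["probability"].values
        for data_type, group in merged.groupby("data_type")
        if data_type in ("validation", "test", "live")
    }


def is_original(tournament_data, user_submission, other_submissions, independence_score):
    """Return True only if the user's submission passes the independence
    test on all three data sets against every other submission."""
    user_sets = split_by_data_type(tournament_data, user_submission)
    for other in other_submissions:
        other_sets = split_by_data_type(tournament_data, other)
        for data_type in ("validation", "test", "live"):
            score = independence_score(user_sets[data_type], other_sets[data_type])
            if score < ORIGINALITY_THRESHOLD:
                return False  # fails originality against this submission
    return True
```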

Independence Test

The KS two-sample test has been removed since it has proven vulnerable to ad hoc modifications that allow a user to pass the originality test without necessarily being "original".

In its place was going to be a Kullback-Leibler divergence test, since two probability distribution functions should behave similarly if the divergence approaches 0 and differently as it grows. When testing this with predictions from several models (NN, linear regression, GBTs), the results were almost always on the order of 1.0e-7, i.e. close to 0, so this measure was scrapped because almost all models produced essentially the same divergence.
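For reference, a minimal sketch of a KL divergence check on sorted predictions, using scipy's entropy (which computes the KL divergence when given two distributions); the closing comment restates the observation above rather than a new measurement:

```python
import numpy as np
from scipy.stats import entropy


def kl_divergence_sorted(preds_a, preds_b):
    """KL divergence between two prediction vectors, compared as sorted
    probability vectors (the approach described above)."""
    p = np.sort(np.asarray(preds_a, dtype=float))
    q = np.sort(np.asarray(preds_b, dtype=float))
    # scipy normalizes both vectors to sum to 1 and returns
    # sum(p * log(p / q)), i.e. the KL divergence.
    return entropy(p, q)

# Tournament predictions tend to cluster tightly around 0.5, so even
# predictions from quite different models produce nearly identical
# sorted curves and a divergence very close to 0.
```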

In place of the entropy test I decided that a simpler measure might work better for an independence test. The test sorts both data sets to get ascending curves and computes the normalized residual error between the curves as a measure of how similar they are. This seems to make more sense, since it captures the mean difference relative to the larger of the two curves.
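For concreteness, here is a reconstruction of that statistic based on the expression quoted later in this thread; the function name and input handling are assumptions:

```python
import math

import numpy as np


def sorted_residual_score(preds_a, preds_b):
    """Sort both prediction vectors into ascending curves and return the
    normalized residual error between them (the statistic quoted below)."""
    data1 = np.sort(np.asarray(preds_a, dtype=float))
    data2 = np.sort(np.asarray(preds_b, dtype=float))
    n1 = len(data1)
    # Mean absolute difference between the sorted curves, normalized by the
    # larger curve's total mass, then rescaled by the 10**floor(log10(n))
    # factor discussed below.
    residual = np.sum(np.absolute(data1 - data2)) / max(np.sum(data1), np.sum(data2))
    return (1.0 / n1) * residual * 10 ** math.floor(math.log10(n1))
```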

Benchmarking originality with this test also takes 3.88 ms on average against other submissions, as opposed to 8.86 ms with the most recent version of master.

I am open to suggestions for improving this independence test or for adding another one; this one made the most sense to me after some research and testing.

Reward

Since this PR does not cover all of the improvements laid out in the doc, I think it is worth 30-40 NMR if merged, as the only other improvements are to concordance and the staking tournament. The staking work seems to be more on Numerai's side, since it involves generating new data with their keys for encryption, so it does not seem worth much to those working on this project outside the company.

Numerai username: cpurta

cpurta avatar Oct 04 '17 23:10 cpurta

I'm a bit confused by this test statistic: (1.0 / n1) * (np.sum(np.absolute(data1 - data2)) / max(np.sum(data1), np.sum(data2))) * 10**math.floor(math.log10(n1)).

It seems like (1/n)*10**floor(log10(n)) should always be about 1, but it oscillates depending on what n happens to be (for example, it is 1.0 when n = 1000 but only about 0.1 when n = 9999).

Since sum(data1) is approximately equal to sum(data2), it looks like (np.sum(np.absolute(data1 - data2)) / max(np.sum(data1), np.sum(data2))) is related to |cdf1 - cdf2|.

Would doing something like mean(|cdf1-cdf2|) make more sense? The KS test is essentially doing max(|cdf1-cdf2|). Using max might be easier to game, since you only need to change one of your predictions to change the max a lot.
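A minimal sketch of the distinction, assuming both submissions are probability vectors on [0, 1] and using a simple grid-based empirical CDF purely for illustration:

```python
import numpy as np


def cdf_differences(preds_a, preds_b, grid_size=1000):
    """Compare two empirical CDFs on a common grid: the KS statistic is the
    max of |cdf1 - cdf2|, the suggested alternative is the mean."""
    grid = np.linspace(0.0, 1.0, grid_size)  # predictions are probabilities
    cdf1 = np.searchsorted(np.sort(preds_a), grid, side="right") / len(preds_a)
    cdf2 = np.searchsorted(np.sort(preds_b), grid, side="right") / len(preds_b)
    diff = np.abs(cdf1 - cdf2)
    return diff.max(), diff.mean()  # (KS-style max, proposed mean)
```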

I do like the idea of using an information theory type metric, but it is a bummer that KL divergence doesn't work. Maybe a different type of divergence would work better? https://en.wikipedia.org/wiki/Bregman_divergence has a pretty decent list of different types.

zoso95 avatar Oct 05 '17 18:10 zoso95

Yeah, I ended up adding the 10**math.floor(math.log10(n1)) factor since the score was really small (1.0e-07), and multiplying by it brought the score into a range more comparable to the thresholds being used. I think you may be right that mean(|cdf1 - cdf2|) would give a better score. But doing some more research into KL divergence, I realized I was not applying the divergence to a CDF, but rather just to the sorted probabilities.

I have tested the KL divergence on the CDFs and have gotten better results across different models (xgboost vs. regression): the entropy measure confirms that the models will behave differently (i.e., the score tends toward infinity rather than approaching 0). I have also tested the cheating mentioned in the improvements PDF, and there the entropy measure shows that the models are extremely similar.

I have also updated the thresholds used for the KL divergence test, since it is now quantifiably easier to determine whether two models are going to behave similarly (i.e., the score is close to 0).
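One possible reading of the CDF-based version, as a sketch rather than the code in this PR (the grid-based empirical CDF and the helper name are assumptions):

```python
import numpy as np
from scipy.stats import entropy


def kl_on_cdfs(preds_a, preds_b, grid_size=1000):
    """KL divergence between empirical CDFs built on a common grid.
    scipy normalizes the inputs to sum to 1; scores near 0 mean the
    models behave similarly."""
    grid = np.linspace(0.0, 1.0, grid_size)
    cdf1 = np.searchsorted(np.sort(preds_a), grid, side="right") / len(preds_a)
    cdf2 = np.searchsorted(np.sort(preds_b), grid, side="right") / len(preds_b)
    # Where one CDF is nonzero and the other is still zero, the divergence
    # blows up toward infinity, matching the "score -> inf" behaviour
    # described above for models that behave differently.
    return entropy(cdf1, cdf2)
```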

cpurta avatar Oct 06 '17 00:10 cpurta

Per request in the slack, I'm leaving this feedback on this pull request as well.

In my opinion, any sort of distributional check isn't actually comparing how similar two sets of predictions are in any meaningful way.

If I generate a set of predictions, then shuffle them, a divergence test on the distributions of the original vs shuffled is going to say my predictions are identical, which is clearly false.
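A quick illustration of this point (a sketch, not part of the PR): a distribution-only test cannot tell a submission from a shuffled copy of itself, while a row-aligned check can.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
preds = rng.uniform(0.45, 0.55, 10_000)   # some submission
shuffled = rng.permutation(preds)         # same values, different order

# Any distribution-level check sees identical data:
print(ks_2samp(preds, shuffled).statistic)     # 0.0, "identical"

# A row-aligned check sees they are unrelated prediction by prediction:
print(np.corrcoef(preds, shuffled)[0, 1])      # ~0, clearly not the same predictions
```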

A check on originality should be required to maintain the same sorting between whatever vectors it's checking (e.g. correlation, and other distance/similarity metrics).

mangstad avatar Oct 09 '17 18:10 mangstad

I agree with using another correlation measure and am currently implementing the Spearman correlation coefficient as another test alongside the Pearson correlation, so that we can determine whether there exists any monotonic relationship between submissions.

I am also looking into some distance metrics to implement as another originality measure and am thinking about using the mean Canberra distance. I think this could be useful since it should be robust against cheating via the ad hoc modifications shown in the PDF. Any thoughts on whether this metric is useful or not are welcome.
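A rough sketch of how those three row-aligned measures could sit together (the helper name and the normalization of the Canberra distance by vector length are assumptions):

```python
import numpy as np
from scipy.spatial.distance import canberra
from scipy.stats import spearmanr


def originality_checks(user_preds, other_preds):
    """Pearson correlation, Spearman correlation (catches monotonic
    relationships), and the mean Canberra distance between two
    row-aligned prediction vectors."""
    user = np.asarray(user_preds, dtype=float)
    other = np.asarray(other_preds, dtype=float)
    pearson = float(np.corrcoef(user, other)[0, 1])
    spearman = spearmanr(user, other)[0]
    mean_canberra = canberra(user, other) / len(user)
    return pearson, spearman, mean_canberra
```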

cpurta avatar Oct 09 '17 21:10 cpurta

@zoso95 I used your originality quality benchmark on my branch using the mean squared error and got the following results:

LogisticRegression 0.3900000000000002
NaiveBayes 0.4100000000000002
RandomForest 0.4000000000000002
QDA 0.4200000000000002
KNN 0.4200000000000002
SVC 0.3900000000000002
MLP 0.3900000000000002
AdaBoosting 0.3900000000000002
DecisionTree NAN

If I am following your code correctly, those numbers are a measure of how much noise one would have to add on top of those baseline predictions in order for them to be considered original. If the threshold needed to be original is greater than 0.5, then the submission may as well be just random noise.

So the greater the threshold on those models, the more robust the metric is against gaming, correct?
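If I am reading it right, the idea is roughly the following; this is a hypothetical reconstruction of the benchmark concept, not zoso95's actual code:

```python
import numpy as np


def noise_needed_to_be_original(base_preds, is_original, noise_levels=None, seed=0):
    """Return the smallest noise scale at which a noisy copy of base_preds
    passes the supplied is_original(candidate, base) check."""
    rng = np.random.default_rng(seed)
    base_preds = np.asarray(base_preds, dtype=float)
    if noise_levels is None:
        noise_levels = np.arange(0.01, 1.01, 0.01)
    for scale in noise_levels:
        noisy = np.clip(base_preds + rng.normal(0.0, scale, len(base_preds)), 0.0, 1.0)
        if is_original(noisy, base_preds):
            return scale          # e.g. ~0.39-0.42 in the numbers above
    return float("nan")           # never became original (cf. DecisionTree NAN)
```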

cpurta avatar Oct 17 '17 00:10 cpurta

Do we know what a reasonable threshold for spearman correlation is?

zoso95 avatar Oct 27 '17 00:10 zoso95

It seems like the threshold should be pretty high, or at least on par with the Pearson correlation threshold, since Spearman measures how well the relationship between two data sets can be described by a monotonic function.

cpurta avatar Oct 27 '17 15:10 cpurta