xskillscore
Feature request: ttest_ind
Adding a two-tailed test? https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
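Roughly what I'm asking for, sketched with plain numpy/scipy (not an existing xskillscore API; the variable names here are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
fcst_anom = rng.normal(0.5, 1.0, size=500)  # hypothetical forecast anomalies
reference = rng.normal(0.0, 1.0, size=500)  # hypothetical reference values

# Two-sided test of the null hypothesis that the two samples have equal means.
t_stat, p_value = stats.ttest_ind(fcst_anom, reference, equal_var=False)
print(t_stat, p_value)  # a small p_value means the sample means differ significantly
```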
Would this be different from https://esmtools.readthedocs.io/en/latest/api/esmtools.testing.ttest_ind_from_stats.html?
I think there are many tests that could be added to xskillscore in theory, but is the t-test applicable to forecasting, and hence to xskillscore?
I think my use case is relevant to forecasting (seeing whether the forecast anomaly values are statistically significantly different from some reference values), but I wasn't aware that xskillscore is limited to forecasting; I thought it was simply a collection of scores for use with xarray/dask. In fact, I think ttest_ind_from_stats would be better suited to xskillscore than esmtools, because ttest_ind isn't limited to ESM analysis (hence it lives in scipy). Also, ttest_ind is similar to pearson_r_p_value in my opinion: instead of doing statistical significance testing on a correlation, it would do it on differences.
We are entering the confusing web of scope creep :) My view is:
- `esmtools` is really just my personal package of nice methods I find myself using frequently. It happens to be focused on Earth System analysis, so I called it that. It's a junk drawer for functions, though.
- `xskillscore`'s tagline is "Metrics for verifying forecasts", so it should focus on statistical metrics for forecast verification. It can't really enter the territory of true skill scores, since you need three inputs: (a) the dynamical forecast, (b) the verification data, (c) the reference forecast (e.g. persistence or climatology). To get BSS, for instance, you need to run `xs.brier_score(a, b)` and `xs.brier_score(c, b)` and then derive from there (see the sketch below). It could be done, but seems like scope creep.
- `climpred` is a high-level package for forecast verification. It handles reference forecasts, bias correction, etc. That's where things like BSS should live, which will pull in low-level metrics from `xskillscore`.
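For concreteness, a minimal sketch of that BSS derivation with synthetic inputs (the variable roles follow (a)/(b)/(c) above; double-check the xskillscore docs for brier_score's exact argument order, here observations are passed first):

```python
import numpy as np
import xarray as xr
import xskillscore as xs

rng = np.random.default_rng(0)
# (b) binary verification, (a) forecast probabilities, (c) reference forecast
# probabilities (e.g. a constant climatological frequency) -- all synthetic here.
b = xr.DataArray((rng.random(100) > 0.5).astype(float), dims="time")
a = xr.DataArray(rng.random(100), dims="time")
c = xr.DataArray(np.full(100, 0.5), dims="time")

bs_forecast = xs.brier_score(b, a)    # Brier score of the forecast
bs_reference = xs.brier_score(b, c)   # Brier score of the reference
bss = 1 - bs_forecast / bs_reference  # BSS > 0 means skill over the reference
```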
But yes, my question would be: if you're looking for statistical significance that isn't bootstrapping, why not pearson_r_p_value? It uses a t-test there to get your p-value. However, I think you can derive the critical correlation coefficient (i.e. the minimum correlation needed for significance) by a simple manipulation of the t-statistic. So I would be for adding it here. In the end, you'd get a gridded t-test value for a comparison of two gridded results. I don't care if there's redundancy between here and esmtools, because obviously xskillscore is used much more widely than my personal package.
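For reference, a rough sketch of that manipulation (my own derivation, not library code): the t-statistic for a Pearson correlation is t = r * sqrt(n - 2) / sqrt(1 - r**2), so inverting it at the critical t gives the minimum correlation needed for significance.

```python
import numpy as np
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| significant at the two-sided level alpha for sample size n."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / np.sqrt(t_crit**2 + df)

print(critical_r(30))  # ~0.36: weaker correlations are not significant at the 5% level
```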
Since I am not calculating any correlations, I don't think pearson_r_p_value is relevant, but I could be wrong.
For example, I commonly use pearson_r / pearson_r_p_value for anomaly correlations, e.g. (forecast - climatology) correlated against (analysis - climatology).
In this case, I don't have a climatology; I simply have (forecast - reference), and I think a t-test would be more suitable here?
I could be wrong though~
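Roughly what I mean, with made-up DataArrays:

```python
import numpy as np
import xarray as xr
import xskillscore as xs
from scipy import stats

rng = np.random.default_rng(0)
analysis = xr.DataArray(rng.normal(0, 1, 60), dims="time")
forecast = analysis + rng.normal(0, 0.5, 60)                 # made-up forecast
reference = xr.DataArray(rng.normal(0, 1, 60), dims="time")  # made-up reference
climatology = 0.0                                            # placeholder climatology

# With a climatology: anomaly correlation, so pearson_r_p_value is the right tool.
p_corr = xs.pearson_r_p_value(forecast - climatology, analysis - climatology, dim="time")

# Without one: I just want to know whether two sets of values differ in their mean,
# so a two-sample t-test seems more appropriate.
t_stat, p_diff = stats.ttest_ind(forecast.values, reference.values)
```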
> `xskillscore`'s tagline is "Metrics for verifying forecasts", so it should focus on statistical metrics for forecast verification. It can't really enter the territory of true skill scores, since you need three inputs: (a) the dynamical forecast, (b) the verification data, (c) the reference forecast (e.g. persistence or climatology). To get BSS, for instance, you need to run `xs.brier_score(a, b)` and `xs.brier_score(c, b)` and then derive from there. It could be done, but seems like scope creep.
At the same time, the package name is xskillscore, so should I trust the package name or the tagline more, hah? Maybe xskillscore could be renamed to xscore or xverif, similar to how xarray was renamed from xray.
Anyway, I don't think ttest_ind is a skill score, so I think it could be added here, similar to pearson_r_p_value.
To add to the convo, I'm seeing tests appear in the ML space, e.g. https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.stat
I'm fine with adding these tests, as I'm personally interested in extending this package for use in ML workloads.
I see this repo as a host of ufuncs which can easily be applied to xarray objects (and pd.DataFrame.to_xarray()), where the ufuncs are all within the realm of how different/similar two distributions are (this started with deterministic metrics: correlation (and its p-value), RMSE, etc.; then we added probabilistic metrics and, recently, contingency tables). I'm fine with adding stat tests, which I imagine can be slotted in here: https://github.com/raybellwaves/xskillscore/blob/master/xskillscore/tests/test_metric_results_accurate.py#L28
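For example, a rough sketch of that pattern: wrapping scipy.stats.ttest_ind with xr.apply_ufunc on synthetic gridded data (not a proposed API, just the shape of the thing):

```python
import numpy as np
import xarray as xr
from scipy import stats

rng = np.random.default_rng(0)
dims = ("time", "lat", "lon")
a = xr.DataArray(rng.normal(0.3, 1, (40, 10, 20)), dims=dims)
b = xr.DataArray(rng.normal(0.0, 1, (40, 10, 20)), dims=dims)

# Make "time" the core dimension and let scipy compute along it,
# returning gridded (lat, lon) t-statistics and p-values.
t, p = xr.apply_ufunc(
    stats.ttest_ind,
    a,
    b,
    input_core_dims=[["time"], ["time"]],
    output_core_dims=[[], []],
    kwargs={"axis": -1},
)
```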
We just recently officially modified the climpred tagline to broaden our scope to "Verification of weather and climate forecasts." We could also broaden this package's tagline to be more general, something like: "Vectorized statistical assessments for forecasts and machine learning workloads."
The point, @raybellwaves, is to use these as minimization functions for training an ML model, for instance? And then of course testing like ttest_ind and https://github.com/raybellwaves/xskillscore/pull/176 can evaluate different model performances.
Moving tagline discussion to https://github.com/raybellwaves/xskillscore/issues/180
> testing like ttest_ind and #176 can evaluate different model performances
Yes I like the test module @aaronspring created in https://github.com/raybellwaves/xskillscore/pull/176
So to clarify, is it okay to add ttest_ind to xskillscore?
I view xskillscore as my one-stop shop for all xarray scoring~
@ahuang11, yes, I would add it to the new test.py module that @aaronspring is developing in #176. Another useful thing I have code for is a Fisher z-transformation to test for statistically significant differences between two grids of correlations.
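A rough sketch of the idea (not the code I actually have), assuming two correlation grids r1 and r2 computed from samples of size n1 and n2:

```python
import numpy as np
from scipy import stats

def fisher_z_test(r1, r2, n1, n2):
    """Two-sided p-value for H0: the two correlations are equal (elementwise)."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher z-transform
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    return 2 * stats.norm.sf(np.abs(z))

rng = np.random.default_rng(0)
r1 = rng.uniform(-0.8, 0.8, (10, 20))    # made-up grid of correlations
r2 = rng.uniform(-0.8, 0.8, (10, 20))
p = fisher_z_test(r1, r2, n1=40, n2=40)  # gridded p-values
```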
I've been testing xr.apply_ufunc with scipy.stats.ttest_ind and I feel like the runtime is kind of slow, especially when applying it for each lat/lon point. Maybe we could try rewriting it in numpy.
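Something like this is what I had in mind for the numpy rewrite: a vectorized Welch's t-test along the time axis (a sketch, not a drop-in replacement for scipy's version):

```python
import numpy as np
from scipy import stats

def ttest_ind_np(a, b, axis=0):
    """Welch's two-sample t-test, vectorized over all non-`axis` dimensions."""
    n1, n2 = a.shape[axis], b.shape[axis]
    m1, m2 = a.mean(axis=axis), b.mean(axis=axis)
    v1, v2 = a.var(axis=axis, ddof=1), b.var(axis=axis, ddof=1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / np.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2**2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    p = 2 * stats.t.sf(np.abs(t), df)
    return t, p

rng = np.random.default_rng(0)
a = rng.normal(0.3, 1, (40, 10, 20))  # (time, lat, lon)
b = rng.normal(0.0, 1, (40, 10, 20))
t, p = ttest_ind_np(a, b)             # gridded (lat, lon) t and p values
```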
Moved to https://github.com/xarray-contrib/xskillscore/issues/284; leaving this open.
see https://github.com/dougiesquire/xstatstests
I'm happy to port xstatstests into xskillscore if we think that's appropriate. It didn't feel right to me when I first thought about it, as they aren't skill scores, but maybe others feel differently?
I think it's fine to have them separate. We have very few tests here, and they are all related to whether one forecast is better than another.