eis_toolkit
185 add mahalanobis similarity
Mahalanobis distance
The idea:
- Mahalanobis "distance" (or similarity) extends the idea of measuring deviation from a mean in standard deviations to multiple, possibly correlated, variables.
- The main idea of this script is to measure how similar every location in a given GeoTIFF is to a (small) group of known mineral occurrences.
- Similarity is measured against the averaged features of the mineral occurrences, so even the known occurrences themselves do not get full similarity.
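The idea above can be sketched in a few lines of NumPy. This is only an illustration, not the code in this PR: the function name `mahalanobis_distance`, the random stand-in data and the use of the plain sample covariance are all assumptions made for the example.

```python
import numpy as np


def mahalanobis_distance(pixels, occurrences):
    """Distance of each pixel row to the mean of the occurrence rows.

    Both arrays have shape (n_rows, n_features); hypothetical helper.
    """
    mean = occurrences.mean(axis=0)
    # Sample covariance of the known occurrences (features in columns).
    cov = np.cov(occurrences, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = pixels - mean
    # Squared distance diff^T * cov_inv * diff, evaluated row by row.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)


# Random stand-in data: 30 known occurrences, 100 raster pixels, 3 bands.
rng = np.random.default_rng(0)
occurrences = rng.normal(size=(30, 3))
pixels = rng.normal(size=(100, 3))

distances = mahalanobis_distance(pixels, occurrences)
# The occurrences themselves are also some distance away from their own mean.
self_distances = mahalanobis_distance(occurrences, occurrences)
```

In a real run the rows of `pixels` would be the band values of each raster cell, and the result would be reshaped back to the raster grid.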
Inputs:
Files that can be used for testing can be found at https://seafile.utu.fi/d/766d85c069a540cf9931/
GeoTIFF with a different geophysical measurement in each band
Known mineral occurrences: the current testing implementation takes a CSV that contains values sampled from the GeoTIFF. This is only for alpha testing. Later revisions should instead ask for a shapefile, GeoPackage or similar file that has the locations of the known mineral occurrences, and then sample the GeoTIFF values at those locations. This would reduce the user's work considerably.
Outputs:
Printed:
- multivariate normality test results
- possible warnings
2 GeoTIFFs that have the same geographical extent as the original GeoTIFF:
- Mahalanobis similarity in standard deviations. Because there are usually more than 2 variables, values over 2 are to be expected even for the known occurrences (under multivariate normality the expected squared distance grows with the number of variables).
- P-values. Users should be told that p-values close to 1 mean high confidence in similarity, while values close to 0 mean dissimilarity. This can be confusing, as low p-values are usually considered "good".
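One common way to turn a Mahalanobis distance into the kind of p-value described above is the chi-squared distribution: if the data are multivariate normal, the squared distance approximately follows a chi-squared distribution with as many degrees of freedom as there are variables. The sketch below assumes that approach; it is not confirmed to be what this PR implements, and `similarity_p_value` is a hypothetical name.

```python
import numpy as np
from scipy.stats import chi2


def similarity_p_value(distance, n_features):
    """Upper-tail chi-squared probability of the squared distance.

    Close to 1: the point is near the occurrence mean.
    Close to 0: the point is far from it.
    """
    return chi2.sf(np.square(distance), df=n_features)


# Example distances (in standard deviations) for a 5-band raster.
distances = np.array([0.0, 1.0, 2.0, 3.0, 5.0])
p_values = similarity_p_value(distances, n_features=5)
```

With 5 bands, a distance of 2 standard deviations still gives a fairly high p-value, which is why values over 2 are unremarkable for the known occurrences.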
Notes:
- Feature importance should be determined beforehand, and only relevant features should be used for actual modelling.
- Mahalanobis assumes that the variables have a multivariate normal distribution. The user should be warned if the given data does not follow that distribution. The current implementation uses the Henze-Zirkler test. However, there is no single best way to test multivariate normality, so ideally multiple tests should be implemented.
- The acceptable p-value of the multivariate normality tests should be given as an option to the user.
- The results of the multivariate normality test(s) should be printed to the user in any case.
- Similarity values are measured in standard deviations
- Because the squared Mahalanobis distance follows a known distribution under the normality assumption, it also yields probability values (p-values), which is very useful because there are usually not many known mineral occurrences.
- A high p-value (close to 1) means that the point is similar to the average of the known occurrences with strong confidence.
- A low p-value (close to 0) means that the point is not similar to the average of the known occurrences.
- There are multiple ways to estimate the covariance matrix, and these should be given as options to the user. Maximum likelihood estimation (MLE) is common, but not robust to noise. Minimum covariance determinant (MCD) estimation is more robust and should be the default. Other possibilities could also be added in the future.
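The covariance option could look roughly like the sketch below, which uses scikit-learn's `EmpiricalCovariance` (MLE) and `MinCovDet` (MCD). The function name `fit_covariance`, the `method` parameter and the synthetic data are assumptions for illustration, not the interface proposed in this PR.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet


def fit_covariance(samples, method="mcd", random_state=0):
    """Fit a covariance estimator; `method` is a hypothetical option name."""
    if method == "mcd":
        # Robust estimate: searches for the subset with the smallest determinant.
        return MinCovDet(random_state=random_state).fit(samples)
    if method == "mle":
        # Classical maximum likelihood estimate, sensitive to outliers.
        return EmpiricalCovariance().fit(samples)
    raise ValueError(f"Unknown covariance method: {method}")


# Synthetic data: 200 clean samples plus 10 shifted, noisy ones.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 3))
outliers = rng.normal(loc=10.0, size=(10, 3))
samples = np.vstack([clean, outliers])

mle = fit_covariance(samples, method="mle")
mcd = fit_covariance(samples, method="mcd")

# Both estimators return SQUARED Mahalanobis distances from .mahalanobis().
mle_d2 = mle.mahalanobis(outliers)
mcd_d2 = mcd.mahalanobis(outliers)
```

In this setup the noisy samples inflate the MLE covariance and so look less unusual to it, whereas MCD largely ignores them when fitting and assigns them much larger distances. MCD needs clearly more samples than features, which may be a constraint when only a handful of occurrences are known.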
Author: Iiro Seppä, University of Turku
"Mahalanobis assumes that the variables have multivariate normal distribution."
- Which is basically never the case in this domain.
Should we add this to the toolkit under a category such as experimental or similar? That could be a compromise solution that acknowledges the limitations/weak applicability of the method in MPM without discarding the whole tool. Any thoughts on what to do with this, @RichardScottOZ, @jtorppa, @nialov, @iiroseppa?
Closing this PR now since it seems we do not intend to include this tool, at least for now.