redflag
Check for similar strings in categorical variables
Check the edit distance between each pair of unique values in low-cardinality categorical variables. (With high cardinality, close labels might be expected anyway, and checking every pairwise comparison might take too long.)
E.g. to catch things like Sandstone / sandstone or shale / shales
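A rough sketch of what this check could look like, in plain Python with no dependencies. The function names `levenshtein` and `similar_labels`, and the `max_distance` and `max_cardinality` thresholds, are hypothetical and only here for illustration; none of this is existing redflag API.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similar_labels(values, max_distance=1, max_cardinality=100):
    """Return (label, label, distance) for pairs of unique labels that are
    within max_distance edits of each other.

    Both thresholds are assumptions for this sketch. If there are more than
    max_cardinality unique labels, the check is skipped and [] is returned.
    """
    uniques = sorted({str(v) for v in values})
    if len(uniques) > max_cardinality:
        return []
    pairs = []
    for a, b in combinations(uniques, 2):
        d = levenshtein(a, b)
        if d <= max_distance:
            pairs.append((a, b, d))
    return pairs


labels = ['Sandstone', 'sandstone', 'shale', 'shales', 'limestone']
print(similar_labels(labels))
```

The number of comparisons grows quadratically with the number of unique labels, which is why the sketch bails out above a cardinality limit.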
Could also check for apparent abbreviations like fine sand / fs or limestone / limest.
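One possible heuristic for the abbreviation case, again as a sketch only: flag a pair when the shorter label is a prefix of the longer one (limest / limestone), or when it matches the initials of a multi-word label (fs / fine sand). The names `is_prefix_abbreviation`, `is_initialism` and `apparent_abbreviations`, and the `min_prefix` parameter, are made up for this example.

```python
from itertools import combinations


def is_prefix_abbreviation(short, long, min_prefix=3):
    """True if short (ignoring a trailing '.') is a proper prefix of long."""
    s = short.rstrip('.').lower()
    return min_prefix <= len(s) < len(long) and long.lower().startswith(s)


def is_initialism(short, long):
    """True if short matches the first letters of a multi-word long label."""
    words = long.lower().split()
    initials = ''.join(w[0] for w in words)
    return len(words) > 1 and short.rstrip('.').lower() == initials


def apparent_abbreviations(values, min_prefix=3):
    """Return (short, long) pairs where short looks like an abbreviation.

    min_prefix is an assumed threshold that stops very short strings from
    matching as prefixes of everything.
    """
    # Sort by length so the first item of each pair is never the longer one.
    uniques = sorted({str(v).strip() for v in values},
                     key=lambda s: (len(s), s))
    pairs = []
    for short, long in combinations(uniques, 2):
        if (is_prefix_abbreviation(short, long, min_prefix)
                or is_initialism(short, long)):
            pairs.append((short, long))
    return pairs


labels = ['fine sand', 'fs', 'limestone', 'limest.', 'shale']
print(apparent_abbreviations(labels))
```

Note that the prefix rule also fires on pairs like shale / shales, so this overlaps with the edit-distance check and the results would need de-duplicating.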
A potentially useful library for this: https://github.com/life4/textdistance