How should missing values be handled for the Jaccard metric?
My case is roll call data in which members of parliament can be absent (in a non-meaningful way) or hold their seat only temporarily, so those votes are encoded as missing values. However, I get errors because of the missing data. Since the Jaccard metric itself is agnostic to the amount of available information, is there a way to handle this?
Based on how Jaccard is defined, I would code them as zeros. I presume, however, that you actually want to distinguish them from votes against, which raises the question of which metric you should be using. It might actually make some sense to code votes for as 1, votes against as -1, and abstentions and absences as 0, and use cosine distance?
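To make the suggestion concrete, here is a minimal sketch of that encoding with SciPy's built-in cosine metric. The toy `votes` matrix is made up for illustration, and note that cosine distance is undefined for a member whose row is all zeros, so such rows would need separate handling.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy roll call: rows are members, columns are votes.
# +1 = vote for, -1 = vote against, 0 = abstention or absence.
votes = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)

# Pairwise cosine distances (1 - cosine similarity), expanded to a square matrix.
cosine_dist = squareform(pdist(votes, metric="cosine"))
print(cosine_dist)
```

Members who mostly agree end up close together, members who vote opposite ways end up with a distance near 2, and zeros contribute nothing to the dot product.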
I ended up precalculating a distance matrix and providing that, so that I only compare actual voting decisions. That excludes absences, which are not meaningful in a Swedish context, but keeps abstentions, which are intentional. Here is my R code, which may benefit someone else (I am using umap through reticulate):
# vectors a and b should be of equal length, e.g. the full voting record of an individual
# this calculates the distances pairwise
jaccard_dist <- function(a, b) {
  if (length(a) != length(b)) {
    stop("Unequal lengths")
  }
  # positions where both votes are recorded and agree (NA comparisons are dropped)
  intersection <- sum(a == b, na.rm = TRUE)
  # recorded votes in a plus recorded votes in b, minus the agreements
  union <- length(na.omit(a)) + length(na.omit(b)) - intersection
  output <- 1 - (intersection / union)
  return(output)
}
# to create a distance matrix
usedist::dist_make(dat_mat, jaccard_dist) -> dist_output
And here is ChatGPT's translation into Python 😀
import numpy as np
from scipy.spatial.distance import pdist, squareform


def jaccard_dist(a, b):
    if len(a) != len(b):
        raise ValueError("Unequal lengths")
    # Count agreements only where both votes are recorded (non-NaN)
    intersection = np.sum(np.equal(a, b), where=~np.isnan(a) & ~np.isnan(b))
    # Recorded votes in a plus recorded votes in b, minus the agreements
    union = (len(a) - np.isnan(a).sum()) + (len(b) - np.isnan(b).sum()) - intersection
    output = 1 - (intersection / union)
    return output


def dist_make(dat_mat, distance_function):
    # pdist returns condensed pairwise distances; squareform expands them to a full matrix
    dist_output = squareform(pdist(dat_mat, metric=distance_function))
    return dist_output


# To create a distance matrix
# Replace 'data_matrix' with the actual data matrix you are working with.
dist_output = dist_make(data_matrix, jaccard_dist)
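For anyone who wants to go from there to an embedding, here is a rough sketch of handing a precomputed matrix to UMAP. It assumes the umap-learn package and its `metric="precomputed"` option. The toy `votes` data, the helper names `masked_jaccard` and `embed_precomputed`, and the choice to treat two members with no recorded votes as maximally distant are my own assumptions for illustration, not part of the code above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def masked_jaccard(a, b):
    # Same formula as jaccard_dist above: a NaN never counts as an agreement.
    both = ~np.isnan(a) & ~np.isnan(b)
    intersection = np.sum(a[both] == b[both])
    union = np.sum(~np.isnan(a)) + np.sum(~np.isnan(b)) - intersection
    # Assumption: with no recorded votes at all, call the pair maximally
    # distant instead of dividing by zero.
    return 1.0 if union == 0 else 1.0 - intersection / union


# Hypothetical toy roll call: rows are members, columns are votes,
# NaN marks an absence.
votes = np.array([
    [1.0, 0.0, 1.0, np.nan],
    [1.0, 0.0, np.nan, 1.0],
    [0.0, 1.0, 0.0, 0.0],
])

dist_matrix = squareform(pdist(votes, metric=masked_jaccard))
print(dist_matrix)


def embed_precomputed(dist_matrix, **umap_kwargs):
    # Imported lazily so the distance part runs without umap-learn installed.
    import umap
    reducer = umap.UMAP(metric="precomputed", **umap_kwargs)
    return reducer.fit_transform(dist_matrix)
```

On real data you would call something like `embed_precomputed(dist_matrix, n_neighbors=15)`. The three-row toy matrix is only there to check the distances and is too small to embed meaningfully.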