How should missing values be handled for Jaccard metric?

Open StaffanBetner opened this issue 2 years ago • 2 comments

I'm working with roll call data in which members of parliament can be away (in a non-meaningful way) or be in their seat only temporarily, so those votes are encoded as missing values. However, I get errors due to the missing data. Since the Jaccard metric itself is agnostic to the amount of available information, is there any way to handle this?

StaffanBetner, Mar 29 '23 08:03

Based on how Jaccard is defined I would code them as zeros. I presume, however, that you actually want to distinguish them from votes against -- which raises questions about what metric you should be using. It might actually make some sense to code votes for as 1, votes against as -1, and abstentions and absences as 0 and use cosine distance?

lmcinnes, Mar 29 '23 14:03
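The encoding suggested in the reply above (for = 1, against = -1, abstention or absence = 0, compared with cosine distance) can be sketched in Python as follows. This is only an illustration under assumptions: the `votes` matrix and the `encoding` mapping are made-up examples, and scipy's `pdist` is used here just to inspect the resulting distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical roll-call matrix: rows are members, columns are votes.
# None marks an abstention or absence.
votes = [
    ["yes", "no", "yes", None],
    ["yes", "no", None, "no"],
    ["no", "yes", "no", "yes"],
]

# Encoding from the suggestion: for = 1, against = -1, abstention/absence = 0.
encoding = {"yes": 1.0, "no": -1.0, None: 0.0}
encoded = np.array([[encoding[v] for v in row] for row in votes])

# Pairwise cosine distances; a 0 entry adds nothing to the dot product.
cosine_dist = squareform(pdist(encoded, metric="cosine"))

# With umap-learn installed, the encoded matrix could be embedded directly:
# embedding = umap.UMAP(metric="cosine").fit_transform(encoded)
```

One caveat with this encoding: a member with no recorded votes becomes an all-zero row, for which cosine distance is undefined, so such rows would need to be dropped first.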

I ended up precalculating a distance matrix and passing that in, so that I only compare actual voting decisions, i.e. not absences, which are not meaningful in a Swedish context, in contrast to abstentions, which carry an intentional meaning. Here is my R code, which may benefit someone else (I am using umap through reticulate):

# vectors a and b should be equal length, e.g. the full voting record of an individual
# this calculates the distances pairwise
jaccard_dist <- function(a, b) {
  if (length(a) != length(b)) {
    stop("Unequal lengths")
  }
  # positions where both votes are present and equal (comparisons involving NA are dropped)
  intersection <- sum(a == b, na.rm = TRUE)
  union <- length(na.omit(a)) + length(na.omit(b)) - intersection
  1 - (intersection / union)
}

# to create a distance matrix
dist_output <- usedist::dist_make(dat_mat, jaccard_dist)

And here is ChatGPT's translation into Python 😀

import numpy as np
from scipy.spatial.distance import pdist, squareform

def jaccard_dist(a, b):
    # a and b are equal-length vote vectors; NaN marks a missing vote
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if len(a) != len(b):
        raise ValueError("Unequal lengths")

    # positions where both votes are present and equal
    both_present = ~np.isnan(a) & ~np.isnan(b)
    intersection = np.sum((a == b) & both_present)
    # non-missing entries in a plus those in b, minus the matches
    union = np.sum(~np.isnan(a)) + np.sum(~np.isnan(b)) - intersection
    return 1 - (intersection / union)

def dist_make(dat_mat, distance_function):
    # pdist returns condensed pairwise distances; squareform expands them to a full matrix
    return squareform(pdist(dat_mat, metric=distance_function))

# To create a distance matrix
# Replace 'data_matrix' with the actual data matrix you are working with.
dist_output = dist_make(data_matrix, jaccard_dist)

StaffanBetner, Mar 29 '23 14:03
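For larger roll-call matrices, calling a Python function once per pair can be slow. Below is a minimal vectorized sketch of the same formula, 1 - matches / (n_a + n_b - matches), assuming votes are coded as numbers and missing votes as NaN. The function name `nan_aware_match_distance` and the example matrix are hypothetical and not part of the thread.

```python
import numpy as np

def nan_aware_match_distance(X):
    """Pairwise distances between the rows of X, where NaN marks a missing vote."""
    X = np.asarray(X, dtype=float)
    valid = ~np.isnan(X)
    n_valid = valid.sum(axis=1)  # non-missing votes per row

    # Count positions where both rows are present and hold the same value.
    matches = np.zeros((X.shape[0], X.shape[0]))
    for value in np.unique(X[valid]):
        indicator = (X == value).astype(float)  # NaN never equals a value
        matches += indicator @ indicator.T

    union = n_valid[:, None] + n_valid[None, :] - matches
    ratio = np.full_like(matches, np.nan)  # stays NaN if both rows are entirely missing
    np.divide(matches, union, out=ratio, where=union > 0)
    return 1.0 - ratio

# Made-up example: 1 = for, 0 = against, NaN = absent.
example = np.array([
    [1, 0, 1, np.nan],
    [1, 0, np.nan, 0],
    [0, 1, 0, 1],
], dtype=float)
D = nan_aware_match_distance(example)

# With umap-learn installed, a precomputed matrix can be passed like this:
# embedding = umap.UMAP(metric="precomputed").fit_transform(D)
```

Note that the denominator counts every non-missing entry of each row, so a vote cast by only one of the two members still increases their distance. Restricting the counts to jointly observed positions (for example via `valid @ valid.T`) would be a different design choice.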