nxontology

Generating a numpy array of pairwise similarity scores

Open dhimmel opened this issue 1 year ago • 0 comments

Here's some example code to generate pairwise Lin similarity scores for all node pairs in an NXOntology. For now I'm just posting it in case it's helpful, but we could also add a function that populates a matrix with a chosen similarity metric.

import logging
from itertools import combinations_with_replacement

import fsspec
import numpy as np
import numpy.typing as npt
from nxontology import NXOntology


def generate_similarity_matrix(nxo: NXOntology[str]) -> npt.NDArray[np.float32]:
    nxo.freeze()
    nodes = list(nxo.graph.nodes)
    # ensure nodes are sorted, since matrix does not store row/column names
    assert sorted(nodes) == nodes
    similarity_array = np.zeros(shape=(nxo.n_nodes, nxo.n_nodes), dtype=np.float32)
    logging.info(
        f"Initialized {similarity_array.shape} array:\n{similarity_array[:5, :5]}"
    )
    # lin is symmetric, so we use combinations_with_replacement rather than product
    for (row, row_node), (col, col_node) in combinations_with_replacement(
        enumerate(nodes), r=2
    ):
        similarity = nxo.similarity(row_node, col_node)
        similarity_array[row, col] = similarity.lin
        # mirror into the lower triangle; only valid for symmetric metrics
        similarity_array[col, row] = similarity.lin
    logging.info(f"Populated array with similarity:\n{similarity_array[:5, :5]}")
    return similarity_array


similarity_array = generate_similarity_matrix(nxo)
path = "similarity-lin.npy.xz"
with fsspec.open(path, "wb", compression="infer") as write_file:
    np.save(write_file, similarity_array)
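
A function that populates a matrix for any similarity metric might look roughly like the sketch below. It is not part of nxontology: the names `similarity_matrix`, `pair_score`, and `toy_score` are hypothetical, and the metric is assumed to be symmetric so that only the upper triangle has to be computed. The toy metric is there only so the sketch runs without an ontology.

```python
from itertools import combinations_with_replacement
from typing import Callable, Sequence

import numpy as np
import numpy.typing as npt


def similarity_matrix(
    nodes: Sequence[str],
    pair_score: Callable[[str, str], float],
) -> npt.NDArray[np.float32]:
    """Fill a square matrix with pair_score for every pair of nodes.

    Assumes pair_score is symmetric: each unordered pair is scored once
    and the value is written to both [row, col] and [col, row].
    """
    matrix = np.zeros(shape=(len(nodes), len(nodes)), dtype=np.float32)
    for (row, row_node), (col, col_node) in combinations_with_replacement(
        enumerate(nodes), r=2
    ):
        score = pair_score(row_node, col_node)
        matrix[row, col] = score
        matrix[col, row] = score
    return matrix


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in metric: Jaccard overlap of the characters."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


toy_matrix = similarity_matrix(["ab", "abc", "xyz"], toy_score)
```

With an NXOntology, `pair_score` could presumably be something like `lambda a, b: nxo.similarity(a, b).lin`, with `nodes` being the sorted node list used above.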

On EFO, saving as an XZ-compressed npy file worked well. scipy.sparse matrices are another option, but whether they are faster or slower to work with depends on how sparse the matrix is and on which operations you need.
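
Because the array does not store row or column names, it can help to save the node order next to it. Below is a minimal round-trip sketch under some assumptions: it uses the standard-library `lzma` module instead of fsspec, the node names and scores are made-up stand-ins, and the JSON sidecar file and the `lookup` helper are hypothetical conventions rather than anything nxontology provides.

```python
import json
import lzma
import tempfile
from pathlib import Path

import numpy as np

# Stand-in data; in practice these would be the sorted graph nodes and
# the array returned by generate_similarity_matrix.
nodes = ["node_a", "node_b", "node_c"]
similarity_array = np.array(
    [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]], dtype=np.float32
)

out_dir = Path(tempfile.mkdtemp())
array_path = out_dir / "similarity-lin.npy.xz"
nodes_path = out_dir / "similarity-lin-nodes.json"  # hypothetical sidecar file

# Write the XZ-compressed array and the node order it was built with.
with lzma.open(array_path, "wb") as write_file:
    np.save(write_file, similarity_array)
nodes_path.write_text(json.dumps(nodes))

# Read both back and rebuild the name-to-index mapping.
with lzma.open(array_path, "rb") as read_file:
    loaded = np.load(read_file)
node_to_index = {
    node: index for index, node in enumerate(json.loads(nodes_path.read_text()))
}


def lookup(node_0: str, node_1: str) -> float:
    """Return the stored similarity for a pair of node names."""
    return float(loaded[node_to_index[node_0], node_to_index[node_1]])
```

Keeping the node list in a separate file means the array itself stays a plain npy file that any NumPy reader can load.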

dhimmel · Apr 03 '23 13:04