rust-kmedoids

Add CLARA, FastCLARA, FasterCLARA

Open kno10 opened this issue 1 year ago • 0 comments

CLARA roughly does:

  • subsample the data
  • run PAM (FastCLARA: FastPAM, FasterCLARA: FasterPAM) on the sample
  • compute the total deviation on the entire data set for these medoids
  • return the best result found with multiple subsamples
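
To make the four steps concrete, here is a rough, dependency-free sketch. Everything in it is an assumption for illustration and not the crate's API: the names `clara`, `naive_pam`, `total_deviation` and `XorShift` are made up, Euclidean distance is hard-coded, and a naive first-improvement swap search stands in for PAM/FasterPAM.

```rust
/// Euclidean distance (assumed here; the crate itself ships no distance functions).
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Draw `m` distinct indices from `0..n` with a partial Fisher-Yates shuffle.
    fn sample(&mut self, n: usize, m: usize) -> Vec<usize> {
        let mut idx: Vec<usize> = (0..n).collect();
        for i in 0..m {
            let j = i + (self.next() as usize) % (n - i);
            idx.swap(i, j);
        }
        idx.truncate(m);
        idx
    }
}

/// Sum of distances from every point of the full data set to its nearest medoid.
fn total_deviation(data: &[Vec<f64>], medoids: &[usize]) -> f64 {
    data.iter()
        .map(|p| medoids.iter().map(|&m| euclidean(p, &data[m])).fold(f64::INFINITY, f64::min))
        .sum()
}

/// Naive swap search on a precomputed distance matrix (a stand-in for PAM/FasterPAM).
/// Returns medoids as row indices of `dist`.
fn naive_pam(dist: &[Vec<f64>], k: usize) -> Vec<usize> {
    let n = dist.len();
    let cost = |med: &[usize]| -> f64 {
        (0..n).map(|i| med.iter().map(|&m| dist[i][m]).fold(f64::INFINITY, f64::min)).sum()
    };
    let mut med: Vec<usize> = (0..k).collect(); // the sample is already in random order
    let mut best = cost(&med);
    loop {
        let mut improved = false;
        for pos in 0..k {
            for cand in 0..n {
                if med.contains(&cand) {
                    continue;
                }
                let old = med[pos];
                med[pos] = cand;
                let c = cost(&med);
                if c < best {
                    best = c; // keep the improving swap
                    improved = true;
                } else {
                    med[pos] = old; // undo the swap
                }
            }
        }
        if !improved {
            return med;
        }
    }
}

/// CLARA sketch: returns (medoid indices into `data`, total deviation on all of `data`).
fn clara(data: &[Vec<f64>], k: usize, sample_size: usize, repeats: usize, seed: u64) -> (Vec<usize>, f64) {
    assert!(k >= 1 && k <= sample_size && sample_size <= data.len());
    let mut rng = XorShift(seed | 1); // xorshift state must be nonzero
    let mut best = (Vec::new(), f64::INFINITY);
    for _ in 0..repeats {
        // 1. subsample the data
        let sample = rng.sample(data.len(), sample_size);
        // 2. distance matrix on the sample only, then k-medoids on that matrix
        let dist: Vec<Vec<f64>> = sample
            .iter()
            .map(|&i| sample.iter().map(|&j| euclidean(&data[i], &data[j])).collect())
            .collect();
        let medoids: Vec<usize> = naive_pam(&dist, k).into_iter().map(|m| sample[m]).collect();
        // 3. total deviation on the entire data set
        let td = total_deviation(data, &medoids);
        // 4. keep the best result over all subsamples
        if td < best.1 {
            best = (medoids, td);
        }
    }
    best
}
```

Note that only a `sample_size × sample_size` matrix is ever materialized; the full data set is touched only in step 3, at a cost of `n × k` distance computations per subsample.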

This may seem like a trivial addition at first (and it would indeed be only a few lines in the Python wrapper), BUT:

  • this package currently does not include any distance functions; it operates on precomputed distance matrices only
  • if you already have the distance matrix, just use FasterPAM and you will be fine
  • a meaningful implementation of these computes the distance matrix only on the subsample, which requires a data matrix and distance functions as input
  • for many users it will still be more convenient to handle the subset/sample within their own application

Hence a rough implementation plan would be:

  • design an API for computing distances that works for typical users (Python wrapper, native Rust users)
  • implement a decent choice of distance functions
  • implement CLARA
  • tests
  • update the Python wrapper
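
As a starting point for the first two items, one possible shape for the distance API is a small trait plus a helper that builds the matrix for selected rows only. This is a sketch under assumptions: the `Distance` trait, the `Euclidean` and `Manhattan` structs, and `submatrix` are hypothetical names, and dense `Vec<Vec<f64>>` data is assumed rather than whatever matrix type the crate would actually accept.

```rust
/// Hypothetical distance trait; not part of the current crate.
pub trait Distance {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64;
}

pub struct Euclidean;
pub struct Manhattan;

impl Distance for Euclidean {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
    }
}

impl Distance for Manhattan {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

/// Pairwise distance matrix restricted to the given rows of `data`,
/// i.e. exactly what CLARA needs for one subsample.
pub fn submatrix<D: Distance>(data: &[Vec<f64>], rows: &[usize], d: &D) -> Vec<Vec<f64>> {
    rows.iter()
        .map(|&i| rows.iter().map(|&j| d.distance(&data[i], &data[j])).collect())
        .collect()
}
```

A trait with a generic parameter keeps the distance call statically dispatched for Rust users; the Python wrapper could map a string such as a metric name onto the concrete types, though that mapping is not sketched here.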

Adding distance functions will also be necessary for CLARANS (#6), BanditPAM (#2), and coreset approaches (#4).

kno10, Dec 11 '23 08:12