heat icon indicating copy to clipboard operation
heat copied to clipboard

Speed up `ht.percentile`

Open ClaudiaComito opened this issue 11 months ago • 4 comments

Feature functionality ht.percentile() performance seems disproportionately slow, even considering that it potentially requires sorting along the split axis.

This function should be heavily refactored or even reimplemented from scratch. I vaguely remember implementing it myself back in the day. I'm afraid to look into that code.

Additional context Code snippet to benchmark (I was running it on 4 processes on HDFML - on GPU it even deadlocks) based on @mrfh92 's preprocessing tutorial.

import heat as ht
import time

start = time.perf_counter()
X = ht.load_hdf5("/p/scratch/training2404/data/JPL_SBDB/sbdb_asteroids.h5",dataset="data",split=0, device="cpu")
end = time.perf_counter()
print(f"Loading data on {X.comm.size} ranks took {end-start} seconds")

if X.comm.rank == 0:
    print(f"X is a {X.ndim}-dimensional array with shape{X.shape}")
    print(f"X takes up {X.nbytes/1e6} MB of memory.")


start = time.perf_counter()
features_mean = ht.mean(X,axis=0)
end = time.perf_counter()

if X.comm.rank == 0:
    print(f"Mean on {X.comm.size} ranks took {end-start} seconds")

start = time.perf_counter()
features_median = ht.percentile(X,50.,axis=0)
end = time.perf_counter()

if X.comm.rank == 0:
    print(f"Median on {X.comm.size} ranks took {end-start} seconds")
Loading data on 4 ranks took 0.015628864999825964 seconds
X is a 2-dimensional array with shape(1317275, 9)
X takes up 47.4219 MB of memory.
Mean on 4 ranks took 0.004456439000023238 seconds
Median on 4 ranks took 62.82521499199993 seconds

ClaudiaComito avatar Feb 29 '24 16:02 ClaudiaComito

As far as I understand, computing percentiles in a distributed setting might really pose a problem in terms of quite heavy communication as re-ordering might be necessary. An option might be to add a faster variant using sketching, i.e. drawing a "small" sample from the data set and computing percentiles for this sample.

Maybe https://arxiv.org/pdf/2004.08604.pdf helps, although this algorithms has been designed for 1d (?) data streams.

mrfh92 avatar Mar 11 '24 07:03 mrfh92

I have implemented a version of ht.percentile that estimates the true percentile only on behalf of a random sketch of the entire data set. #1420

mrfh92 avatar Apr 04 '24 08:04 mrfh92

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Jun 17 '24 02:06 github-actions[bot]

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Aug 26 '24 02:08 github-actions[bot]