heat
heat copied to clipboard
Speed up `ht.percentile`
Feature functionality
ht.percentile()
performance seems disproportionately slow, even considering that it potentially requires sorting along the split axis.
This function should be heavily refactored or even reimplemented from scratch. I vaguely remember implementing it myself back in the day. I'm afraid to look into that code.
Additional context Code snippet to benchmark (I was running it on 4 processes on HDFML - on GPU it even deadlocks) based on @mrfh92 's preprocessing tutorial.
import heat as ht
import time
start = time.perf_counter()
X = ht.load_hdf5("/p/scratch/training2404/data/JPL_SBDB/sbdb_asteroids.h5",dataset="data",split=0, device="cpu")
end = time.perf_counter()
print(f"Loading data on {X.comm.size} ranks took {end-start} seconds")
if X.comm.rank == 0:
print(f"X is a {X.ndim}-dimensional array with shape{X.shape}")
print(f"X takes up {X.nbytes/1e6} MB of memory.")
start = time.perf_counter()
features_mean = ht.mean(X,axis=0)
end = time.perf_counter()
if X.comm.rank == 0:
print(f"Mean on {X.comm.size} ranks took {end-start} seconds")
start = time.perf_counter()
features_median = ht.percentile(X,50.,axis=0)
end = time.perf_counter()
if X.comm.rank == 0:
print(f"Median on {X.comm.size} ranks took {end-start} seconds")
Loading data on 4 ranks took 0.015628864999825964 seconds
X is a 2-dimensional array with shape(1317275, 9)
X takes up 47.4219 MB of memory.
Mean on 4 ranks took 0.004456439000023238 seconds
Median on 4 ranks took 62.82521499199993 seconds
As far as I understand, computing percentiles in a distributed setting might really pose a problem in terms of quite heavy communication as re-ordering might be necessary. An option might be to add a faster variant using sketching, i.e. drawing a "small" sample from the data set and computing percentiles for this sample.
Maybe https://arxiv.org/pdf/2004.08604.pdf helps, although this algorithms has been designed for 1d (?) data streams.
I have implemented a version of ht.percentile
that estimates the true percentile only on behalf of a random sketch of the entire data set. #1420
Branch features/1389-Speed_up_ht_percentile created!
This issue is stale because it has been open for 60 days with no activity.
This issue is stale because it has been open for 60 days with no activity.