hdbscan icon indicating copy to clipboard operation
hdbscan copied to clipboard

Can I do Spatio-temporal clustering with HDBSCAN ?

Open Stephane987 opened this issue 6 years ago • 9 comments

Hi All,

I've a dataset with latitude, longitude and date-time. Is there a way to do a spatio-temporal clustering that includes the 3 features? Havresine is accepting only 2 features (lat lon in radians). I can scale/normalize the 3 features and use Euclidian, but I believe that I will lose the impact of the physical distance between points.

Any recommendation?

Thanks

Stephane987 avatar Jun 26 '19 00:06 Stephane987

The hard bit here is to work out how distance in seconds relates to distance in radians. Those are fundamentally two totally different scales. Any density based clustering technique is going to require a consistent distance between your points. If you had few enough points you could compute the Haversine distance between points and the temporal distance between your points and fold those together via a scaling that you felt was an appropriate between radian and seconds. Then you could hand that to hdbscan as a pre-computed distance matrix. Just remember that your scaling will change your clustering results.

If you wanted to go deeper down the rabbit hole of combining data with different concepts of distance on different sets of it's features then I'd suggest starting off by reading The 2016 paper "Generalized Low Rank Models" by Udell, et all. Don't be intimidated by the size of the paper it's a very smooth read with lots of examples.

jc-healy avatar Jul 03 '19 15:07 jc-healy

I have a time distance matrix to create a temporal clusterer. What's the best way to chain it with the Euclidean distance based clusterer to obtain spatio-temporal clustering?

gewitterblitz avatar Dec 04 '22 06:12 gewitterblitz

Nevermind, I figured it out.

gewitterblitz avatar Dec 04 '22 07:12 gewitterblitz

Nevermind, I figured it out.

Hi, @gewitterblitz , do you mind sharing code template for the time-distance matrix and your spatio-temporal clustering solution ?

Stephane987 avatar Dec 05 '22 01:12 Stephane987

@Stephane-Thales: I tried the following method but I am not sure if it's the best way.

import hdbscan
from scipy.spatial.distance import pdist

# data is of the form ['time','longitude','latitude']

def compute_squared_EDM(X):
    return squareform(pdist(X, metric='euclidean'))

hdb = hdbscan.HDBSCAN(metric='precomputed',cluster_selection_epsilon=20,min_cluster_size=5)
n, m = data.shape
timeDisMat=compute_squared_EDM(data[:,0].reshape(n, 1)) 
yy = hdb.fit(timeDisMat)
yy.cluster_selection_epsilon = 1e-5
yy.min_cluster_size = 5
disMat = compute_squared_EDM(data[:,1:3])
zz = yy.fit(disMat)

FWIW: This did not help me much because my data had some points too close to each other in space but distant in time and the hdbscan output still clustered the time-distant points into one cluster. Let me know if you have found something better.

gewitterblitz avatar Dec 05 '22 21:12 gewitterblitz

Hi @gewitterblitz ,

Is it possible that your 2nd .fit() is overwriting the 1st and all the part with timeDisMat is lost? DBScan does not have a partial_fit() feature only MiniBatchKMeans have one: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning

Above, jc-healy mentioned:

compute the Haversine distance between points and the temporal distance between your points and fold those together via a scaling that you felt was an appropriate between radian and seconds.

So: a) finding an appropriate rescaling for the 2 matrixes timeDisMat and disMat b) fusing the matrix. In R there is a fuse function, can't find in python, so you can do sqrt(timeDisMat^2 + disMat^2) as squared Euclidean distances are additive according to here c) running the dbscan once

I've no time to try... but if you have, please share your result and code ?

Stephane987 avatar Dec 06 '22 01:12 Stephane987

Yes, you are right. The first just gets overwritten instead of acting on top of the first cluster classification.

I tried your suggestion but still don't get convincing results. It's most likely due to the uneven scaling of the time and distance dimensions. Not sure if there's an easy way to figure out the correct rescaling method.

I tried 'normalization' ((x-x.min)/(x.max-x.min)) and the StandardScaler function from sklearn.preprocessing but both of these didn't help much.

gewitterblitz avatar Dec 06 '22 01:12 gewitterblitz

Perhaps the data need to be normalized also if the shapes are too different ?

using StandardScaler ?

Stephane987 avatar Dec 06 '22 02:12 Stephane987