dtaidistance icon indicating copy to clipboard operation
dtaidistance copied to clipboard

Support for cdist

Open bilelomrani1 opened this issue 2 years ago • 7 comments

Currently it is possible to compute a distance matrix between all pairs of time series within a collection with dtaidistance.dtw_ndim.distance_matrix, but it is not possible to (efficiently) compute distances between each pair of two collections of inputs, something akin to scipy.spatial.distance.cdist but for DTW.

This can be really useful to efficiently compute distances between multiple time series and a collection of prototypes.

bilelomrani1 avatar Aug 08 '22 19:08 bilelomrani1

Indeed, that specific setting was not considered when designing this toolbox. While it is a good suggestion, this would require quite some refactoring. You probably already thought of this, but our goto solution in case of prototypes would be to parallelize over the prototypes manually (see e.g. the k-means implementation).

Two simpler alternatives could be to use either numpy concat or to make the toolbox think it sees one list. And then use the block argument. Both are compatible with the fast C implementation.

Assume two sets of series you want to compare:

nb = 3
size = 10
np.random.seed(seed=3980)
s1 = np.random.random((nb,size))
s2 = np.random.random((nb,size))

First alternative (with the disadvantage that it requires copying all data to one data structure):

s = np.vstack((s1, s2))
m = dtaidistance.dtw.distance_matrix(s, block=((0,nb), (nb,2*nb)),
                                     compact=False, use_c=True)

Second alternative:

class MyList(list):
    def __init__(self, s1, s2):
        self.s1 = s1
        self.s2 = s2
    def __len__(self):
        return len(self.s1) + len(self.s2)
    def __getitem__(self, idx):
        if idx < len(self.s1):
            return self.s1[idx]
        idx -= len(self.s1)
        return self.s2[idx]
s = MyList(s1, s2)
m = dtaidistance.dtw.distance_matrix(s, block=((0,nb), (nb,2*nb)),
                                     compact=False, use_c=True)

The overhead of the second solution should not be too bad. The additional memory requirements are a lot less than storing the actual data. If this second solution works well, we could consider adding this with some wrappers to make it easy to use different series. But some benchmarking would be necessary first.

wannesm avatar Aug 11 '22 22:08 wannesm

Thank you for this extensive answer, adding a wrapper sounds like a good idea. I'm leaving the issue opened just in case, but feel free to close it if you want. Thank you for your help.

bilelomrani1 avatar Aug 12 '22 11:08 bilelomrani1

How would one properly format a matrix of (lat, long) coordinates for the dtw.distance_matrix_fast with block? Let's say, I have a bunch of same-length series in a numpy matrix of size (num_trajectories, time_steps, 2). Unsure of how to adapt the above examples. Thanks!

jenghub avatar Sep 12 '22 15:09 jenghub

In this case you can probably use this functionality directly (assuming the series you want to use are consecutive): https://dtaidistance.readthedocs.io/en/latest/usage/dtw.html#dtw-between-multiple-time-series-limited-to-block

In case the series you want to compare are scattered in your numpy matrix, you need to reorder. Suppose you want to compare series 0,2,4 with 1,3,4, this would be something like:

series_reordered = series[[0,2,4,1,3,5]]
dtw_ndim.distance_matrix_fast(series_reordered, block=((0,3),(3,6)))

Unrelated to DTW, using Euclidean distance (used within DTW) for lat/lon might give suboptimal results. For example, one degree latitude is about 111km, but on degree longitude ranges from being 111km to 0km depending where you are on earth. If you observe strange results, consider re-projecting to a two-dimensional plane (e.g. Mercator): https://geopandas.org/en/stable/docs/user_guide/projections.html

wannesm avatar Sep 12 '22 18:09 wannesm

Thanks for the advice! Just to confirm, would I have to reshape a my array of shape (examples, time_step, 2) to (examples, time_step * 2)? The initial 3d matrix returns zeroes, and the second 2d matrix does return valid numbers when I use dtw.distance_matrix_fast with equal-sized arrays. I just want to make sure the lat, long (in whatever projection) is being properly ordered/used by the distance function.

jenghub avatar Sep 12 '22 21:09 jenghub

You should not reshape, but use the multi-dimensional version available in the dtw_ndim subpackage (thus not the one dimensional version in dtw). See also https://dtaidistance.readthedocs.io/en/latest/usage/dtw.html#multi-dimensionsal-dtw

wannesm avatar Sep 13 '22 12:09 wannesm

Gotcha. Have done that, but was wondering if there was to have the matrix function work with 2d. All good. Thank you!

jenghub avatar Sep 13 '22 15:09 jenghub