
normalizing in Multi-Variate (MV) Time Series


Hello everyone,

I have a question about normalizing (or standardizing) MV time series data, and I think this is going to be a long post.

Let's start with a simple concept. If I have a plain (non-time-series) dataset with p numerical features, it is usually better to standardize each feature individually (i.e., (x - mu) / std) so that no single feature dominates the others when computing distances.

Now, in the time series case, let's say I have MV time series data with the tslearn shape (n_ts, sz, d). As explained by @rtavenar in the closed issue #142, the tslearn library uses the dependent mode when computing the DTW distance between MV time series.
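For reference, my understanding is that a call like the one below uses the dependent formulation, i.e. the per-timestamp cost is computed over all d dimensions jointly (please correct me if I am wrong; the data is just hypothetical):

```python
import numpy as np
from tslearn.metrics import dtw

# two hypothetical multivariate series, each of shape (sz, d)
s1 = np.random.rand(24, 2)
s2 = np.random.rand(24, 2)

# if I understand issue #142 correctly, this computes dependent DTW:
# the cost at each pair of timestamps is taken over both dimensions at once
dist = dtw(s1, s2)
```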

Let's think of it like this: our dataset has n_ts samples with d features: F_1, F_2, ..., F_d. In the non-time-series case, we standardize each feature. Right?

How can we do this for each feature in the time series case? Let's say we want to standardize F_1; we have n_ts samples of size sz each.
One way is:
Z = cov^(-1/2) * (X - mu), where cov is the (sz, sz) covariance matrix of the F_1 time series, X is one sample of that feature (a vector of size sz), and mu is the mean vector (of size sz).

based on: https://www.statlect.com/probability-distributions/multivariate-normal-distribution#:~:text=Exercise%202-,The%20standard%20multivariate%20normal%20distribution,equal%20to%20the%20identity%20matrix.
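A minimal sketch of this whitening idea, in case it helps (assuming numpy, with purely hypothetical data):

```python
import numpy as np

# hypothetical data for a single feature F_1: n_ts samples, each of length sz
n_ts, sz = 365, 24
X = np.random.rand(n_ts, sz)

mu = X.mean(axis=0)              # mean vector of length sz
cov = np.cov(X, rowvar=False)    # (sz, sz) covariance across samples

# inverse square root of the covariance matrix via eigendecomposition
eigvals, eigvecs = np.linalg.eigh(cov)
cov_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# whiten each sample: Z_i = cov^(-1/2) @ (X_i - mu)
Z = (X - mu) @ cov_inv_sqrt.T
```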

But I tried it and it doesn't make sense. The reason might be that such an approach assumes the sz variables within each time series are i.i.d., which is not true for time series data.

(1) Is there any other way to standardize the time series data in each dimension? Or does it even make sense to think about it that way? If not, how can we make sure that one dimension doesn't dominate the others?

(2) Should we simply divide the time series of each dimension by the maximum value in that dimension? (Assuming the values are positive, this brings the range to [0, 1].)

(3) Or, could we simply reshape the time series of each dimension into a 1d array, standardize it, and reshape it back to (n_ts, sz)? (And do this for each of the d dimensions.) In this approach, the overall std of each dimension affects the distances in the MV time series. Am I correct? (A small numpy sketch of approaches (2) and (3) is below.)
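Just to make sure I'm describing (2) and (3) clearly, this is what I mean (a rough sketch, assuming the data is a numpy array of shape (n_ts, sz, d)):

```python
import numpy as np

# hypothetical (n_ts, sz, d) dataset
X = np.random.rand(365, 24, 2)

# approach (2): divide each dimension by its maximum over the whole dataset
X_maxscaled = X / X.max(axis=(0, 1), keepdims=True)

# approach (3): z-normalize each dimension using that dimension's global mean/std
# (equivalent to flattening each dimension to 1d, standardizing, and reshaping back)
mu = X.mean(axis=(0, 1), keepdims=True)
std = X.std(axis=(0, 1), keepdims=True)
X_znormalized = (X - mu) / std
```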

(4) This question is not directly related to the ones above, but it is related to standardizing EACH SAMPLE of the time series data.

According to Dr. Keogh's paper (https://www.cs.ucr.edu/~eamonn/Data_Mining_Journal_Keogh.pdf) and his response to a post here (https://datascience.stackexchange.com/questions/16034/dtw-dynamic-time-warping-requires-prior-normalization/16035):

If I understand correctly, they want to standardize EACH SAMPLE of the time series. For instance, if we have 1-dim time series data of shape (n_ts, sz, 1), we take each time series sample of size sz (let's call it T) and standardize it as: T_znormalized = (T - mean(T)) / std(T).
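In code, I believe this per-sample z-normalization looks like the following (and, if I'm reading the docs correctly, tslearn's TimeSeriesScalerMeanVariance does something similar, so please correct me if that's not the case):

```python
import numpy as np

def znormalize(T):
    # z-normalize one time series sample of length sz
    return (T - np.mean(T)) / np.std(T)

# hypothetical 1-dim dataset of shape (n_ts, sz, 1)
X = np.random.rand(50, 24, 1)
X_znorm = np.array([znormalize(T) for T in X])

# my understanding of the tslearn equivalent (please confirm):
# from tslearn.preprocessing import TimeSeriesScalerMeanVariance
# X_znorm = TimeSeriesScalerMeanVariance().fit_transform(X)
```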

Am I right? If yes, does anyone know of a paper that studies whether DTW-based analysis makes sense when we don't standardize the time series in cases where magnitude matters? (E.g., in electricity demand, there might be a temporal shift in daily usage, but we cannot simply standardize each daily pattern since the average (and the magnitude) of usage matters.)

Let's say I have MV time series data with 2 dimensions, where each dimension is the record of daily usage of one particular household over one year, so my dataset's shape is (365, 24, 2). Here is the important part: one household is small with a peak usage of 10, and the other is big with a peak usage of 100. If I divide the time series of each dimension by the maximum value in that dimension and then cluster the data, two points that are close in the scaled first dimension are also close in their actual values, but the same small scaled difference corresponds to a much larger difference in actual values in the second dimension. So, does this mean that the standardizing approach might depend on the application at hand? (Or should I just divide the whole dataset by the maximum value of the whole dataset and call it a day?) A small numeric sketch of this concern is below.
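To make the concern concrete (with hypothetical peaks of 10 and 100):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical dataset: 365 days, 24 hourly readings, 2 households
X = np.stack([10 * rng.random((365, 24)),     # small household, peak around 10
              100 * rng.random((365, 24))],   # big household, peak around 100
             axis=-1)                         # shape (365, 24, 2)

# per-dimension max scaling: a scaled difference of 0.1 corresponds to
# roughly 1 unit of actual usage in dim 1 but roughly 10 units in dim 2
X_per_dim = X / X.max(axis=(0, 1), keepdims=True)

# global max scaling: the same scaled difference means the same actual usage
# in both dimensions, but dim 1 gets squeezed into roughly [0, 0.1]
X_global = X / X.max()
```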

Sorry for such a long post. I Googled these questions and read papers and forum posts, but I couldn't figure it out. If you know of a good paper that covers these topics, I would appreciate it if you could share its title.

Best, Nima

NimaSarajpoor, Sep 02 '20