darts icon indicating copy to clipboard operation
darts copied to clipboard

TimeSeries.max(axis=0), .min(axis=0) return a meaningless index

Open eschibli opened this issue 10 months ago • 5 comments

Describe the missing or desired feature TimeSeries.max(axis=0) and TimeSeries.max(axis=0) return a single-step timeseries, with the TimeIndex corresponding to the first timestep of the original series. This is documented correctly but is not intuitive. This seems to be done for consistency with Timeseries.mean(axis=0), median(axis=0), etc. There also does not seem to be a TimeSeries equivalent of Series.idxmax or np.ndarray.argmax.

Describe proposed solution Chage max() and min() to return a single-step timeseries with the index of the returned value. This would be a breaking change obviously, though I cannot immediately imagine code depending on the previous behavior. Alternatively, add a keyword argument to enable this behavior, defaulting to current behavior.

Describe potential alternatives Or alternatively, add Timeseries.idxmax/min/etc.

eschibli avatar Feb 25 '25 19:02 eschibli

Hi @eschibli , and thanks for raising this issue. I see your point, and do agree that it's misleading to even see a time index there.

This indeed makes sense for the univariate case. For the mulitvariate case however, it's not possible to return the correct time index for each component's max/min value. See below where "a" and "b" would have different idxmax (0 for "a", and 2 for "b").

>>> import pandas as pd
>>> from darts import TimeSeries
>>> df = pd.DataFrame({"a": [1,0,0], "b": [0,0,1]})
>>> series = TimeSeries.from_dataframe(df)
>>> series.max(axis=0)
<TimeSeries (DataArray) (time: 1, component: 2, sample: 1)> Size: 16B
array([[[1.],
        [1.]]])
Coordinates:
  * time       (time) int64 8B 0
  * component  (component) object 16B 'a' 'b'
Dimensions without coordinates: sample
Attributes:
    static_covariates:  None
    hierarchy:          None

We cannot create a single-rowed TimeSeries from that case, since we need a time index with a well defined frequency (no missing dates).

As long as we can't find a coherent solution for all cases, I believe what we have at the moment is okayish.

Internally, we actually only use axis 1 and 2, since this will return a TimeSeries of the same time index as the input. We allow axis=0 as it's possible to get the min/max values, but mention in the docs that the index will not represent the actual idxmin/idxmax.

For idxmin/idxmax I'd suggeset relying on other backends:

>>> series.pd_dataframe().idxmax()
component
a    0
b    2
dtype: int64

dennisbader avatar Feb 28 '25 08:02 dennisbader

Why don't we add an index_from argument? This could default to None for current behavior, but accept an integer or string index for the component to which the index should correspond?

eschibli avatar Feb 28 '25 17:02 eschibli

Hmm.. That would be a corner case that only applies for axis=0. What do you think @madtoinou ?

dennisbader avatar Mar 01 '25 11:03 dennisbader

I have the impression that this kind of analysis is probably expected to be performed at the data-frame level, prior to the conversion to TimeSeries.

We could eventually improve the logic for max/min, as the corresponding time-step is well defined in the case of univariate series but for the multivariate one, I would leave things as is (since arbitrarily picking an index over another feels a bit too specific and actually not really meaningful for downstream forecasting for example) or eventually return a list of timeseries (one per component) in order to make is usable for other tasks but it feels a bit overly specific.

madtoinou avatar Mar 03 '25 10:03 madtoinou

Alright, I understand the suggestion to use DataFrames for analysis (and I admit I initially didn't consider the case of multivariate TimeSerieses.) But on further reflection I still think the current behavior of returning the first timestep is too misleading. Users may not read documentation for a function as apparently-self-explainatory as .max or .min, so returning a plausible-but-wrong index for the monovariate case seems very dangerous. I would strongly prefer a change to avoid that.

We could either return the index of the first column's value, or alternatively return an index of integer zero. Returning a np.array seems more canonical but would be a breaking change obviously.

eschibli avatar Mar 03 '25 17:03 eschibli