TimeSeries.max(axis=0), .min(axis=0) return a meaningless index
Describe the missing or desired feature
TimeSeries.max(axis=0) and TimeSeries.max(axis=0) return a single-step timeseries, with the TimeIndex corresponding to the first timestep of the original series. This is documented correctly but is not intuitive. This seems to be done for consistency with Timeseries.mean(axis=0), median(axis=0), etc. There also does not seem to be a TimeSeries equivalent of Series.idxmax or np.ndarray.argmax.
Describe proposed solution
Chage max() and min() to return a single-step timeseries with the index of the returned value. This would be a breaking change obviously, though I cannot immediately imagine code depending on the previous behavior. Alternatively, add a keyword argument to enable this behavior, defaulting to current behavior.
Describe potential alternatives
Or alternatively, add Timeseries.idxmax/min/etc.
Hi @eschibli , and thanks for raising this issue. I see your point, and do agree that it's misleading to even see a time index there.
This indeed makes sense for the univariate case. For the mulitvariate case however, it's not possible to return the correct time index for each component's max/min value. See below where "a" and "b" would have different idxmax (0 for "a", and 2 for "b").
>>> import pandas as pd
>>> from darts import TimeSeries
>>> df = pd.DataFrame({"a": [1,0,0], "b": [0,0,1]})
>>> series = TimeSeries.from_dataframe(df)
>>> series.max(axis=0)
<TimeSeries (DataArray) (time: 1, component: 2, sample: 1)> Size: 16B
array([[[1.],
[1.]]])
Coordinates:
* time (time) int64 8B 0
* component (component) object 16B 'a' 'b'
Dimensions without coordinates: sample
Attributes:
static_covariates: None
hierarchy: None
We cannot create a single-rowed TimeSeries from that case, since we need a time index with a well defined frequency (no missing dates).
As long as we can't find a coherent solution for all cases, I believe what we have at the moment is okayish.
Internally, we actually only use axis 1 and 2, since this will return a TimeSeries of the same time index as the input.
We allow axis=0 as it's possible to get the min/max values, but mention in the docs that the index will not represent the actual idxmin/idxmax.
For idxmin/idxmax I'd suggeset relying on other backends:
>>> series.pd_dataframe().idxmax()
component
a 0
b 2
dtype: int64
Why don't we add an index_from argument? This could default to None for current behavior, but accept an integer or string index for the component to which the index should correspond?
Hmm.. That would be a corner case that only applies for axis=0. What do you think @madtoinou ?
I have the impression that this kind of analysis is probably expected to be performed at the data-frame level, prior to the conversion to TimeSeries.
We could eventually improve the logic for max/min, as the corresponding time-step is well defined in the case of univariate series but for the multivariate one, I would leave things as is (since arbitrarily picking an index over another feels a bit too specific and actually not really meaningful for downstream forecasting for example) or eventually return a list of timeseries (one per component) in order to make is usable for other tasks but it feels a bit overly specific.
Alright, I understand the suggestion to use DataFrames for analysis (and I admit I initially didn't consider the case of multivariate TimeSerieses.) But on further reflection I still think the current behavior of returning the first timestep is too misleading. Users may not read documentation for a function as apparently-self-explainatory as .max or .min, so returning a plausible-but-wrong index for the monovariate case seems very dangerous. I would strongly prefer a change to avoid that.
We could either return the index of the first column's value, or alternatively return an index of integer zero. Returning a np.array seems more canonical but would be a breaking change obviously.