Native Time Series and Forecasting Support (Sequence Learning)

Open andrewdalpino opened this issue 4 years ago • 10 comments

Time series analysis is a popular machine learning technique for forecasting trends of time-dependent variables such as stock price, GDP, and quarterly sales. Given the popularity (https://github.com/RubixML/RubixML/issues/35, https://github.com/RubixML/RubixML/issues/38, https://github.com/RubixML/RubixML/issues/40) and current lack of tooling within the PHP ecosystem, I propose adding native time series support as well as a new type of estimator class for forecasting time series datasets. This includes the following ...

  1. A data structure extending Dataset for time series datasets that includes an additional index for timestamps
  2. An additional estimator type "Forecaster" to predict the next k values in a series

There should be no need to modify any of the public interfaces to integrate these features into the current architecture.

Proposed initial Forecaster implementations:

  • ARIMA - AutoRegressive Integrated Moving Average (univariate)
  • VARMAX - Vector AutoRegressive Moving Average with eXogenous regressors (multivariate)
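To make the "AutoRegressive" part of these models concrete, here is a minimal sketch of a first-order autoregressive fit, x[t] = c + phi * x[t-1], using ordinary least squares. It is only one building block and not a full ARIMA (no differencing, no moving-average terms). The function names fitAr1() and forecastAr1() are made up for illustration and are not part of RubixML.

```php
<?php

declare(strict_types=1);

/**
 * Fit x[t] = c + phi * x[t-1] by ordinary least squares.
 * Hypothetical helper for illustration; not a RubixML function.
 *
 * @param list<float> $series Observations in time order.
 * @return array{c: float, phi: float}
 */
function fitAr1(array $series): array
{
    $n = count($series) - 1; // number of (previous, next) pairs

    if ($n < 2) {
        throw new InvalidArgumentException('Need at least three observations.');
    }

    $prev = array_slice($series, 0, $n);
    $next = array_slice($series, 1);

    $meanPrev = array_sum($prev) / $n;
    $meanNext = array_sum($next) / $n;

    $cov = $var = 0.0;

    for ($i = 0; $i < $n; ++$i) {
        $cov += ($prev[$i] - $meanPrev) * ($next[$i] - $meanNext);
        $var += ($prev[$i] - $meanPrev) ** 2;
    }

    if ($var == 0.0) {
        throw new InvalidArgumentException('Series is constant; phi is undefined.');
    }

    $phi = $cov / $var;

    return ['c' => $meanNext - $phi * $meanPrev, 'phi' => $phi];
}

/**
 * Roll the fitted model forward k steps, feeding each estimate back in.
 *
 * @param list<float> $series
 * @return list<float>
 */
function forecastAr1(array $series, int $k): array
{
    ['c' => $c, 'phi' => $phi] = fitAr1($series);

    $value = $series[array_key_last($series)];
    $forecasts = [];

    for ($step = 0; $step < $k; ++$step) {
        $value = $c + $phi * $value;
        $forecasts[] = $value;
    }

    return $forecasts;
}
```

Forecasting more than one step ahead reuses each estimate as the next input, so errors compound as k grows.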

Open to comments

andrewdalpino avatar Nov 11 '19 22:11 andrewdalpino

Yes, I would very much like those additions to the library. Thank you!

BasvanH avatar Nov 12 '19 06:11 BasvanH

Thanks for the input @BasvanH

Expanding on the aforementioned design outline ...

The TimeSeries dataset object will have additional sorting, filtering, etc. methods that operate on the timestamp column. These will be similar to how Labeled provides additional methods that operate on labels. The timestamp column will allow either homogeneous integer or DateTime object elements.

Since time series estimation often diverges between the univariate and multivariate cases, the TimeSeries dataset object will handle both cases simply by keeping track of the number of target variables (as already accomplished by the numColumns() method on the Dataset class). For example, a univariate TimeSeries dataset object has a single column, whereas a multivariate one has more than one column. It will be the responsibility of the estimator to check whether the incoming dataset is compatible.
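A rough sketch of what such a dataset object could look like is below. The class name TimeSeriesSketch and its methods are hypothetical and exist only to illustrate the idea; the sketch does not extend RubixML's Dataset, and it supports integer timestamps only (the DateTime case is left out).

```php
<?php

declare(strict_types=1);

// Hypothetical sketch only: this class is NOT part of RubixML.
final class TimeSeriesSketch
{
    /** @var list<list<float>> */
    private array $samples;

    /** @var list<int> */
    private array $timestamps;

    /**
     * @param list<list<float>> $samples One row per time step.
     * @param list<int> $timestamps One integer timestamp per row.
     */
    public function __construct(array $samples, array $timestamps)
    {
        if (count($samples) !== count($timestamps)) {
            throw new InvalidArgumentException('Each sample needs exactly one timestamp.');
        }

        $this->samples = array_values($samples);
        $this->timestamps = array_values($timestamps);
    }

    // Number of target variables tracked per time step.
    public function numColumns(): int
    {
        return isset($this->samples[0]) ? count($this->samples[0]) : 0;
    }

    public function isUnivariate(): bool
    {
        return $this->numColumns() === 1;
    }

    // Return a new dataset with rows ordered by ascending timestamp.
    public function sortByTimestamp(): self
    {
        $order = array_keys($this->timestamps);

        usort($order, fn (int $a, int $b): int => $this->timestamps[$a] <=> $this->timestamps[$b]);

        $samples = $timestamps = [];

        foreach ($order as $i) {
            $samples[] = $this->samples[$i];
            $timestamps[] = $this->timestamps[$i];
        }

        return new self($samples, $timestamps);
    }

    public function samples(): array
    {
        return $this->samples;
    }

    public function timestamps(): array
    {
        return $this->timestamps;
    }
}
```

An estimator that only supports the univariate case could call isUnivariate() on the incoming dataset and throw if it returns false, which matches the idea that compatibility checks belong to the estimator.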

As mentioned previously, the public Estimator API will not change with the introduction of the new estimator type. In the case of forecasters, the output of the predict() method will be the estimate of the next value given the last value in a series. The interpretation of the dataset is therefore slightly different at inference than during training, in which the dataset is interpreted as both contiguous and atomic. During inference, each sample will be considered independently, and its value will be interpreted as either the empirical or theoretical last value of a time series the user would like to start inferring from. Since forecasters are estimators at heart, they benefit from all the additional tooling such as meta-estimators and the cross validation framework.

In addition, we will add the Forecaster interface, allowing estimators to implement the forecast() method which, unlike predict(), will estimate the next k values starting at a given offset. It is assumed that most forecaster types will implement the Forecaster interface, as prediction (as defined above) is only a special case of forecasting where k=1. There are currently two prototypes for the forecast() method signature to consider. The first borrows the idea of start and end from the statsmodels library (see their predict API). The second is to use the timestamp of the TimeSeries dataset object as the start and then output the next k subsequent values. The differences look like this ...

public function forecast(TimeSeries $dataset, $start, $end) : array

vs.

public function forecast(TimeSeries $dataset, int $k) : array

So far I personally prefer the latter case.
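To illustrate the second prototype, here is a minimal sketch. Everything in it is an assumption made for illustration: the interface ForecasterSketch, the class NaiveForecaster, and the method predictNext() are hypothetical names, and a plain array of floats stands in for the proposed TimeSeries object. The forecaster itself is a simple persistence baseline (repeat the last observation), not ARIMA or VARMAX.

```php
<?php

declare(strict_types=1);

// Hypothetical interface mirroring the forecast($dataset, int $k) prototype.
// It is NOT RubixML's API; an array stands in for the TimeSeries object.
interface ForecasterSketch
{
    /**
     * @param list<float> $series Univariate observations in time order.
     * @return list<float> The next $k estimated values.
     */
    public function forecast(array $series, int $k): array;
}

// Persistence baseline: every future value equals the last observation.
final class NaiveForecaster implements ForecasterSketch
{
    public function forecast(array $series, int $k): array
    {
        if ($series === [] || $k < 1) {
            throw new InvalidArgumentException('Need a non-empty series and k >= 1.');
        }

        $last = (float) $series[array_key_last($series)];

        return array_fill(0, $k, $last);
    }

    // One-step prediction is the special case k = 1.
    public function predictNext(array $series): float
    {
        return $this->forecast($series, 1)[0];
    }
}
```

With this shape the caller never has to compute start and end positions; the end of the series is the implicit starting point, and k alone controls the horizon.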

As with the Learner, Probabilistic, and Ranking interfaces, the Forecaster interface will also include the forecastSample() method to handle inference on single samples at a time.

Open to comments

andrewdalpino avatar Nov 13 '19 22:11 andrewdalpino

Update:

Since we are in a feature freeze, this enhancement will be moved over to the Extras package for the time being and may be integrated into the main package afterward.

andrewdalpino avatar Apr 11 '20 00:04 andrewdalpino

Hi! Sorry for commenting on a closed issue.

The comment said that it's been moved to the Extras package, which is understandable. However, is the idea that it will be moved there, or is it already there?

Regardless, I much appreciate all the hard work that has been put into RubixML, just curious. 😄

LasseRafn avatar Feb 12 '21 23:02 LasseRafn

Hello, I would also like to know the status here. I would like to test forecasting for an idea on my side.

Thank you

Rello avatar Mar 17 '21 12:03 Rello

Hello @LasseRafn and @Rello, thanks for commenting. I'll give an update and we'll reopen this issue to keep the discussion going.

We haven't gotten around to implementing time series in ML or Extras yet. Although we have plenty of research planned in regard to sequence learning, we have no immediate plans to implement these features at this time. That said, we're seeing an uptick in contributions, so it's possible that someone from the community can take on this effort.

andrewdalpino avatar Mar 17 '21 22:03 andrewdalpino

Could a simpler sequence implementation be faster to implement first?

For example, dataset:

[0,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,1,1,1,1,1,1,1,0,1,0,0,0,0]

I see in this data that 1 is more likely to be followed by 1, and 0 is more likely to be followed by 0. The more 1s or 0s there are in a row, the more likely the next value is to be the same. Maybe there are other patterns too. If a human can see this pattern, maybe ML could too (and state the confidence).
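One simple way to check the first observation is to count transitions between consecutive values, which amounts to a first-order Markov estimate. The sketch below is only an illustration: transitionProbabilities() is a made-up helper rather than a RubixML function, and the sample input is just the first 16 values of the sequence above. It does not capture the run-length effect, which would need a longer history than one previous value.

```php
<?php

declare(strict_types=1);

/**
 * Estimate P(next = b | current = a) for a binary sequence by counting
 * transitions. Hypothetical helper for illustration; not a RubixML function.
 *
 * @param list<int> $sequence Values must be 0 or 1.
 * @return array<int, array<int, float>> Indexed as $probs[a][b].
 */
function transitionProbabilities(array $sequence): array
{
    $counts = [0 => [0 => 0, 1 => 0], 1 => [0 => 0, 1 => 0]];

    // Tally each (current, next) pair of neighbouring values.
    for ($i = 0, $n = count($sequence) - 1; $i < $n; ++$i) {
        ++$counts[$sequence[$i]][$sequence[$i + 1]];
    }

    $probs = [];

    foreach ($counts as $current => $row) {
        $total = array_sum($row);

        foreach ($row as $next => $count) {
            // Fall back to 0.5 when a state was never followed by anything.
            $probs[$current][$next] = $total > 0 ? (float) $count / $total : 0.5;
        }
    }

    return $probs;
}

// First 16 values of the sequence from the comment above.
$sequence = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0];

$probs = transitionProbabilities($sequence);
$last = $sequence[array_key_last($sequence)];

printf("P(next = 1 | current = %d) = %.2f\n", $last, $probs[$last][1]);
```

The estimated probability doubles as a crude confidence for the next value, given only the current one.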

mindaugasdi avatar Aug 06 '21 11:08 mindaugasdi

Hi guys! Any news about this feature?

Thank you!

itrack avatar Sep 13 '22 09:09 itrack

Hi @itrack. There's still talk about implementing VAR (vector autoregression) and LSTM. Nothing material has come about yet though. It's not that there's not enough want for sequence learning but that we really don't have the resources right now. Hopefully, we can attract more interest from the community.

andrewdalpino avatar Sep 13 '22 22:09 andrewdalpino

Have there been any new developments here in the meantime? I would also be interested in a time series forecast.

ThomasW69 avatar Jun 15 '23 06:06 ThomasW69