Description

This hobby project compares different methods for time series prediction on stock market data. The methods used are:

Autoregressive integrated moving average (ARIMA) model built from scratch
SARIMAX model from the statsmodels library
LSTM model
Attention-based Transformer model

The aim of the project is to demonstrate how to build and use these different models on time series and to show how well they perform.

Let's begin with some analysis of the data! :)

1. Data analysis

Some questions of interest in time series analysis are

What is the overall behavioural trend?
Is there any recurrent seasonality involved?
How much noise is there?
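
These three questions map onto the trend, seasonal and residual parts of a classical additive decomposition. The snippet below is a hand-rolled sketch in plain Python to show the idea; it is not the statsmodels routine, the name `decompose` is made up for illustration, and it assumes an odd `period` and a series at least two periods long.

```python
def decompose(y, period):
    """Additive split y = trend + seasonal + residual (odd `period` assumed)."""
    half = period // 2
    n = len(y)

    # Trend: centred moving average; undefined (None) at the edges.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(y[t - half : t + half + 1]) / period

    # Seasonal: mean of the detrended values at each phase of the cycle.
    detrended = [(t, y[t] - trend[t]) for t in range(half, n - half)]
    phase_mean = []
    for p in range(period):
        vals = [d for t, d in detrended if t % period == p]
        phase_mean.append(sum(vals) / len(vals))
    centre = sum(phase_mean) / period  # force the seasonal part to average zero
    seasonal = [phase_mean[t % period] - centre for t in range(n)]

    # Residual (noise): whatever trend and seasonality do not explain.
    residual = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual
```

A large residual relative to the other two parts suggests a noisy series.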

In Analysis.py we use the statsmodels API to investigate these questions. The results for the Amazon stock data (AMZN) are the following:

Analysis

As seen, there is clear seasonal behaviour in the data. This is unwanted, so the data is made stationary by logarithmic differencing as

$y'_t = \log{\left(\frac{y_t}{y_{t-1}} \right )}-y_{t-i}$

where i is the differencing range, here set to 12. To check that the data is stationary after the transformation, the Augmented Dickey-Fuller test is performed. The result of the transformation is seen in the following

Stationary

where it is seen that the data has both stationary mean and stationary variance.

The inverse transformation to obtain the original values from the stationary data is

$y_t = y_{t-1}e^{y'_{t}+y_{t-i}}$
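
As a quick sanity check, the forward and inverse transformations can be written directly from the two formulas above. This is a minimal sketch in plain Python; the function names are made up for illustration, and it assumes the first i original values are kept as a seed so the inverse has something to start from.

```python
import math


def to_stationary(y, i=12):
    """Apply y'_t = log(y_t / y_{t-1}) - y_{t-i} for every t >= i."""
    return [math.log(y[t] / y[t - 1]) - y[t - i] for t in range(i, len(y))]


def from_stationary(y_prime, seed, i=12):
    """Invert with y_t = y_{t-1} * exp(y'_t + y_{t-i}).

    `seed` holds the first i original values (an assumption of this sketch).
    """
    y = list(seed)
    for t, yp in enumerate(y_prime, start=i):
        y.append(y[t - 1] * math.exp(yp + y[t - i]))
    return y
```

Feeding the output of `to_stationary` and the first 12 prices into `from_stationary` should give back the original series up to floating-point error.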

2. ARIMA model

The first model used for the predictions is the autoregressive integrated moving average (ARIMA) model in ARIMA.py. The ARIMA model regresses the variable of interest on its own prior (lagged) values, while the regression error is a linear combination of error terms from earlier time steps. The formula for the ARIMA model is

$Y_{t}-r _{1}Y_{t-1}-\dots -r _{p'}Y_{t-p'}=e _{t}+M _{1}e _{t-1}+\cdots +M _{q}e _{t-q}$

where Y are the data values, r are the autoregressive parameters, M are the parameters of the moving average and e are the error terms. Furthermore, p' is the number of time lags of the autoregressive model and q is the order of the moving average model.
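
To see how the formula is used once the parameters are known, it can be solved for e_t and then turned around to give a one-step forecast. The sketch below is not the code in ARIMA.py; the function names are hypothetical, the parameters r and M are taken as given rather than fitted, and values before the start of the series are assumed to be zero.

```python
def _lagged_sum(coeffs, values, t):
    """Sum coeffs[k] * values[t-1-k], skipping lags before the series starts."""
    return sum(c * values[t - 1 - k] for k, c in enumerate(coeffs) if t - 1 - k >= 0)


def arma_residuals(Y, r, M):
    """Solve the model equation for e_t at every time step."""
    e = []
    for t in range(len(Y)):
        e.append(Y[t] - _lagged_sum(r, Y, t) - _lagged_sum(M, e, t))
    return e


def arma_forecast(Y, e, r, M):
    """One-step-ahead forecast; the unknown next error is replaced by zero."""
    t = len(Y)
    return _lagged_sum(r, Y, t) + _lagged_sum(M, e, t)
```

Fitting the model then amounts to choosing r and M so that these residuals are as small as possible.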

The test results of the ARIMA model are

arimares

and the density plot of the residuals is

density

3. SARIMAX

The second model is the Seasonal Autoregressive Integrated Moving Average with Exogenous regressors (SARIMAX) model, taken from the statsmodels library. Using the pmdarima library, approximately optimal values of the orders p, d and q for the SARIMAX model can be found. The test result of the SARIMAX model with orders (p, d, q) = (1, 1, 1) on the Amazon stock is

sarimax

4. LSTM

Now to the fun bits of machine learning! :D

4.1 Preprocessing

For the machine learning methods, some additional information about the data is added in the form of technical indicators. The first indicator is simply the daily return, which is the current closing price minus the previous closing price. The second indicator is the rate of change (ROC), which is the daily change expressed as a percentage. The third indicator is Williams %R, which measures overbought and oversold levels of the stock. Williams %R is calculated as

$WR = \frac{HH_{t-n} - C_t}{HH_{t-n}-LL_{t-n}}$

where HH is the highest high within the previous n periods, C is the closing price and LL is the lowest low within the previous n periods.
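
The first three indicators are short enough to write out by hand. The following is a rough sketch in plain Python with made-up function names, not the project's own preprocessing code. It assumes the window for Williams %R is the last n bars including the current one, and it follows the formula above as written, so the value lies between 0 and 1.

```python
def daily_return(close):
    """Current close minus previous close."""
    return [close[t] - close[t - 1] for t in range(1, len(close))]


def rate_of_change(close):
    """Daily change as a percentage of the previous close."""
    return [100.0 * (close[t] - close[t - 1]) / close[t - 1]
            for t in range(1, len(close))]


def williams_r(high, low, close, n=14):
    """(HH - C) / (HH - LL) over a rolling window of n bars.

    Assumes HH != LL in every window; a flat window would divide by zero.
    """
    out = []
    for t in range(n - 1, len(close)):
        hh = max(high[t - n + 1 : t + 1])  # highest high in the window
        ll = min(low[t - n + 1 : t + 1])   # lowest low in the window
        out.append((hh - close[t]) / (hh - ll))
    return out
```

Each function returns a list that is shorter than its input, because the first values have no full history to look back on.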

Next, the money flow index (MFI) was computed, which measures the buying and selling pressure based on the volume of the stock. It is computed as

$\begin{align*} MFI &= 100 - \frac{100}{1+MFR},\\ MFR &= \frac{\sum_{i=t-n}^{t}(RMF_i)_+}{\sum_{i=t-n}^t(RMF_{i})_-},\\ RMF &= TP \cdot V,\\ TP &= \frac{H + L + C}{3} \end{align*}$

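where MFI is the money flow index, MFR is the money flow ratio, RMF is the raw money flow, TP is the typical price, V is the volume and H, L, C are the high, low and closing prices. The subscripts + and - select the positive and negative money flows, respectively.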
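
The MFI formulas translate into a few lines of plain Python. This is only a sketch with a hypothetical function name. It assumes the usual sign convention, which the formulas above do not spell out: a day's raw money flow counts as positive when the typical price rose compared with the previous day and as negative when it fell.

```python
def money_flow_index(high, low, close, volume, n=14):
    """Money flow index over a rolling window of n bars."""
    tp = [(h + l + c) / 3 for h, l, c in zip(high, low, close)]  # typical price
    rmf = [p * v for p, v in zip(tp, volume)]                    # raw money flow
    out = []
    for t in range(n, len(tp)):
        window = range(t - n + 1, t + 1)
        pos = sum(rmf[i] for i in window if tp[i] > tp[i - 1])
        neg = sum(rmf[i] for i in window if tp[i] < tp[i - 1])
        if neg == 0:
            out.append(100.0)  # no selling pressure in the window
        else:
            out.append(100.0 - 100.0 / (1.0 + pos / neg))
    return out
```

Values close to 100 indicate strong buying pressure and values close to 0 indicate strong selling pressure.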

The Ulcer index was also used. It is a volatility indicator that measures the downside risk in terms of the depth and duration of price declines. The Ulcer index is computed as

$\begin{align*} UI_t &= \sqrt{SA_t},\\ SA_t &= \frac{1}{n}\sum_{i=t-n}^t (PD_i)^2,\\ PD_t &= 100 \hspace{1mm}\frac{C_t-HH_{t-n}}{HH_{t-n}} \end{align*}$

where UI is the Ulcer index, SA is the squared average and PD is the percentage drawdown.
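
The three steps of the Ulcer index can be chained as below. This is an illustrative sketch with a hypothetical function name. It assumes that both the highest high and the squared average use a rolling window of the last n bars including the current one.

```python
import math


def ulcer_index(high, close, n=14):
    """Ulcer index: root of the mean squared percentage drawdown."""
    # Percentage drawdown of each close from the highest high in its window.
    pd = []
    for t in range(n - 1, len(close)):
        hh = max(high[t - n + 1 : t + 1])
        pd.append(100.0 * (close[t] - hh) / hh)

    # Squared average over n drawdowns, then the square root.
    return [
        math.sqrt(sum(p * p for p in pd[k - n + 1 : k + 1]) / n)
        for k in range(n - 1, len(pd))
    ]
```

Because two windows are stacked, the first value only appears after 2n - 1 bars.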

The next indicator added was the average true range (ATR), which measures the market volatility. For a period n it is computed from the asset's price range as

$\begin{align*} ATR &= \frac{1}{n}\sum_{i=t-n}^tTR_i,\\ TR_i &= \max{\left(H_i-L_i, |H_i-C_{i-1}|, |L_i-C_{i-1}| \right )} \end{align*}$

where H_i is the highest asset price at time i, L_i is the lowest asset price at time i, C_{i-1} is the closing price of the previous time step and TR is the true range.
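
A small sketch of the ATR in plain Python follows. The function name is hypothetical. Note the assumption: the true range here compares the high and low with the previous close, which is the common convention, since comparing with the same bar's close would always give back H - L.

```python
def average_true_range(high, low, close, n=14):
    """Mean of the true range over a rolling window of n bars."""
    # True range needs the previous close, so it starts at the second bar.
    tr = [
        max(
            high[t] - low[t],
            abs(high[t] - close[t - 1]),
            abs(low[t] - close[t - 1]),
        )
        for t in range(1, len(close))
    ]
    return [sum(tr[k - n + 1 : k + 1]) / n for k in range(n - 1, len(tr))]
```

A price gap between two bars shows up in the true range even when the bar itself is narrow.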

The final two indicators are the simple moving average (SMA) and the exponential moving average (EMA).
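
For completeness, here is a minimal sketch of the two moving averages in plain Python with made-up function names. The EMA assumes the common smoothing factor 2 / (n + 1) and starts from the first value of the series; other conventions exist.

```python
def sma(x, n):
    """Simple moving average over the last n values."""
    return [sum(x[t - n + 1 : t + 1]) / n for t in range(n - 1, len(x))]


def ema(x, n):
    """Exponential moving average with smoothing factor 2 / (n + 1)."""
    alpha = 2.0 / (n + 1)
    out = [x[0]]  # seed with the first observation
    for value in x[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out
```

The SMA weights all n values equally, while the EMA gives more weight to recent values.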

4.2 Results

The result of using the LSTM model with the above indicators as additional input features is

lstm

lstmz

5. Transformer

The original paper on the attention-based Transformer model was published in 2017, and since then its popularity has exploded. The model in transformer.py follows the implementation from the original paper, but only the encoder of the Transformer is used here. The model uses positional encoding to give each time step a notion of its place in the sequence, while the attention weights express which inputs the model should pay more attention to at each time step. Together with multi-head scaled dot-product attention layers, residual connections and a feed-forward linear bottleneck, this forms the Transformer encoder. The results for the Transformer model, using the same additional indicators as for the LSTM, are

transformer

transformerz
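
To make the two core ingredients of the encoder a bit more tangible, here is a bare-bones sketch of sinusoidal positional encoding and single-head scaled dot-product attention in plain Python. It is not the code in transformer.py: the function names are made up, there are no learned weights, and multi-head attention, residual connections and the feed-forward part are left out.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d_model):
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights
```

Each row of `weights` sums to one and says how much that time step attends to every other time step.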
