StockPrediction
Hobby project in time series analysis of stock market data, comparing different methods for time series prediction, including ARIMA, SARIMAX, LSTM, and transformer.
Description
This hobby project compares different methods for time series prediction on stock market data. The methods used are
- Autoregressive integrated moving average (ARIMA) model built from scratch
- Implemented SARIMAX model
- LSTM model
- Attention-based Transformer model.
The aim of the project is to demonstrate how to build and use these different models on time series and to show how well they perform.
Let's begin with some analysis of the data! :)
1. Data analysis
Some questions of interest in time series analysis are
- What is the overall behavioural trend?
- Is there any recurrent seasonality involved?
- How much noise is there?
In Analysis.py we use the statsmodels API to analyse these questions. The results of this for the Amazon stock data (AMZN) are the following

As seen, there is some clear seasonal behaviour in the data. This is unwanted, and therefore the data is made stationary. This is done by logarithmic differencing, i.e. by subtracting the logarithm of the value i steps earlier from the logarithm of each value, where i is the differencing range, here set to 12. To check that the data is stationary after the transformation, the Augmented Dickey-Fuller test is performed. The result of the transformation is seen in the following
where it is seen that the data has both stationary mean and stationary variance.
The inverse transformation, which recovers the original values from the stationary data, adds back the logarithm of the value i steps earlier and exponentiates the result.
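The transformation and its inverse can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from Analysis.py; the function names and the made-up price series are assumptions for the example only. With x the original series and y the stationary one, it uses y[t] = log(x[t]) - log(x[t - i]) and x[t] = x[t - i] * exp(y[t]).

```python
import math


def log_difference(values, lag=12):
    """Return log(x[t]) - log(x[t - lag]) for t = lag, ..., len(values) - 1."""
    logs = [math.log(v) for v in values]
    return [logs[t] - logs[t - lag] for t in range(lag, len(logs))]


def inverse_log_difference(diffs, initial_values, lag=12):
    """Rebuild the original series from the differences and the first `lag` raw values."""
    restored = list(initial_values[:lag])
    for d in diffs:
        # x[t] = x[t - lag] * exp(y[t]); restored[-lag] is the value `lag` steps back
        restored.append(restored[-lag] * math.exp(d))
    return restored


# Made-up prices, for illustration only (not AMZN data)
prices = [100.0 + 2.0 * t for t in range(30)]
stationary = log_difference(prices, lag=12)
recovered = inverse_log_difference(stationary, prices, lag=12)
```

Note that the inverse needs the first i raw values of the series, since the differenced series is i points shorter than the original.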
2. ARIMA model
The first model used for the predictions is the autoregressive integrated moving average (ARIMA) model in ARIMA.py. The ARIMA model regresses the variable of interest on its own prior (lagged) values, while the regression error is a linear combination of error terms from previous time steps. The formula for the ARIMA model is
where Y are the data values, r are the autoregressive parameters, M are the parameters of the moving average and e are the error terms. Furthermore, p' is the number of time lags of the autoregressive model and q is the order of the moving average model.
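With the symbols defined above, the model can be written in the usual ARMA(p', q) form. This is a reconstruction that assumes the common sign convention, so the implementation in ARIMA.py may differ in sign or indexing:

```latex
Y_t - r_1 Y_{t-1} - \dots - r_{p'} Y_{t-p'}
  = e_t + M_1 e_{t-1} + \dots + M_q e_{t-q}
```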
The test results of the ARIMA model are

and the density plot of the residuals is

3. SARIMAX
The second model is the Seasonal Autoregressive Integrated Moving Average Exogenous (SARIMAX) model, which is taken from the statsmodels library. The pmdarima library can be used to find approximately optimal values of the orders p, d and q for the SARIMAX model. The test result of the SARIMAX model with orders (p, d, q) = (1, 1, 1) on the Amazon stock is

4. LSTM
Now to the fun bits of machine learning! :D
4.1 Preprocessing
For the machine learning methods, some additional information about the data is added in the form of technical indicators. The first indicator is the daily return, which is the current closing price minus the previous closing price. The second indicator is the rate of change (ROC), which is the daily change expressed as a percentage. The third indicator is the Williams %R indicator, which measures overbought and oversold levels of the stock. Williams %R is calculated as
where HH is the highest high within the previous n periods, C is the closing price and LL is the lowest low within the previous n periods.
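The first three indicators can be sketched in plain Python as follows. This is a minimal illustration, not the project's preprocessing code: the function names, the default n = 14, the choice to include the current day in the window, and the made-up prices are all assumptions. Williams %R is taken in its common form, -100 * (HH - C) / (HH - LL).

```python
def daily_return(close):
    """Current close minus previous close; the first day has no previous value."""
    return [None] + [close[t] - close[t - 1] for t in range(1, len(close))]


def rate_of_change(close):
    """Daily change as a percentage of the previous close."""
    return [None] + [
        100.0 * (close[t] - close[t - 1]) / close[t - 1]
        for t in range(1, len(close))
    ]


def williams_r(high, low, close, n=14):
    """Williams %R over a window of n periods ending at (and including) day t."""
    out = [None] * len(close)
    for t in range(n - 1, len(close)):
        hh = max(high[t - n + 1 : t + 1])  # highest high in the window
        ll = min(low[t - n + 1 : t + 1])   # lowest low in the window
        # Undefined when the window has no price range at all
        out[t] = -100.0 * (hh - close[t]) / (hh - ll) if hh != ll else None
    return out


# Made-up prices, for illustration only
high = [10.0, 12.0, 11.0, 13.0]
low = [8.0, 9.0, 9.0, 10.0]
close = [9.0, 11.0, 10.0, 12.0]
wr = williams_r(high, low, close, n=3)
```

Values of %R close to 0 mean the close is near the top of the recent range, and values close to -100 mean it is near the bottom.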
Next, the money flow index (MFI) indicator was computed, which measures the buying and selling pressure based on the volume of the stock. It is computed as
where MFI is the money flow index, MFR is the raw money flow, TP is the typical price, V is the volume and H, L and C are the high, low and closing prices.
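A common way to compute the MFI is sketched below. It is an assumption that the project follows this textbook definition: the typical price is TP = (H + L + C) / 3, the raw money flow is TP * V, and the index is 100 - 100 / (1 + ratio), where the ratio is the sum of raw money flow on days when TP rose divided by the sum on days when TP fell over the last n periods. The function name, the default n = 14 and the made-up data are illustrative.

```python
def money_flow_index(high, low, close, volume, n=14):
    """Money flow index over the last n periods; None until enough history exists."""
    tp = [(h + l + c) / 3.0 for h, l, c in zip(high, low, close)]  # typical price
    raw = [p * v for p, v in zip(tp, volume)]                      # raw money flow
    out = [None] * len(close)
    for t in range(n, len(close)):
        days = range(t - n + 1, t + 1)
        positive = sum(raw[k] for k in days if tp[k] > tp[k - 1])
        negative = sum(raw[k] for k in days if tp[k] < tp[k - 1])
        # With no negative flow the ratio is unbounded and the index saturates at 100
        out[t] = 100.0 if negative == 0 else 100.0 - 100.0 / (1.0 + positive / negative)
    return out


# Made-up data, for illustration only
high = [10.0, 12.0, 11.0, 13.0]
low = [8.0, 9.0, 9.0, 10.0]
close = [9.0, 11.0, 10.0, 12.0]
volume = [100.0, 100.0, 100.0, 100.0]
mfi = money_flow_index(high, low, close, volume, n=2)
```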
The Ulcer index was also used. It is a volatility indicator that measures the downside risk in terms of the depth and duration of price declines. The Ulcer index is computed as
where UI is the Ulcer index, SA is the squared average and PD is the percentage drawdown.
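In code, one common variant looks like the sketch below. It assumes that the percentage drawdown PD is measured against the running maximum close inside the window, that SA is the mean of the squared drawdowns, and that UI is the square root of SA. The function name and the default n = 14 are illustrative, and the project's own implementation may define the window differently.

```python
import math


def ulcer_index(close, n=14):
    """Ulcer index over a window of n closes; None until enough history exists."""
    out = [None] * len(close)
    for t in range(n - 1, len(close)):
        window = close[t - n + 1 : t + 1]
        running_max = window[0]
        squared_drawdowns = []
        for c in window:
            running_max = max(running_max, c)
            pd = 100.0 * (c - running_max) / running_max  # percentage drawdown (<= 0)
            squared_drawdowns.append(pd * pd)
        squared_average = sum(squared_drawdowns) / n       # SA
        out[t] = math.sqrt(squared_average)                # UI
    return out


# Made-up closes: a 10 % dip that recovers
ui = ulcer_index([100.0, 90.0, 100.0], n=3)
```

A series that only rises has an Ulcer index of zero, since it never falls below its running maximum.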
The next indicator added was the average true range (ATR), which measures the market volatility. It measures this by decomposing the asset's price range over a period n as
where H is the highest asset price at time t, L is the lowest asset price at time t, C is the closing price at time t and TR is the true range.
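A minimal sketch of the true range and its average follows. It assumes the usual definition TR = max(H - L, |H - C_prev|, |L - C_prev|) and a simple mean over n periods. Wilder's original ATR uses a smoothed average instead, so this is only one possible variant, and the function names are illustrative.

```python
def true_range(high, low, close):
    """True range per day; the first day falls back to high minus low."""
    tr = [high[0] - low[0]]
    for t in range(1, len(close)):
        tr.append(max(
            high[t] - low[t],
            abs(high[t] - close[t - 1]),
            abs(low[t] - close[t - 1]),
        ))
    return tr


def average_true_range(high, low, close, n=14):
    """Simple mean of the true range over the last n periods."""
    tr = true_range(high, low, close)
    out = [None] * len(tr)
    for t in range(n - 1, len(tr)):
        out[t] = sum(tr[t - n + 1 : t + 1]) / n
    return out


# Made-up prices, for illustration only
high = [10.0, 12.0, 11.0, 13.0]
low = [8.0, 9.0, 9.0, 10.0]
close = [9.0, 11.0, 10.0, 12.0]
atr = average_true_range(high, low, close, n=2)
```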
The last two indicators are the simple moving average (SMA) and the exponential moving average (EMA).
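Both moving averages are short to write down. The sketch below assumes the common EMA smoothing factor alpha = 2 / (n + 1) and seeds the EMA with the first value. The function names are illustrative, and the window length used in the project is not stated here.

```python
def simple_moving_average(values, n):
    """Mean of the last n values; None until n values are available."""
    return [
        None if t < n - 1 else sum(values[t - n + 1 : t + 1]) / n
        for t in range(len(values))
    ]


def exponential_moving_average(values, n):
    """EMA with alpha = 2 / (n + 1), seeded with the first value."""
    alpha = 2.0 / (n + 1)
    ema = [values[0]]
    for v in values[1:]:
        ema.append(alpha * v + (1.0 - alpha) * ema[-1])
    return ema


sma = simple_moving_average([1.0, 2.0, 3.0, 4.0, 5.0], n=3)
ema = exponential_moving_average([1.0, 2.0, 3.0], n=3)
```

The EMA reacts faster to recent price changes than the SMA because the weights on older values decay geometrically.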
4.2 Results
The result of using the LSTM model with the above indicators as additional input features is


5. Transformer
The original paper on the attention-based transformer model was published in 2017, and since then its popularity has exploded. The model in transformer.py is the implementation from the original paper, but here only the encoder of the transformer model is used. The model uses positional encoding to give the encoder information about the order of the time steps, while the attention weights represent the importance of the different parts of the input, i.e. what the model should pay more attention to. This, along with multi-head scaled dot-product attention layers with residual connections and a feed-forward linear bottleneck, makes up the transformer encoder. The results for the transformer model, with the same additional indicators as for the LSTM, are
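The two building blocks named above can be sketched in plain Python to show the mechanics. This is not the code from transformer.py: it assumes the sinusoidal positional encoding and the single-head scaled dot-product attention from the original paper, leaves out the learned projections, multiple heads, residual connections and feed-forward layers, and uses illustrative function names and toy inputs.

```python
import math


def positional_encoding(num_steps, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(num_steps)]
    for pos in range(num_steps):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)  # how much this time step attends to every other one
        weights.append(w)
        outputs.append([
            sum(w_j * v[c] for w_j, v in zip(w, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


# Toy input: 3 time steps with 2 features each, plus positional information
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pe = positional_encoding(num_steps=3, d_model=2)
x_pos = [[a + b for a, b in zip(row, p)] for row, p in zip(x, pe)]
attended, attention_weights = scaled_dot_product_attention(x_pos, x_pos, x_pos)
```

Each row of `attention_weights` sums to one and shows how strongly a given time step draws on every other time step.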

