kalman-jax set up on a new dataset and predict on new data points

hi @asolin @wil-j-wil !!

I'm really interested in the models and am trying to set them up on a new dataset.

Could you please review the code below to see if what I'm doing makes sense?

import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler

plot_intermediate = False

import yfinance as yf

Y = np.array(yf.download("SPY", start="2008-01-01", end="2020-12-30")['Close'])

X=np.linspace(1,100,len(Y)).reshape(len(Y),1)

Y=Y.reshape(len(Y),1)


print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]

# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)

# Load cross-validation indices
cvind = np.loadtxt('../experiments/heteroscedastic/cvind.csv').astype(int)

# 10-fold cross-validation setup
nt = np.floor(cvind.shape[0]/10).astype(int)
cvind = np.reshape(cvind[:10*nt], (10, nt))

np.random.seed(123)
fold = 0

# Get training and test indices
test = cvind[fold, :]
train = np.setdiff1d(cvind, test)

# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]

plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');

Aug 25 '20 23:08 andrewcztrack

Hi @andrewcztrack

Glad to see you are trying things out. Everything looks OK except that you're using the cross-validation indices that we stored specifically for a different (smaller) data set. So you've truncated your data (I assume unintentionally).

To generate your own train/test split, you could use the code below:

# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
# Note: np.split requires N to be divisible by 10; truncate ind_shuffled first if it is not
ind_split = np.stack(np.split(ind_shuffled, 10))  # 10 random batches of data indices
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])

This splits your data into a 90% train / 10% test split. However, if you simply want to train on all the data and then make predictions at unseen locations, then you can just set XT to the locations you want to predict at.

Also note that the code is currently scaling and shifting your data, which may or may not be desirable, but is something to keep in mind.
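As a rough illustration of training on all the data and predicting at unseen locations, here is a minimal sketch. The synthetic series, the variable names `X_future` and `n_future`, and the choice to continue the input grid with the same spacing are all assumptions made for the example, not part of the kalman-jax code. The key point is that the new locations go through the same fitted `X_scaler` as the training inputs.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the price series (assumption: 1000 observations)
rng = np.random.default_rng(0)
N = 1000
X = np.linspace(1, 100, N).reshape(-1, 1)
Y = np.cumsum(rng.normal(size=(N, 1)), axis=0)

# Fit the scalers once and train on everything
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
X_train = X_scaler.transform(X)
Y_train = y_scaler.transform(Y)

# Hypothetical future inputs: continue the grid with the same spacing
dt = X[1, 0] - X[0, 0]
n_future = 10
X_future = X[-1, 0] + dt * np.arange(1, n_future + 1).reshape(-1, 1)

# Reuse the already fitted scaler so train and test inputs share one scale
XT = X_scaler.transform(X_future)
```

Predictions made at `XT` are then in standardised units; `y_scaler.inverse_transform` maps them back to the original scale.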

Any other questions, let me know.

Will

Aug 26 '20 06:08 wil-j-wil

Hi @wil-j-wil !!! Thank you so much, you are so generous with your time! Generally speaking, I want the model to be trained and then predict into the future: train on 90 days and predict forward for the next 10 days. I'd also like to understand the cross-validation experiment using your code. So essentially two experiments with the models. I assume it would be advantageous to standardise both the X and Y values, as this is non-stationary, heteroscedastic data? Is my logic correct?

Note that I tried the code below, but I am getting an error.

import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler

plot_intermediate = False

import yfinance as yf

Y = np.array(yf.download("SPY", start="2008-12-01", end="2020-12-30")['Close'])

X=np.linspace(1,100,len(Y)).reshape(len(Y),1)

Y=Y.reshape(len(Y),1)


print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]

# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)


# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.array_split(ind_shuffled, 5))  # 10 random batches of data indices
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])


# Get training and test indices
#test = cvind[fold, :]
#train = np.setdiff1d(cvind, test)

# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]

plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');




[*********************100%***********************]  1 of 1 completed
loading data ...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-68ca1431719f> in <module>
     38 # 10-fold cross-validation setup
     39 ind_shuffled = np.random.permutation(N)
---> 40 ind_split = np.stack(np.array_split(ind_shuffled, 5))  # 10 random batches of data indices
     41 fold = 0
     42 # Get training and test indices

<__array_function__ internals> in stack(*args, **kwargs)

~/miniconda3/envs/myenv1/lib/python3.8/site-packages/numpy/core/shape_base.py in stack(arrays, axis, out)
    423     shapes = {arr.shape for arr in arrays}
    424     if len(shapes) != 1:
--> 425         raise ValueError('all input arrays must have the same shape')
    426 
    427     result_ndim = arrays[0].ndim + 1

ValueError: all input arrays must have the same shape

Aug 27 '20 14:08 andrewcztrack

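That error occurs because the data does not divide evenly into 5 batches: np.array_split then returns batches of different lengths, which np.stack cannot combine into a single array. You could truncate the data slightly to fix it.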
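Two possible ways around this are sketched below. The value of `N` is a made-up example and `n_folds` is a name introduced here. Option 1 truncates the shuffled indices so the folds are equal in size; option 2 keeps every point by working with a plain list of folds instead of stacking them. In both cases the mask length has to match the number of folds, so `np.arange(10)` would need to become `np.arange(5)` for 5 folds.

```python
import numpy as np

N = 3043      # hypothetical number of observations, not divisible by 5
n_folds = 5
np.random.seed(123)
ind_shuffled = np.random.permutation(N)
fold = 0

# Option 1: truncate so that all folds have the same size, then stack
n_keep = n_folds * (N // n_folds)
ind_split = np.stack(np.split(ind_shuffled[:n_keep], n_folds))
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(n_folds) != fold])

# Option 2: keep every point and use a list of unequal folds (no np.stack)
folds = np.array_split(ind_shuffled, n_folds)
test_all = folds[fold]
train_all = np.concatenate([f for i, f in enumerate(folds) if i != fold])
```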

However, didn't you say that you wanted to train on the past and then predict into the future? In this case, you want to just set the first 90 days to be the training data and the last 10 to be test, so you don't need this random split any more.
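A chronological split along those lines might look like the sketch below. The synthetic data and the names `n_test`, `X_train` and `Y_train` are assumptions for illustration. Fitting the scalers on the training portion only is a choice made in this sketch so that the held-out period does not influence the scaling; the code earlier in the thread fits them on all the data instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 consecutive days of a random-walk series
rng = np.random.default_rng(0)
N = 100
X = np.arange(N, dtype=float).reshape(-1, 1)
Y = np.cumsum(rng.normal(size=(N, 1)), axis=0)

# Keep time order: first 90 days for training, last 10 for testing
n_test = 10
X_train_raw, X_test_raw = X[:-n_test], X[-n_test:]
Y_train_raw, Y_test_raw = Y[:-n_test], Y[-n_test:]

# Fit the scalers on the training portion only (a choice made in this sketch)
X_scaler = StandardScaler().fit(X_train_raw)
y_scaler = StandardScaler().fit(Y_train_raw)

X_train = X_scaler.transform(X_train_raw)
Y_train = y_scaler.transform(Y_train_raw)
XT = X_scaler.transform(X_test_raw)
YT = y_scaler.transform(Y_test_raw)
```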

Standardising the data might be fine, but just remember that this means the input will no longer be the exact time stamp.

Aug 28 '20 11:08 wil-j-wil
