kalman-jax
set up on a new dataset and predict on new data points
hi @asolin @wil-j-wil !!
I'm really interested in the models and am trying to set them up on a new dataset.
Could you please review the code below to see whether what I'm doing makes sense?
import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler
plot_intermediate = False
import yfinance as yf
Y = np.array(yf.download("SPY", start="2008-01-01", end="2020-12-30")['Close'])
X=np.linspace(1,100,len(Y)).reshape(len(Y),1)
Y=Y.reshape(len(Y),1)
print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]
# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)
# Load cross-validation indices
cvind = np.loadtxt('../experiments/heteroscedastic/cvind.csv').astype(int)
# 10-fold cross-validation setup
nt = np.floor(cvind.shape[0]/10).astype(int)
cvind = np.reshape(cvind[:10*nt], (10, nt))
np.random.seed(123)
fold = 0
# Get training and test indices
test = cvind[fold, :]
train = np.setdiff1d(cvind, test)
# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]
plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');
Hi @andrewcztrack
Glad to see you are trying things out. Everything looks OK except that you're using the cross-validation indices that we stored specifically for a different (smaller) data set. So you've truncated your data (I assume unintentionally).
To generate your own train/test split, you could use the code below:
# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.split(ind_shuffled, 10)) # 10 random batches of data indices
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])
This splits your data into a 90% train / 10% test split. However, if you simply want to train on all the data and then make predictions at unseen locations, then you can just set XT to the locations you want to predict at.
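For the second option (train on everything, predict at unseen locations), a minimal sketch of building XT might look like the following. The placeholder data, `n_ahead`, and the assumption of evenly spaced inputs are illustrative and not part of the original script; only the `X`/`Y`/`XT` naming follows it.

```python
import numpy as np

# Placeholder data with the same input convention as the script above
# (evenly spaced X, one column). N and Y are illustrative assumptions.
N = 500
X = np.linspace(1, 100, N).reshape(N, 1)
Y = np.sin(X / 10.0)

# Spacing between consecutive inputs (assumes an evenly spaced grid)
step = X[1, 0] - X[0, 0]

# Hypothetical number of future locations to predict at
n_ahead = 10

# XT continues the input grid beyond the last training point
XT = (X[-1, 0] + step * np.arange(1, n_ahead + 1)).reshape(n_ahead, 1)
```

Here all of X and Y are kept for training, and XT simply takes the place of the held-out inputs. There is no YT in this case, since nothing has been observed at those locations yet.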
Also note that the code is currently scaling and shifting your data, which may or may not be desirable, but is something to keep in mind.
Any other questions, let me know.
Will
Hi @wil-j-wil! Thank you so much, you are very generous with your time! Generally speaking, I want the model to be trained and then predict into the future: train on 90 days and predict forward for the next 10 days. I would also like to understand the cross-validation experiment using your code, so essentially two experiments with the models. I assume it would be advantageous to standardise the X and Y values, since the data is non-stationary and heteroscedastic? Is my logic correct?
Note that I tried the code below, but I am getting an error.
import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler
plot_intermediate = False
import yfinance as yf
Y = np.array(yf.download("SPY", start="2008-12-01", end="2020-12-30")['Close'])
X=np.linspace(1,100,len(Y)).reshape(len(Y),1)
Y=Y.reshape(len(Y),1)
print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]
# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)
# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.array_split(ind_shuffled, 5)) # 10 random batches of data indices
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])
# Get training and test indices
#test = cvind[fold, :]
#train = np.setdiff1d(cvind, test)
# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]
plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');
[*********************100%***********************] 1 of 1 completed
loading data ...
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-68ca1431719f> in <module>
38 # 10-fold cross-validation setup
39 ind_shuffled = np.random.permutation(N)
---> 40 ind_split = np.stack(np.array_split(ind_shuffled, 5)) # 10 random batches of data indices
41 fold = 0
42 # Get training and test indices
<__array_function__ internals> in stack(*args, **kwargs)
~/miniconda3/envs/myenv1/lib/python3.8/site-packages/numpy/core/shape_base.py in stack(arrays, axis, out)
423 shapes = {arr.shape for arr in arrays}
424 if len(shapes) != 1:
--> 425 raise ValueError('all input arrays must have the same shape')
426
427 result_ndim = arrays[0].ndim + 1
ValueError: all input arrays must have the same shape
That error is because the data does not divide evenly into 5 batches. You could truncate the data slightly to fix it.
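A minimal sketch of the truncation approach, assuming a hypothetical `N` that is not a multiple of the number of folds. Note also that the boolean mask used to pick the training batches needs one entry per fold, so `np.arange(10)` would have to become `np.arange(5)` if 5 folds are used.

```python
import numpy as np

N = 3042      # hypothetical data length, not a multiple of 5
n_folds = 5

rng = np.random.default_rng(123)
ind_shuffled = rng.permutation(N)

# Drop the last few shuffled indices so the rest divides evenly
n_keep = (N // n_folds) * n_folds
ind_split = np.stack(np.split(ind_shuffled[:n_keep], n_folds))

fold = 0
# Get training and test indices
test = ind_split[fold]
# The mask length must match the number of folds
train = np.concatenate(ind_split[np.arange(n_folds) != fold])
```

With these numbers, two data points are discarded and each fold holds 608 indices.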
However, didn't you say that you wanted to train on the past and then predict into the future? In this case, you want to just set the first 90 days to be the training data and the last 10 to be test, so you don't need this random split any more.
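As a rough sketch of that chronological split, assuming the data is already ordered in time (the random-walk series and the `X_train`/`Y_train` names are placeholders, not from the original script):

```python
import numpy as np

n_train, n_test = 90, 10
N = n_train + n_test

# Placeholder time index and series, ordered in time
X = np.arange(N, dtype=float).reshape(N, 1)
Y = np.cumsum(np.random.default_rng(0).normal(size=N)).reshape(N, 1)

# First 90 points for training, last 10 for testing, no shuffling
X_train, Y_train = X[:n_train], Y[:n_train]
XT, YT = X[n_train:], Y[n_train:]
```

Because nothing is shuffled, every test input lies strictly after every training input, which is what predicting into the future requires.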
Standardising the data might be fine, but just remember that this means the input will no longer be the exact time stamp.
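To illustrate, here is a small NumPy sketch of the same zero-mean, unit-variance scaling that `StandardScaler` applies by default, together with the inverse mapping back to the original time axis. The variable names are illustrative assumptions.

```python
import numpy as np

# Original inputs, using the same convention as the script above
X = np.linspace(1, 100, 200).reshape(-1, 1)

# Shift and scale (population standard deviation, ddof=0)
x_mean = X.mean(axis=0)
x_std = X.std(axis=0)
X_scaled = (X - x_mean) / x_std

# The model only ever sees X_scaled, so its inputs are in units of
# standard deviations rather than time stamps. Map back for plotting:
X_back = X_scaled * x_std + x_mean
```

One consequence is that any learned quantity with units of the input, such as a kernel lengthscale, is expressed on the scaled axis; multiplying it by `x_std` would convert it back to the original units.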