pytorch-forecasting
Decode groupIDs and targets
I use the column filename to indicate my different time series (see loader code at the bottom). During testing, I am trying to match predictions with the original dataset. All is going OK, except the filename is now an integer, not a filename. How do I decode the original filename from `batch['x']['groups']`?
Is `groups` not the right info to use? And how do I decode it? Would the mapping be in `dataset.get_parameters()`?
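For what it's worth, a fitted `NaNLabelEncoder` keeps its label-to-code mapping; in pytorch-forecasting it should be reachable through the dataset's categorical encoders (e.g. via `dataset.get_parameters()` — the exact attribute path is my assumption, not verified). The decoding itself is just an inverted dict lookup; a minimal sketch with a stand-in `classes_` dict and made-up filenames:

```python
import pandas as pd

# Stand-in for a fitted NaNLabelEncoder's classes_ mapping (label -> code);
# with add_nan=True, the NaN bucket typically takes a code of its own (assumption).
classes_ = {"nan": 0, "series_a.csv": 1, "series_b.csv": 2}
code_to_filename = {code: name for name, code in classes_.items()}

# integer group ids as they come out of batch['x']['groups']
group_codes = pd.Series([1, 2, 1])
filenames = group_codes.map(code_to_filename)
print(filenames.tolist())  # ['series_a.csv', 'series_b.csv', 'series_a.csv']
```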
```python
def test_step(self, batch, batch_idx):
    x, y = batch
    for i in range(len(x["encoder_cont"])):
        # preds is computed earlier in the step (see the full version below)
        row = [x['decoder_time_idx'][i][-1].tolist(),
               x['groups'][i][0].tolist(),
               x['decoder_target'][i][0].tolist(),
               preds[i].tolist()]
        self.datarows.append(row)
```
Later on, after all steps are completed, I convert to a dataframe:

```python
df = pd.DataFrame(self.datarows, columns=['idx', 'group', 'y', 'y_hat'])
```

Then I can match each `y_hat` per row to the original dataset on matching `idx` and `group`.
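That matching step is just a pandas merge on the two key columns. A toy sketch (all column values are made up):

```python
import pandas as pd

# hypothetical prediction rows collected during the test steps
preds_df = pd.DataFrame({"idx": [10, 11, 10],
                         "group": [0, 0, 1],
                         "y_hat": [2, 1, 0]})

# hypothetical slice of the original dataset
orig_df = pd.DataFrame({"idx": [9, 10, 11, 10],
                        "group": [0, 0, 0, 1],
                        "Close": [1.0, 1.1, 1.2, 5.0]})

# inner merge keeps only rows that have a prediction
merged = pd.merge(orig_df, preds_df, on=["idx", "group"], how="inner")
print(merged)
```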
Am I missing an easier way to do this?
This is my TimeSeries init:
```python
self.training_dataset = TimeSeriesDataSet(
    dataset[lambda x: x['idx'] <= training_cutoff],
    time_idx='idx',
    target="y",
    group_ids=["filename"],  # separates the different time series
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=[],
    static_reals=[],
    time_varying_known_categoricals=[],  # if time shifted
    time_varying_known_reals=['ATR', 'Open', 'High', 'Low', 'Close', 'Volume',
                              'CC', 'Close_pct', 'Volume_pct'],  # add other variables later on
    time_varying_unknown_categoricals=['y'],
    time_varying_unknown_reals=[],  # continuous variables that change over time and are not known in the future
    categorical_encoders={
        'filename': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=True),
        'y': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=False),
    },  # how are NaNs processed? there should be none
    # scalers={"Close": None, "idx": None, 'Volume_pct': None},  # defaults to sklearn's StandardScaler
    target_normalizer=pytorch_forecasting.data.encoders.NaNLabelEncoder(),
    add_relative_time_idx=False,
    add_target_scales=False,  # what is this?
    add_encoder_length=False,
    allow_missing_timesteps=False,  # does not allow missing idx
    predict_mode=False,  # to get only the last output
)
```
I found the magic function `dataset.x_to_index()`, which lets me get the `idx` and group from `x`.
I am still struggling to find a simple map for my target categorical variable. In `batch.y` it is 0/1/2, whereas in the original data it's a string (Buy/Sell/None). It seems like there should be a simple mapping saved somewhere...
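Assuming the target's `NaNLabelEncoder` stores the same kind of label-to-code dict as the group encoder (with `add_nan=False` there is no NaN bucket), decoding the 0/1/2 codes is the same inverted lookup. A stand-in sketch — the actual label order is whatever the encoder learned from the data, not necessarily this one:

```python
# stand-in for the target NaNLabelEncoder's learned mapping (assumed order)
classes_ = {"Buy": 0, "None": 1, "Sell": 2}
code_to_label = {code: label for label, code in classes_.items()}

# integer target codes as they appear in batch.y
codes = [0, 2, 1, 0]
decoded = [code_to_label[c] for c in codes]
print(decoded)  # ['Buy', 'Sell', 'None', 'Buy']
```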
For the record here is how I merge my original dataframe with predictions:
```python
def test_step(self, batch, batch_idx):
    x, y = batch  # not sure if this should instead be batch[0], batch[1]
    y_hat = self(x)
    loss = self.criterion(y_hat, y[0].squeeze(1))
    self.log('test_loss', loss, batch_size=self.batch_size)
    # convert to predictions
    preds = torch.argmax(y_hat, axis=1)
    self.test_step_y_hats.append(preds)
    self.test_step_ys.append(y[0].squeeze(1))
    self.test_step_xs.append(x["encoder_cont"])
    # save predictions for reconstruction
    for i in range(len(x["encoder_cont"])):
        row = [x['decoder_target'][i][0].tolist(), preds[i].tolist()]
        self.datarows.append(row)
    self.databatch_x.append(x)
```
Then in my main function I initiate the test phase and call this and merge with:
```python
self.trainer.test(model=self.model, dataloaders=dataloader)
# get x, y, y_hat, mapping
datarows = self.model.get_datarows()
print('Number of datarows: ' + str(len(datarows)))
df = pd.DataFrame(datarows, columns=['y_orig', 'y_hat'])
# get mapping to groups for y
databatch_x = self.model.get_databatch_x()
id_file = []
for xbatch in databatch_x:
    mapping = dataset.x_to_index(xbatch)
    id_file.append(mapping)
index_map = pd.concat(id_file, ignore_index=True)  # one big map dataframe
# add y to map
index_map['y_orig'] = df['y_orig']
index_map['y_hat'] = df['y_hat']
print(df.head())
# merge with original data
all_data_preds = pd.merge(index_map, orig_data, on=['idx', 'filename'], how='inner')
print(all_data_preds.head(50))
```
I hope this helps someone save time.
Still looking for a quick way to get a dictionary of how my target is encoded. Inverse transforming didn't seem to work.
Update: after I get this merged df, I can decode the y labels back from int encoding to their original strings by learning my own mapping. Still looking for where this dict is stored in the library.
```python
def get_target_mapping(self, df):
    # build a dict mapping each original string label ('y')
    # to its integer code ('y_orig')
    mapping_dict = {}
    for y_value in df['y'].unique():
        # find the corresponding encoded value
        corresponding_y_hat = df[df['y'] == y_value]['y_orig'].iloc[0]
        mapping_dict[y_value] = corresponding_y_hat
    return mapping_dict

def apply_decode_target(self, df):
    mapping_dict = self.get_target_mapping(df)
    inverted_dict = {value: key for key, value in mapping_dict.items()}
    df['predictions'] = df['y_hat'].map(inverted_dict)
    return df
```
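The loop above can be collapsed into a couple of pandas one-liners. Same idea, shown on made-up data:

```python
import pandas as pd

# hypothetical merged frame: string label, its integer code, and a prediction code
df = pd.DataFrame({"y": ["Buy", "Sell", "Buy", "None"],
                   "y_orig": [0, 2, 0, 1],
                   "y_hat": [2, 2, 0, 1]})

# unique (code -> label) pairs in one shot, replacing the manual loop
inverted_dict = df.drop_duplicates("y_orig").set_index("y_orig")["y"].to_dict()
df["predictions"] = df["y_hat"].map(inverted_dict)
print(df["predictions"].tolist())  # ['Sell', 'Sell', 'Buy', 'None']
```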
You can't put those variables in `time_varying_known_reals`, by the way; that is for covariates known in the future, like time encodings, AFAIK.