Pyraformer
Q: So for the App Flow dataset, the only feature is time?
https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L19-L22
These lines extract time, weekday, hour, and month, which are then used here:
https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L54-L57
I'm just wondering:
- Why not use, for example, zone (converted to an integer) as an extra feature, and how would the model perform in that case?
- Or: if the training data contains only the single time feature (without weekday, hour, month), will the model still perform well?
Sorry for the silly questions; I'd like to hear your insight.
Thanks.
Hi,
- The information of 'zone' and 'app_name' is actually used, see https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L13 and https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L57. Each 'app_name' in each 'zone' corresponds to a time series, so we convert the 'app_name' and 'zone' information into an integer, namely, the 'seq_id'.
- It is also possible to make predictions based solely on historical time series. Following previous works, our implementation introduced these covariates.
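The mapping from ('app_name', 'zone') to an integer 'seq_id' described in the first point can be sketched roughly as follows. This is a minimal illustration on made-up data, not the repository's code: the column names follow this thread, while the toy values and the `seq_ids` dictionary are assumptions introduced for the example.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "app_name": ["a", "a", "b", "b", "a", "a"],
    "zone":     ["z1", "z1", "z1", "z1", "z2", "z2"],
    "time":     ["2021-01-01 00:00", "2021-01-01 01:00"] * 3,
    "value":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Each (app_name, zone) pair forms one time series; its position in the
# list of groups can serve as an integer id (hypothetical `seq_ids`).
grouped = list(df.groupby(["app_name", "zone"]))
seq_ids = {key: i for i, (key, _) in enumerate(grouped)}

for key, i in seq_ids.items():
    print(key, "->", i)
```

With this kind of encoding, the identity of the series reaches the model as a single integer rather than as two separate categorical columns.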
Ok, so the app_name and zone are there, but how about the previous values of the raw input sequence (inside the window size)?
Let's check the raw input sequence data, in: https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L17-L26
single_df = grouped_data[i][1].drop(labels=['app_name', 'zone'], axis=1).sort_values(by="time", ascending=True)
times = pd.to_datetime(single_df.time)
single_df['weekday'] = times.dt.dayofweek / 6
single_df['hour'] = times.dt.hour / 23
single_df['month'] = times.dt.month / 12
temp_data = single_df.values[:, 1:] # L22, 'time' column is dropped here
if (temp_data[:, 0] == 0).sum() / len(temp_data) > 0.2:
    continue
all_data.append(temp_data)
We can see that temp_data[:, 0] is the raw input sequence: 'app_name' and 'zone' are dropped on L17 and 'time' is dropped on L22, so temp_data[:, 0] is the 'value' column of the original csv file.
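To double-check the column bookkeeping, here is a small sketch that replays the same steps on a toy frame. It assumes that, once 'app_name' and 'zone' are removed, the remaining columns are 'time' followed by 'value' in that order; the toy values are made up.

```python
import pandas as pd

# Toy stand-in for one group after 'app_name' and 'zone' were dropped.
# Assumed column order: 'time', then 'value'.
single_df = pd.DataFrame({
    "time": ["2021-01-04 01:00", "2021-01-04 00:00"],
    "value": [7.0, 5.0],
}).sort_values(by="time", ascending=True)

times = pd.to_datetime(single_df.time)
single_df["weekday"] = times.dt.dayofweek / 6
single_df["hour"] = times.dt.hour / 23
single_df["month"] = times.dt.month / 12

# Slicing off the first column removes 'time'; 'value' becomes column 0.
kept_columns = list(single_df.columns[1:])
temp_data = single_df.values[:, 1:]

print(kept_columns)
print(temp_data[:, 0])
```

Under that assumption, column 0 of temp_data holds 'value' and columns 1 to 3 hold the weekday, hour, and month covariates.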
Then, in https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L55
single_data[:, 0] = seq_data.copy()
is the real raw input sequence data,
but in https://github.com/ant-research/Pyraformer/blob/master/data_loader.py#L513-L518
cov = all_data[:, :, 1:] # the real raw input sequence data 'value' (all_data[:, :, 0]) dropped?
split_start = len(label[0]) - self.pred_length + 1
data, label = split(split_start, label, cov, self.pred_length)
return data, label
it's dropped from the training data?
That's my question: are the previous values of the raw input sequence not used at all in training?
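To make the question concrete, here is a small NumPy sketch of what the slice in the data loader does. The array shape (windows, window length, channels) and the toy values are assumptions for illustration; the sketch only shows that `cov` no longer contains channel 0. Whether the 'value' channel still reaches the model through `label`, which is passed to `split` together with `cov`, cannot be told from the excerpt above.

```python
import numpy as np

# Hypothetical array shaped (num_windows, window_length, num_channels):
# channel 0 stands in for 'value', channels 1..3 for the covariates.
all_data = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

cov = all_data[:, :, 1:]    # covariates only; channel 0 is left out
value = all_data[:, :, 0]   # the channel that the slice above omits

print(cov.shape)    # three covariate channels remain
print(value.shape)  # one value per time step in each window
```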