Pyraformer Q: For the App Flow dataset, is time the only feature?

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L19-L22

These lines extract the features time, weekday, hour, and month,

which are used here:

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L54-L57

I'm just wondering:

Why not use, for example, zone (converted to an integer) as an extra feature, and how would the model perform in that case?
Or: if the training data contains only the single time feature (without weekday, hour, month), will the model still perform well?

Sorry if these are silly questions; I'd like to hear your insights.

Thanks.

Apr 14 '23 05:04 mw66

Hi,

The 'zone' and 'app_name' information is actually used; see https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L13 and https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L57. Each 'app_name' in each 'zone' corresponds to one time series, so we convert the 'app_name' and 'zone' information into an integer, namely the 'seq_id'.
It is also possible to make predictions based solely on the historical time series. Our implementation introduces these covariates to follow previous works.
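
As a rough illustration of the idea described above (one integer id per 'app_name'/'zone' pair), here is a minimal sketch. The toy DataFrame and the `seq_ids` dictionary are assumptions made up for this example; only the column names come from the thread, and this is not the repository's actual code.

```python
import pandas as pd

# Hypothetical toy data; only the column names are taken from the thread.
df = pd.DataFrame({
    "app_name": ["a", "a", "b", "b"],
    "zone": ["z1", "z1", "z1", "z2"],
    "time": ["2023-01-01 00:00", "2023-01-01 01:00",
             "2023-01-01 00:00", "2023-01-01 00:00"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Each (app_name, zone) pair is treated as one time series. Its position
# in the grouped list is used as an integer series id.
grouped = list(df.groupby(["app_name", "zone"]))
seq_ids = {key: i for i, (key, _) in enumerate(grouped)}

for key, i in seq_ids.items():
    print(key, "->", i)
```

With this kind of encoding, the zone and app name reach the model only through the single integer id rather than as separate feature columns.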

Apr 14 '23 06:04 Zhazhan

OK, so app_name and zone are there, but what about the previous values of the raw input sequence (inside the window)?

Let's check how the raw input sequence data is handled in: https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L17-L26

        single_df = grouped_data[i][1].drop(labels=['app_name', 'zone'], axis=1).sort_values(by="time", ascending=True)
        times = pd.to_datetime(single_df.time)
        single_df['weekday'] = times.dt.dayofweek / 6
        single_df['hour'] = times.dt.hour / 23
        single_df['month'] = times.dt.month / 12
        temp_data = single_df.values[:, 1:]    # L22, 'time' column is dropped here
        if (temp_data[:, 0] == 0).sum() / len(temp_data) > 0.2:
            continue

        all_data.append(temp_data)

We can see that temp_data[:, 0] is the raw input sequence: 'app_name' and 'zone' are dropped on L17 and 'time' is dropped on L22, so temp_data[:, 0] is the 'value' column from the original csv file.
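
To double-check that column layout, here is a small sketch that reruns the same steps on a made-up two-row frame. It assumes that, after 'app_name' and 'zone' are dropped, the remaining columns are 'time' followed by 'value'; the sample values are invented.

```python
import pandas as pd

# Hypothetical frame as it might look after dropping 'app_name' and 'zone'.
single_df = pd.DataFrame({
    "time": ["2023-04-14 05:00", "2023-04-14 06:00"],
    "value": [0.0, 7.0],
})

# Same covariate construction as in the quoted snippet.
times = pd.to_datetime(single_df["time"])
single_df["weekday"] = times.dt.dayofweek / 6
single_df["hour"] = times.dt.hour / 23
single_df["month"] = times.dt.month / 12

# Slicing off the first column removes 'time', so 'value' becomes column 0.
kept_columns = list(single_df.columns[1:])
temp_data = single_df.values[:, 1:]

# Fraction of zero entries in the 'value' column, as in the quoted filter.
zero_ratio = (temp_data[:, 0] == 0).sum() / len(temp_data)

print(kept_columns)
print(zero_ratio)
```

Under that assumption the kept columns are 'value', 'weekday', 'hour', 'month', in that order.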

Then, in https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L55

  single_data[:, 0] = seq_data.copy()

is the real raw input sequence data,

but in https://github.com/ant-research/Pyraformer/blob/master/data_loader.py#L513-L518

        cov = all_data[:, :, 1:]   # the real raw input sequence data 'value' (all_data[:, :, 0]) dropped?

        split_start = len(label[0]) - self.pred_length + 1
        data, label = split(split_start, label, cov, self.pred_length)

        return data, label

is it dropped from the training data?

So my question is: are the previous values of the raw input sequence not used at all in training?
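
To make the slicing concrete, here is a minimal NumPy sketch. The array shape and the feature order [value, weekday, hour, month, seq_id] are assumptions for illustration. The quoted lines also do not show how `label` is built or what `split` does with it, so this sketch alone does not settle whether the values reach the model by another path.

```python
import numpy as np

# Hypothetical array: 2 series, 3 time steps, 5 features, assumed to be
# laid out as [value, weekday, hour, month, seq_id].
all_data = np.arange(2 * 3 * 5, dtype=float).reshape(2, 3, 5)

# Column 0 is the raw series value.
values = all_data[:, :, 0]

# This slice keeps every feature except column 0.
cov = all_data[:, :, 1:]

print(values.shape)  # (2, 3)
print(cov.shape)     # (2, 3, 4)
```

So `cov` by itself contains no raw values; whether they are used depends on the `label` argument passed to `split`.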

Apr 15 '23 18:04 mw66
