Pyraformer

Q: So for the App Flow dataset, the only feature is time?

Open mw66 opened this issue 1 year ago • 2 comments

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L19-L22

These lines extract: time, weekday, hour, month

and they are used here:

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L54-L57

I'm just wondering:

  1. Why not use, for example, zone (converted to some integer) as an extra feature? In that case, how does the model perform?

  2. Or, if the training data contains only the single time feature (without weekday, hour, month), will the model still perform well?

Sorry for the silly questions; I'd like to hear your insight.

Thanks.

mw66 avatar Apr 14 '23 05:04 mw66

Hi,

  1. The 'zone' and 'app_name' information is actually used; see https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L13 and https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L57. Each 'app_name' in each 'zone' corresponds to one time series, so we convert the 'app_name' and 'zone' information into an integer, namely the 'seq_id'.
  2. It is also possible to make predictions based solely on the historical time series. Following previous works, our implementation introduces these covariates.
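
As a rough illustration of point 1, here is a minimal plain-Python sketch of mapping each ('app_name', 'zone') pair to an integer 'seq_id'. It is not the repository's code; the rows, their values, and the names `rows`, `seq_ids`, and `series` are hypothetical.

```python
# Hypothetical rows; the real CSV contents are not shown in this thread.
rows = [
    {"app_name": "app_a", "zone": "z1", "time": "2023-04-14 05:00", "value": 10.0},
    {"app_name": "app_a", "zone": "z2", "time": "2023-04-14 05:00", "value": 11.0},
    {"app_name": "app_b", "zone": "z1", "time": "2023-04-14 05:00", "value": 12.0},
    {"app_name": "app_a", "zone": "z1", "time": "2023-04-14 06:00", "value": 13.0},
]

seq_ids = {}  # (app_name, zone) -> integer seq_id, assigned in order of first appearance
series = {}   # seq_id -> list of values belonging to that time series

for row in rows:
    key = (row["app_name"], row["zone"])
    # A new pair gets the next unused integer; a known pair keeps its id.
    seq_id = seq_ids.setdefault(key, len(seq_ids))
    series.setdefault(seq_id, []).append(row["value"])

print(seq_ids)
print(series)
```

Each distinct pair ends up with its own id, so the id alone identifies which time series a sample came from.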

Zhazhan avatar Apr 14 '23 06:04 Zhazhan

OK, so the app_name and zone are there, but what about the previous values of the raw input sequence (inside the window)?

Let's check the raw input sequence data, in: https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L17-L26

        single_df = grouped_data[i][1].drop(labels=['app_name', 'zone'], axis=1).sort_values(by="time", ascending=True)
        times = pd.to_datetime(single_df.time)
        single_df['weekday'] = times.dt.dayofweek / 6
        single_df['hour'] = times.dt.hour / 23
        single_df['month'] = times.dt.month / 12
        temp_data = single_df.values[:, 1:]    # L22, 'time' column is dropped here
        if (temp_data[:, 0] == 0).sum() / len(temp_data) > 0.2:
            continue

        all_data.append(temp_data)

We can see that temp_data[:, 0] is the raw input sequence: 'app_name' and 'zone' are dropped on L17, and 'time' is dropped on L22, so temp_data[:, 0] is the 'value' column of the original csv file.
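
To double-check the column layout, here is a small pandas sketch that replays the quoted steps on a toy frame. The column order (app_name, zone, time, value) and all the values are assumptions for illustration, not the real csv.

```python
import pandas as pd

# Hypothetical group for one (app_name, zone) pair; column order is assumed.
df = pd.DataFrame({
    "app_name": ["a", "a", "a"],
    "zone": ["z1", "z1", "z1"],
    "time": ["2023-04-14 05:00:00", "2023-04-14 06:00:00", "2023-04-14 07:00:00"],
    "value": [10.0, 12.0, 11.0],
})

# Same steps as the quoted excerpt.
single_df = df.drop(labels=["app_name", "zone"], axis=1).sort_values(by="time", ascending=True)
times = pd.to_datetime(single_df.time)
single_df["weekday"] = times.dt.dayofweek / 6
single_df["hour"] = times.dt.hour / 23
single_df["month"] = times.dt.month / 12

# Slicing off column 0 removes 'time'; these are the columns that remain.
remaining_columns = list(single_df.columns[1:])
temp_data = single_df.values[:, 1:]

print(remaining_columns)
print(temp_data[:, 0])
```

Under that assumed column order, the first remaining column is 'value', followed by the three calendar covariates.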

Then, in https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L55

  single_data[:, 0] = seq_data.copy()

is the real raw input sequence data,

but in https://github.com/ant-research/Pyraformer/blob/master/data_loader.py#L513-L518

        cov = all_data[:, :, 1:]   # the real raw input sequence data 'value' (all_data[:, :, 0]) dropped?

        split_start = len(label[0]) - self.pred_length + 1
        data, label = split(split_start, label, cov, self.pred_length)

        return data, label

is it dropped from the training data?

That's my question: are the previous values of the raw input sequence not used at all in training?
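
To make the slice concrete, here is a toy NumPy sketch. It is not the repository's loader: the array shape, `pred_length`, and the definition of `label` as channel 0 are assumptions for illustration only.

```python
import numpy as np

# Toy array: 2 sequences, 10 time steps, 4 channels.
# Assumed channel layout: 0 = value, 1..3 = weekday/hour/month style covariates.
all_data = np.arange(2 * 10 * 4, dtype=float).reshape(2, 10, 4)

# Assumption: label holds channel 0. The quoted excerpt does not show how label is built.
label = all_data[:, :, 0]

# Same slice as in the quoted excerpt: everything except channel 0.
cov = all_data[:, :, 1:]

pred_length = 3  # hypothetical prediction length
split_start = len(label[0]) - pred_length + 1

print(cov.shape)    # channel 0 is absent from cov
print(split_start)
```

So `cov` by itself does not contain the value channel. The quoted call still passes `label` into `split`, so whether the values reach the model depends on how `label` is built and how `split` uses it, which the excerpt does not show.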

mw66 avatar Apr 15 '23 18:04 mw66