A doubt about data augmentation
Thanks for your nice work, but the data augmentation may have a leakage problem. More precisely, the pseudo-prior items may see the test information before inference.
import copy
from collections import defaultdict

import numpy as np

def data_augment(model, dataset, args, sess, gen_num):
    [train, valid, test, original_train, usernum, itemnum] = copy.deepcopy(dataset)
    all_users = list(train.keys())
    cumulative_preds = defaultdict(list)  # pseudo-items generated in earlier rounds
    for num_ind in range(gen_num):  # one generation round per pseudo-item
        batch_seq = []
        batch_u = []
        batch_item_idx = []
        for u_ind, u in enumerate(all_users):
            # Full interaction history: train + valid + test + previously
            # generated pseudo-items.
            u_data = train.get(u, []) + valid.get(u, []) + test.get(u, []) + cumulative_preds.get(u, [])
            # Skip users with no history or with at least args.M interactions.
            if len(u_data) == 0 or len(u_data) >= args.M:
                continue
            # Right-align the most recent args.maxlen interactions.
            seq = np.zeros([args.maxlen], dtype=np.int32)
            idx = args.maxlen - 1
            for i in reversed(u_data):
                if idx == -1:
                    break
                seq[idx] = i
                idx -= 1
            # Candidate items: all items the user has not interacted with.
            rated = set(u_data)
            item_idx = list(set(range(itemnum)) - rated)
            batch_seq.append(seq)
            batch_item_idx.append(item_idx)
            batch_u.append(u)
            # ... (snippet truncated in the issue)
The user data (i.e. 'u_data = train.get(u, []) + valid.get(u, []) + test.get(u, []) + cumulative_preds.get(u, [])') includes the test data and is used to generate the prior data. The augmented data (prior data + train data + valid data) then trains the left-to-right model in the fine-tuning stage, and that model produces the recommendation results. So both the augmented data and the left-to-right model see the test data (test-data leakage) before inference.
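To make the claim concrete, here is a toy trace (the item ids are made up for illustration, not taken from any dataset) of how the held-out test item enters u_data:

# Toy trace: the held-out test item ends up in the sequence
# that is fed to the pseudo-item generator.
train_u, valid_u, test_u = [3, 7], [9], [12]
u_data = train_u + valid_u + test_u   # [3, 7, 9, 12]
# 12 is the real test item, yet it is part of the generator's input.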
Hi,
You might have misunderstood the code. When we do data augmentation, the sequence is reversed, so the so-called "test" item is the earliest interacted item in the normal (non-reversed) order.
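For illustration, a toy sketch of the reversal (made-up item ids, continuing the example above; this is not code from the repository):

# Chronological history: oldest -> newest; 12 is the held-out test item.
history = [3, 7, 9, 12]
reversed_history = history[::-1]   # [12, 9, 7, 3]
# The reversed sequence ends at the earliest interaction (3), so the
# generated pseudo-prior items extend the history backwards in time.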
Best.
Ziwei Fan
Thanks for your reply. I do not think I misunderstood the code; the data leakage problem does exist. Setting aside the order question: when the sequence is extended, the user's entire interaction record, including the real test item to be predicted at the inference stage, is used as the input sequence for generating pseudo-items. In addition, the so-called "test" items above are indeed the real test items; you may want to re-check the code. In short, I believe the strong results come from this disclosure of the test data. When I fixed the problem (a sketch of the change is below), I did not get good results. Looking forward to your reply.
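For reference, a minimal sketch of the kind of fix described above, assuming it simply drops the held-out interactions from the generator input; only the u_data line inside data_augment changes:

# Hypothetical leak-free variant: build the generator input from the
# training interactions and previously generated pseudo-items only, so
# the held-out valid/test items never reach the pseudo-item generator.
u_data = train.get(u, []) + cumulative_preds.get(u, [])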
Hi,
The data augmentation itself does not involve any training. In other words, the pre-training step does not use any ground-truth data, so there is no data leakage. Also, the data augmentation step uses the reversed sequence, which does not match the normal-order sequence prediction done downstream.
Best regards,
Ziwei Fan