
Bug found in `Ali_Display_Ad_Click` dataset preprocessing, which has been used for DMR model reproduction.

Open · zhujiem opened this issue 2 years ago • 1 comment

The dataset directory "Ali_Display_Ad_Click" indicates that the preprocessed data is downloaded directly from the following path: https://github.com/PaddlePaddle/PaddleRec/blob/master/datasets/Ali_Display_Ad_Click/run.sh#L3

wget https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip

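However, the ID encoding in this preprocessed data is problematic: after encoding, the test set still contains IDs that never appear in the train set. A likely cause is that the encoding dictionary was not built from the train set alone, so new IDs that occur only in the test set also ended up in the dictionary. As a result, the number of feature embeddings allocated during training is larger than it should be, and at test time the embeddings of IDs that were never trained keep their random initial values, which lowers the measured model performance.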
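For illustration, a minimal sketch of an encoding whose dictionary is built from the train set only. This is not the preprocessing code used by PaddleRec; `build_vocab`, `encode`, `OOV_INDEX`, and the toy IDs are hypothetical names and values, and reserving index 0 for unseen IDs is an assumption (the real pipeline may already use 0 for padding or missing values).

```python
from typing import Dict, Iterable, List

OOV_INDEX = 0  # hypothetical reserved index for IDs not seen in the train set


def build_vocab(train_ids: Iterable[str]) -> Dict[str, int]:
    """Assign contiguous indices (starting at 1) to IDs seen in the train set only."""
    vocab: Dict[str, int] = {}
    for raw_id in train_ids:
        if raw_id not in vocab:
            vocab[raw_id] = len(vocab) + 1  # index 0 stays reserved for OOV
    return vocab


def encode(ids: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Map raw IDs to indices; IDs absent from the train vocabulary become OOV_INDEX."""
    return [vocab.get(raw_id, OOV_INDEX) for raw_id in ids]


# Toy data (made up for this sketch, not taken from the dataset)
train_brand = ["b7", "b3", "b7", "b9"]
test_brand = ["b3", "b42", "b9", "b100"]

vocab = build_vocab(train_brand)
encoded_test = encode(test_brand, vocab)
embedding_rows = len(vocab) + 1  # train IDs plus the single shared OOV row

print(encoded_test, embedding_rows)
```

With this scheme, every ID that is new in the test set shares one embedding row instead of each receiving its own untrained row.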

Taking brand as an example, collect the IDs in the two fields brand_his and brand (these two fields share one encoding). Reproduction code:

import pandas as pd

# For field descriptions, see the "生成最终训练和测试数据集" ("Generate the final training and test datasets") tab at https://aistudio.baidu.com/aistudio/projectdetail/1805731
train = pd.read_csv("work/train_sorted.csv", dtype=object)
train.fillna("0", inplace=True)
brand = train.iloc[:, 263].astype(int).values
brand_set = set(list(brand))
brand_his = train.iloc[:, 100:150].astype(int).values.flatten()
brand_his_set = set(list(brand_his))
brand_train = brand_set | brand_his_set
pd.DataFrame({"brand": sorted(list(brand_train))}).to_csv("train_brand.csv", index=False)

test = pd.read_csv("work/test.csv", dtype=object)
test.fillna("0", inplace=True)
brand = test.iloc[:, 263].astype(int).values
brand_set = set(list(brand))
brand_his = test.iloc[:, 100:150].astype(int).values.flatten()
brand_his_set = set(list(brand_his))
brand_test = brand_set | brand_his_set
pd.DataFrame({"brand": sorted(list(brand_test))}).to_csv("test_brand.csv", index=False)

print("Diff size:", len(brand_test - brand_train))
print(list(brand_test - brand_train)[0:50])

Output:

Diff size: 8048  # i.e., the test set contains 8048 new brand IDs that never appear in the train set, yet they were encoded and allocated embedding space
[163844, 32784, 360465, 65555, 360469, 426009, 26, 32795, 262171, 294941, 458783, 196646, 426022, 98345, 32814, 327727, 294971, 65599, 196672, 360511, 196682, 229457, 458846, 458851, 163940, 393317, 327783, 262250, 98414, 262255, 229489, 98422, 196727, 196729, 65665, 327809, 65667, 65676, 32909, 65678, 131215, 163983, 426126, 65682, 98457, 229539, 65700, 164008, 196776, 327848]

Running the same code on cate_id and cate_his gives:

Diff size: 101
[4096, 11264, 5639, 4105, 1547, 4107, 11275, 2066, 4116, 12314, 4124, 12323, 12324, 2597, 11813, 5677, 558, 2045, 50, 8243, 52, 3646, 3137, 11329, 68, 4164, 9292, 2665, 10351, 3185, 114, 7795, 4236, 3213, 2190, 4246, 12450, 12467, 2740, 8394, 3288, 3291, 9952, 9441, 2823, 11532, 10511, 10002, 3867, 12581]
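The same check can be repeated for any group of jointly encoded columns. Below is a small sketch of a reusable helper; `collect_ids` and `unseen_in_test` are hypothetical names, the rows are toy data rather than the real CSV files, and treating empty cells as 0 mirrors the `fillna("0")` step in the code above.

```python
from typing import Iterable, List, Sequence, Set


def collect_ids(rows: Iterable[Sequence[str]], columns: Sequence[int]) -> Set[int]:
    """Collect every integer ID found in the given column positions; empty cells count as 0."""
    ids: Set[int] = set()
    for row in rows:
        for col in columns:
            ids.add(int(row[col]) if row[col] != "" else 0)
    return ids


def unseen_in_test(
    train_rows: Iterable[Sequence[str]],
    test_rows: Iterable[Sequence[str]],
    columns: Sequence[int],
) -> List[int]:
    """Return the sorted IDs that occur in the test rows but never in the train rows."""
    return sorted(collect_ids(test_rows, columns) - collect_ids(train_rows, columns))


# Toy rows: columns 0-1 play the role of a history field, column 2 the target field.
train_rows = [["1", "2", "3"], ["2", "", "1"]]
test_rows = [["3", "5", "2"], ["", "1", "8"]]

missing = unseen_in_test(train_rows, test_rows, columns=[0, 1, 2])
print("Diff size:", len(missing))
print(missing)
```

For the real data, the column positions would be the same ones used in the `iloc` slices above, passed once per field group.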

zhujiem · Aug 04 '22 23:08

The test set may well contain IDs that do not appear in the train set; as I understand it, this should not conflict with the original paper.

wangzhen38 · Aug 10 '22 10:08