Which version of the movie dataset does the DeepMatch example use? Is more data available?

Hi, could you tell me where the data used in the code comes from? I'd like more data to try it out. Thanks.

Apr 15 '20 03:04 ucas010

please refer to #9

Apr 15 '20 05:04 shenweichen

Thanks a lot. One more question: why don't the sparse features selected in the script include genres and rating?

sparse_features = ["movie_id", "user_id",
                    "gender", "age", "occupation", "zip", ]

Is there a rationale behind this feature selection? Any pointers would be appreciated, thanks.

Apr 30 '20 03:04 ucas010

genres is a multi-valued feature, so it has not been added for now. rating is the label, so it is not used as a feature.

Apr 30 '20 06:04 shenweichen
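
To illustrate what "multi-valued" means here, a minimal sketch assuming genres are stored as a pipe-separated string, as in the MovieLens files. The helper name `encode_genres` and the maximum length are hypothetical. The point is that one movie yields a variable-length list of ids, which has to be padded like a sequence rather than encoded as a single index.

```python
def encode_genres(genre_strings, max_len=3):
    """Split pipe-separated genres and map each one to an id >= 1 (0 = padding)."""
    vocab = {}
    encoded = []
    for s in genre_strings:
        # New genres get the next free id; known genres reuse their id.
        ids = [vocab.setdefault(g, len(vocab) + 1) for g in s.split("|")]
        # Pad with zeros, then cut to a fixed length.
        encoded.append((ids + [0] * max_len)[:max_len])
    return encoded, vocab


genres, vocab = encode_genres(["Animation|Children's|Comedy", "Drama", "Comedy|Drama"])
```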

A question: why is +1 added after encoding?

for feature in features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature]) + 1
    feature_max_idx[feature] = data[feature].max() + 1

thx

May 04 '20 08:05 ucas010

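Also, is it appropriate for preprocess.py to treat unwatched movies as negative samples? A common approach is to use low-rated items as negatives.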

May 05 '20 06:05 ucas010

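Also, I don't quite follow how the dataset is built. For each user, why are the first 1 to len watched movies taken as the history, and is the rating of the current movie then used as the rating for that whole list of movies? I'm a bit confused.

train_set.append((reviewerID, hist[::-1], pos_list[i], 1,len(hist[::-1]),rating_list[i]))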

May 05 '20 08:05 ucas010

train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)

After reading the source of this function, I found that rating is not used at all, and I don't know why.

train_set.append((reviewerID, hist[::-1], pos_list[i], 1,len(hist[::-1]),rating_list[i]))
train_label = np.array([line[3] for line in train_set])

The label is 1 where there is an interaction and 0 where there is none; the rating is never used.

May 06 '20 01:05 ucasiggcas
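
To make the sample construction discussed above concrete, here is a minimal sketch, assuming a single user's chronologically ordered history. The function name `build_samples` and the toy data are hypothetical; only the tuple layout follows the `train_set.append(...)` line quoted above, and the train/test split and negative sampling are left out. Each prefix of the history becomes one sample whose target is the next movie. The label at index 3 is 1 for every observed interaction, so the rating at index 5 is carried along but never read when the labels are extracted.

```python
def build_samples(user_id, movie_ids, ratings):
    """Turn one user's time-ordered history into next-item samples.

    Tuple layout mirrors the quoted line:
    (user, reversed history, target movie, label, history length, rating).
    """
    samples = []
    for i in range(1, len(movie_ids)):
        hist = movie_ids[:i]  # everything watched before position i
        samples.append((user_id, hist[::-1], movie_ids[i], 1, len(hist), ratings[i]))
    return samples


# Hypothetical toy history: four movies watched in this order.
train_set = build_samples(user_id=7, movie_ids=[10, 20, 30, 40], ratings=[5, 3, 4, 2])

# Labels come from index 3 only; the rating at index 5 is ignored.
train_label = [line[3] for line in train_set]
```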

In this line of code,

input_from_feature_columns(user_features,user_feature_columns, l2_reg_embedding, init_std, seed,
                           embedding_matrix_dict=embedding_matrix_dict)

embedding_matrix_dict=embedding_matrix_dict seems unnecessary, doesn't it? The missing keyword parameter is a serious bug. In the input_from_feature_columns function,

##    embedding_matrix_dict = create_embedding_matrix(feature_columns, l2_reg, init_std, seed, prefix=prefix,
##                                                    seq_mask_zero=seq_mask_zero)

these lines should be removed and this keyword parameter added instead.

May 08 '20 06:05 ucasiggcas

What is the rationale for this loss? Why does y_true play no part in it? It looks a bit odd.

def sampledsoftmaxloss(y_true, y_pred):
    return K.mean(y_pred)

May 08 '20 10:05 ucasiggcas
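
One possible reading of this loss, offered as an assumption rather than a statement about the library's internals: if the final layer of the model already receives the target item and returns a per-example loss, then `y_pred` is that loss and the loss function only has to average it. The sketch below illustrates the pattern in plain Python with hypothetical names (`softmax_loss_layer`, `passthrough_loss`) and a full softmax in place of a sampled one.

```python
import math


def softmax_loss_layer(user_vec, item_vecs, target_idx):
    """Hypothetical final 'layer': returns the loss itself, not a prediction."""
    logits = [sum(u * v for u, v in zip(user_vec, item)) for item in item_vecs]
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_idx]  # negative log-likelihood of the target


def passthrough_loss(y_true, y_pred):
    """Same shape as the quoted loss: y_true is ignored, y_pred is averaged."""
    return sum(y_pred) / len(y_pred)


item_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
batch = [([2.0, 0.0], 0), ([0.0, 2.0], 1)]  # (user vector, index of the watched item)
per_example_loss = [softmax_loss_layer(u, item_vecs, t) for u, t in batch]
loss = passthrough_loss(y_true=None, y_pred=per_example_loss)
```

In this arrangement the label still influences training, but it enters through the layer's inputs rather than through the `y_true` argument.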

Why is there a hist_len feature among the user features? Is the length of the watched-movie history a feature too?

>>> build_input_features(user_feature_columns)
OrderedDict([('user_id', <tf.Tensor 'user_id:0' shape=(?, 1) dtype=int32>), ('gender', <tf.Tensor 'gender:0' shape=(?, 1) dtype=int32>), ('age', <tf.Tensor 'age:0' shape=(?, 1) dtype=int32>), ('occupation', <tf.Tensor 'occupation:0' shape=(?, 1) dtype=int32>), ('zip', <tf.Tensor 'zip:0' shape=(?, 1) dtype=int32>), ('hist_movie_id', <tf.Tensor 'hist_movie_id:0' shape=(?, 50) dtype=int32>), ('hist_len', <tf.Tensor 'hist_len:0' shape=(?, 1) dtype=int32>)])

Also, these features are all single-valued. What should be done with multi-label or multi-valued features?

May 09 '20 08:05 ucasiggcas

Is there something wrong with the recall computation? I printed the values and they are all 0 or 1, as follows: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0.....] Taking the mean gives 0.27. I then looked at the function definition:

def recall_N(y_true, y_pred, N=50):
    return len(set(y_pred[:N]) & set(y_true)) * 1.0 / len(y_true)

So this should be a class-level recall, right?

May 10 '20 15:05 ucas010
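
A minimal sketch of why the values can only be 0 or 1, assuming each user has exactly one held-out item (which would be consistent with the printed list): the numerator is 1 if that item appears in the top N and 0 otherwise, and the denominator is 1. The mean over users is then a hit rate. The item ids below are made up.

```python
def recall_N(y_true, y_pred, N=50):
    # Fraction of the true items that appear among the first N predictions.
    return len(set(y_pred[:N]) & set(y_true)) * 1.0 / len(y_true)


# Hypothetical data: one held-out movie per user, plus a ranked candidate list.
held_out = {"u1": [42], "u2": [7], "u3": [13]}
ranked = {"u1": [5, 42, 9], "u2": [1, 2, 3], "u3": [13, 8, 6]}

per_user = [recall_N(held_out[u], ranked[u], N=3) for u in held_out]
mean_recall = sum(per_user) / len(per_user)

# With several true items per user the value is no longer restricted to 0 or 1.
partial = recall_N([1, 2, 3, 4], [1, 2, 99, 98], N=4)
```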

A question: are the computed user and item vectors in the same order as the lbe encoding? Can they be matched up?

Jun 08 '20 12:06 ucasiggcas

A question: why is +1 added after encoding?

for feature in features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature]) + 1
    feature_max_idx[feature] = data[feature].max() + 1

thx

0 is reserved for the mask.

Dec 16 '20 07:12 zanshuxun
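
On the +1 question, a minimal sketch of why the offset matters, assuming 0 is the padding value as in the reply above. The helper `encode_with_offset` is hypothetical and only imitates what `LabelEncoder().fit_transform(...) + 1` produces (sorted unique values mapped to 1, 2, 3, ...); it is not the scikit-learn implementation.

```python
def encode_with_offset(values, offset=1):
    """Map each distinct value to a contiguous id starting at `offset`."""
    classes = sorted(set(values))
    index = {v: i + offset for i, v in enumerate(classes)}
    return [index[v] for v in values], classes


raw_movie_ids = [3952, 17, 260, 17]
encoded, classes = encode_with_offset(raw_movie_ids)

# Vocabulary size must cover ids 0..max, hence max + 1 (0 is the padding slot).
vocab_size = max(encoded) + 1

# Pad a short history to length 5; zeros can only mean "no item here"
# because no real movie was assigned id 0.
history = encoded[:3]
padded = history + [0] * (5 - len(history))
is_real = [item_id != 0 for item_id in padded]
```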

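A common approach is to use low-rated items as negatives.

Are you sure the recall (retrieval) stage uses low-rated items as negative samples?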

Dec 16 '20 07:12 zanshuxun

In this line of code,
input_from_feature_columns(user_features,user_feature_columns, l2_reg_embedding, init_std, seed,
                           embedding_matrix_dict=embedding_matrix_dict)
embedding_matrix_dict=embedding_matrix_dict seems unnecessary, doesn't it? The missing keyword parameter is a serious bug. In the input_from_feature_columns function,
##    embedding_matrix_dict = create_embedding_matrix(feature_columns, l2_reg, init_std, seed, prefix=prefix,
##                                                    seq_mask_zero=seq_mask_zero)
these lines should be removed and this keyword parameter added instead.

Doesn't input_from_feature_columns already have an embedding_matrix_dict parameter? Why should those two lines be removed?

Dec 16 '20 07:12 zanshuxun

Why is there a hist_len feature among the user features? Is the length of the watched-movie history a feature too?

>>> build_input_features(user_feature_columns)
OrderedDict([('user_id', <tf.Tensor 'user_id:0' shape=(?, 1) dtype=int32>), ('gender', <tf.Tensor 'gender:0' shape=(?, 1) dtype=int32>), ('age', <tf.Tensor 'age:0' shape=(?, 1) dtype=int32>), ('occupation', <tf.Tensor 'occupation:0' shape=(?, 1) dtype=int32>), ('zip', <tf.Tensor 'zip:0' shape=(?, 1) dtype=int32>), ('hist_movie_id', <tf.Tensor 'hist_movie_id:0' shape=(?, 50) dtype=int32>), ('hist_len', <tf.Tensor 'hist_len:0' shape=(?, 1) dtype=int32>)])

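Also, these features are all single-valued. What should be done with multi-label or multi-valued features?

hist_len is used later for attention.

Are you sure "these features are all single-valued"? Take a look at hist_movie_id.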

Dec 16 '20 07:12 zanshuxun
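
As a sketch of how a length feature can be used together with a padded, multi-valued feature such as `hist_movie_id`: the length tells a pooling or attention step which positions are real and which are padding. The function below is a hypothetical mean-pooling stand-in written in plain Python, not the layer the library uses, and it assumes the padding sits at the end of the sequence.

```python
def masked_mean(seq_embeddings, hist_len):
    """Average only the first `hist_len` vectors; the rest are padding."""
    valid = seq_embeddings[:hist_len]
    dim = len(seq_embeddings[0])
    return [sum(vec[d] for vec in valid) / hist_len for d in range(dim)]


# Hypothetical embeddings for a history padded to length 4 (last two are padding).
seq_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
pooled = masked_mean(seq_embeddings, hist_len=2)

# Ignoring the true length would dilute the result with padding vectors.
naive = masked_mean(seq_embeddings, hist_len=4)
```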

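genres is a multi-valued feature, so it has not been added for now. rating is the label, so it is not used as a feature.

age is numeric in itself; even though it has been bucketed, why run it through LabelEncoder? Wouldn't using it directly be fine? Or is it because many people are actually missing the age field, so missing values have to be treated as a class of their own?

For reference: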

Age is chosen from the following ranges:
- 1: "Under 18"
- 18: "18-24"
- 25: "25-34"
- 35: "35-44"
- 45: "45-49"
- 50: "50-55"
- 56: "56+"

I then replaced SparseFeat("age", feature_max_idx['age'], embedding_dim) with DenseFeat("age", 1) and found that the loss went up o(╯□╰)o

Feb 04 '21 09:02 shuDaoNan9
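
One way to look at the age question, as a sketch rather than an explanation of the observed loss: the raw age values are bucket codes, so the gaps between them (1, 18, 25, ...) do not measure anything, and an embedding lookup needs contiguous indices. The mapping below is a hypothetical illustration of what label encoding with a +1 offset yields for the codes listed above.

```python
AGE_BUCKETS = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
               45: "45-49", 50: "50-55", 56: "56+"}

# Contiguous ids 1..7, leaving 0 free, in sorted order of the bucket codes.
age_to_index = {code: i + 1 for i, code in enumerate(sorted(AGE_BUCKETS))}

ages = [25, 1, 56, 25]
age_index = [age_to_index[a] for a in ages]

# An embedding table for this feature needs one row per index, including 0.
num_rows = max(age_to_index.values()) + 1
```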

mask

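That would be this line in preprocess.py, right? train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0) Only when I saw the zero padding here did I understand the earlier +1. I still don't know what a mask is, but padding with 0 gives equal-length sequences, and that's good enough O(∩_∩)O haha~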

Feb 26 '21 01:02 shuDaoNan9

Still, I can't help feeling that padding zeros at the front (pre) would fit the Chinese reading order better o(╯□╰)o

Feb 26 '21 02:02 shuDaoNan9
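
For comparison, a minimal sketch of the two padding modes. `pad` is a hypothetical plain-Python stand-in for the `padding` argument of `pad_sequences`, with the `truncating` option left out.

```python
def pad(seq, maxlen, padding="post", value=0):
    """Pad `seq` with `value` up to `maxlen`, at the end ('post') or the start ('pre')."""
    if len(seq) >= maxlen:
        # Sequences that are already long enough are simply cut at maxlen.
        return list(seq[:maxlen])
    filler = [value] * (maxlen - len(seq))
    return list(seq) + filler if padding == "post" else filler + list(seq)


history = [3, 1, 2]
post_padded = pad(history, maxlen=5, padding="post")
pre_padded = pad(history, maxlen=5, padding="pre")
```

A mask that marks the first hist_len positions as real lines up with the 'post' layout, which may be one reason to pad at the end.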

I don't quite understand this part either. Did the original poster figure it out? Why is rating not used directly?

Dec 10 '21 06:12 Bradyzzhang
