Using DeepFMEstimator on a physical GPU machine makes training slower, not faster (40 minutes per epoch); the training data is about 35 GB

Open whk6688 opened this issue 4 years ago • 6 comments

import pandas as pd
import tensorflow as tf
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from deepctr.estimator import DeepFMEstimator
from deepctr.estimator.inputs import input_fn_pandas

train_df = pd.read_parquet("parquet_train_20201219_110000", engine='pyarrow')
test_df = pd.read_parquet("parquet_test_20201220_110000", engine='pyarrow')
df = pd.concat([train_df, test_df], axis=0)

train_size = len(train_df)
test_size = len(test_df)

target = ['label']
dense_features = ["c1", "c2", "c3", "c4", "c5"]
sparse_features = [x for x in df.columns if x not in dense_features + target]

# Label-encode every sparse feature, then scale the dense features to [0, 1].
for feat in sparse_features:
    lbe = LabelEncoder()
    df[feat] = lbe.fit_transform(df[feat])
mms = MinMaxScaler(feature_range=(0, 1))

df[sparse_features] = df[sparse_features].fillna('-1')
df[dense_features] = mms.fit_transform(df[dense_features])

dnn_feature_columns = []
linear_feature_columns = []

# Sparse features: 4-dimensional embeddings for the DNN part, identity columns for the linear part.
for i, feat in enumerate(sparse_features):
    dnn_feature_columns.append(tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_identity(feat, df[feat].nunique()), 4))
    linear_feature_columns.append(
        tf.feature_column.categorical_column_with_identity(feat, df[feat].nunique()))
for feat in dense_features:
    dnn_feature_columns.append(tf.feature_column.numeric_column(feat))
    linear_feature_columns.append(tf.feature_column.numeric_column(feat))

train = df[0:train_size]
test = df[train_size:]

train_model_input = input_fn_pandas(train, sparse_features + dense_features, 'label', shuffle=True)
test_model_input = input_fn_pandas(test, sparse_features + dense_features, None, shuffle=False)

model = DeepFMEstimator(linear_feature_columns, dnn_feature_columns, task='binary')

model.train(train_model_input)
pred_ans_iter = model.predict(test_model_input)
pred_ans = list(map(lambda x: x['pred'], pred_ans_iter))
print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))

Are there any other settings I should use? I also tried running on multiple GPUs, and it was not much faster either. GPU utilization is 20%.

whk6688, Dec 31 '20 15:12

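I ran into this problem too. After some investigation, the cause turned out to be feature columns: using feature columns makes data I/O very slow. A look at it with TensorFlow Profiler makes this clear.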
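As a rough first check before opening the full profiler, one can time how long the training loop waits for each batch versus how long it spends in the step itself. The sketch below uses only the Python standard library; `time_input_vs_step`, `slow_batches`, and `fake_step` are hypothetical stand-ins for illustration and are not part of DeepCTR or TensorFlow.

```python
import time


def time_input_vs_step(batches, step_fn):
    """Return (seconds waiting for batches, seconds inside step_fn, batch count)."""
    input_s = 0.0
    step_s = 0.0
    n = 0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent producing or loading the batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent in the (stand-in) training step
        t2 = time.perf_counter()
        input_s += t1 - t0
        step_s += t2 - t1
        n += 1
    return input_s, step_s, n


def slow_batches(num_batches, delay_s):
    """Hypothetical input pipeline that takes delay_s seconds per batch."""
    for i in range(num_batches):
        time.sleep(delay_s)
        yield [i] * 4


def fake_step(batch):
    """Hypothetical training step that is cheap compared with the input."""
    return sum(batch)


input_s, step_s, n = time_input_vs_step(slow_batches(5, 0.01), fake_step)
print(f"batches={n} input={input_s:.3f}s step={step_s:.3f}s")
```

If the input time dominates, as it does in this toy setup, the accelerator is mostly idle while it waits for data, which would be consistent with low GPU utilization.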

timajia, Mar 08 '21 14:03

Is there any way to solve it, then?

yeqing97, Mar 09 '21 02:03

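Solutions

  • Process the data some other way, for example with Hive SQL, Spark, or Flink, so that the TensorFlow side only receives the data and does not process it
  • Implement it yourself with dataset; feature columns are merely general-purpose, and their performance is poor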
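The first option can be sketched in miniature: encode the categorical values to integer ids once, offline, and persist the result, so that the training job only reads ready-made integers. The example below is a standard-library illustration under assumed column names (`city`, `device`, `label`); `build_vocab` and `encode_rows` are hypothetical helpers, and a real pipeline would run the same idea in Hive, Spark, or Flink.

```python
import csv
import io


def build_vocab(rows, column):
    """Map each distinct value of `column` to a stable integer id (0, 1, 2, ...)."""
    vocab = {}
    for row in rows:
        vocab.setdefault(row[column], len(vocab))
    return vocab


def encode_rows(rows, vocabs):
    """Replace categorical strings with their integer ids; other columns pass through."""
    return [
        {col: (vocabs[col][val] if col in vocabs else val) for col, val in row.items()}
        for row in rows
    ]


# Toy raw data with assumed column names.
raw = [
    {"city": "bj", "device": "ios", "label": "1"},
    {"city": "sh", "device": "android", "label": "0"},
    {"city": "bj", "device": "android", "label": "0"},
]

vocabs = {col: build_vocab(raw, col) for col in ("city", "device")}
encoded = encode_rows(raw, vocabs)

# Persist the encoded table once; the training job would only read this output.
buffer = io.StringIO()  # stands in for a real file on disk
writer = csv.DictWriter(buffer, fieldnames=["city", "device", "label"])
writer.writeheader()
writer.writerows(encoded)
```

The vocabulary sizes (`len(vocabs[col])`) are also known at this point, so they do not have to be recomputed inside the training script.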

timajia, Mar 09 '21 03:03

Hello! Could you explain what implementing it with dataset means in practice? Do I need to modify the model myself?

yeqing97, Mar 10 '21 11:03

Hello, does the training process print the loss for each batch? Mine prints nothing at all and just silently writes model files.

1980695671, May 08 '21 09:05

It does. Check whether it is a problem with your verbose setting.
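If "verbose" here means log verbosity, one possible cause is that the logger level hides INFO messages. A minimal sketch, assuming the Estimator reports its progress through the standard Python logger named `tensorflow` (the logger that `tf.get_logger()` returns):

```python
import logging

# Assumption: Estimator progress messages (such as the periodic loss)
# go through the standard Python logger named "tensorflow".
tf_logger = logging.getLogger("tensorflow")
tf_logger.setLevel(logging.INFO)

# Sanity check: INFO-level messages are now allowed through this logger.
info_enabled = tf_logger.isEnabledFor(logging.INFO)
print("INFO logging enabled:", info_enabled)
```

With TensorFlow imported, `tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)` should have the same effect.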

yeqing97, May 10 '21 01:05