cail2019_track2
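How do I get the program to run on the GPU?
How do I get the program to run on the GPU? The server is a Tencent Cloud GPU instance and I am using your commands, but no matter what I try it always runs on the CPU. Do I need to install any additional modules?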
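Hello, first check whether your TensorFlow installation can use the GPU at all. If it can't, the problem is most likely in the TensorFlow installation itself, for example a CUDA version that does not match the TensorFlow build.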
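One way to run the check suggested above is to ask TensorFlow which devices it can see. This is a minimal sketch, not code from this repository: the helper name `visible_gpu_names` is made up for illustration, and it relies only on `device_lib.list_local_devices()`, which exists in both TensorFlow 1.x and 2.x.

```python
def visible_gpu_names():
    """Return the GPU device names TensorFlow can see.

    Returns None when TensorFlow cannot be imported at all.
    (Hypothetical helper, named here for illustration only.)
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Each entry describes one local device; keep only the GPUs.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


gpus = visible_gpu_names()
if gpus is None:
    print("TensorFlow is not importable in this environment.")
elif gpus:
    print("TensorFlow sees these GPUs:", gpus)
else:
    print("No GPU visible to TensorFlow; check that a GPU-enabled build is "
          "installed and that the CUDA/cuDNN versions match it.")
```

If the list is empty while `nvidia-smi` shows the card, the usual suspects are a CPU-only TensorFlow package or a CUDA/cuDNN version the installed build was not compiled against.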
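After a lot of fiddling, I switched to a different system image on Tencent Cloud, and it now seems to be running on the GPU. How long does this training usually take? The log is below: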
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bidirectional_rnn/fw/basic_lstm_cell/kernel:0, shape = (968, 800)
INFO:tensorflow: name = bidirectional_rnn/fw/basic_lstm_cell/bias:0, shape = (800,)
INFO:tensorflow: name = bidirectional_rnn/bw/basic_lstm_cell/kernel:0, shape = (968, 800)
INFO:tensorflow: name = bidirectional_rnn/bw/basic_lstm_cell/bias:0, shape = (800,)
INFO:tensorflow: name = u_omega:0, shape = (1168,)
INFO:tensorflow: name = output_weights:0, shape = (20, 1168)
INFO:tensorflow: name = output_bias:0, shape = (20,)
WARNING:tensorflow:From /usr/local/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-07-09 09:42:55.471208: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 09:42:55.649475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 09:42:55.650252: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x8a58eb0 executing computations on platform CUDA. Devices:
2020-07-09 09:42:55.650293: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-07-09 09:42:55.663763: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-07-09 09:42:55.664574: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x9a2a320 executing computations on platform Host. Devices:
2020-07-09 09:42:55.664608: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2020-07-09 09:42:55.665581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-07-09 09:42:55.665603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-09 09:42:55.666469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-09 09:42:55.666486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-07-09 09:42:55.666493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-07-09 09:42:55.666986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ckpt/divorce/model.ckpt.
2020-07-09 09:43:22.977919: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
INFO:tensorflow:global_step/sec: 2.03341
INFO:tensorflow:examples/sec: 65.0693
INFO:tensorflow:global_step/sec: 2.27393
INFO:tensorflow:examples/sec: 72.7656
INFO:tensorflow:global_step/sec: 2.27549
INFO:tensorflow:examples/sec: 72.8157
INFO:tensorflow:global_step/sec: 2.2709
INFO:tensorflow:examples/sec: 72.6686
INFO:tensorflow:global_step/sec: 2.27153
INFO:tensorflow:examples/sec: 72.6891
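It depends on the number of epochs you set and on the server's performance. Your log shows roughly 2.3 steps per second (global_step/sec), so you can work out the approximate total time yourself.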
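The estimate can be sketched as follows. The throughput figures come from the log above; the dataset size and epoch count are placeholders, not values from this thread, and rounding the step count up is an assumption (the training script may round differently).

```python
import math


def estimate_training_time(num_examples, epochs, batch_size, steps_per_sec):
    """Return (total_steps, seconds) for a run at a constant step rate."""
    total_steps = math.ceil(num_examples * epochs / batch_size)
    return total_steps, total_steps / steps_per_sec


# examples/sec divided by global_step/sec gives the batch size used in the log.
batch_size = round(72.7656 / 2.27393)  # 32

# Placeholder values: substitute your own dataset size and epoch count.
steps, seconds = estimate_training_time(
    num_examples=10000, epochs=3, batch_size=batch_size, steps_per_sec=2.27)
print("batch size:", batch_size)
print("total steps:", steps)
print("estimated minutes: {:.1f}".format(seconds / 60))
```

The estimate ignores checkpoint saving and evaluation pauses, so the real wall-clock time will be somewhat longer.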
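I learned something new. When I used PyTorch before, I never paid attention to steps.
This is TensorFlow's Estimator style of training. If you rewrote it to use a session, I think it would be more flexible, and you could print training logs and progress the way you do in torch, but it is somewhat more complicated.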
After training, the accuracy is 0.83. Were the training set and the test set perhaps used together for training? Log:
99%|█████████▉| 249/252 [01:36<00:01, 2.76it/s]
99%|█████████▉| 250/252 [01:36<00:00, 2.77it/s]
100%|█████████▉| 251/252 [01:37<00:00, 2.76it/s]
100%|██████████| 252/252 [01:37<00:00, 2.76it/s]
100%|██████████| 252/252 [01:37<00:00, 2.58it/s]
INFO:root:Model prediction finished
INFO:root:Per-class F-scores:
INFO:root:{'1': 0.96, '2': 0.92, '3': 0.91, '4': 0.93, '5': 0.91, '6': 0.93, '7': 0.93, '8': 0.97, '9': 0.98, '10': 0.88, '11': 0.84, '12': 0.25, '13': 0.83, '14': 0.47, '15': 0.82, '16': 0.79, '17': 0.72, '18': 0.03, '19': 0.32, '20': 0.64}
INFO:root:Overall score: 0.8298107041994647
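There may be some overlap, since the organizers never released the test data. It could also be because you used the divorce dataset, which scores somewhat better than the other two.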
It is probably because the training set (divorce) is rather small. The training set I see here is only 1.93M, whereas the official training set I downloaded earlier is 6.08M. Also, how do I set up training on multiple GPUs?
After switching to the 6.08M data (divorce), the accuracy dropped by only about 0.02 (from 0.83 to 0.81). Amazing... Is this reasonable? I could not reproduce your accuracy of 0.73. The log is below:
98%|█████████▊| 247/252 [01:35<00:01, 2.76it/s]
98%|█████████▊| 248/252 [01:36<00:01, 2.75it/s]
99%|█████████▉| 249/252 [01:36<00:01, 2.75it/s]
99%|█████████▉| 250/252 [01:37<00:00, 2.74it/s]
100%|█████████▉| 251/252 [01:37<00:00, 2.73it/s]
100%|██████████| 252/252 [01:37<00:00, 2.73it/s]
INFO:root:Model prediction finished
INFO:root:Per-class F-scores:
INFO:root:{'1': 0.95, '2': 0.91, '3': 0.91, '4': 0.94, '5': 0.9, '6': 0.92, '7': 0.92, '8': 0.96, '9': 0.98, '10': 0.87, '11': 0.84, '12': 0.21, '13': 0.8, '14': 0.31, '15': 0.81, '16': 0.77, '17': 0.64, '18': 0.0, '19': 0.2, '20': 0.61}
INFO:root:Overall score: 0.811298028037752
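The score difference is probably down to the file name of my training data. I will train a few more times and see.
Two questions: (1) How large was the training data you used? I would like to reproduce your result. (2) How do I set up training on multiple GPUs?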
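The training data I used is exactly the data I shared, all of it. As for multiple GPUs, the BERT team's code cannot do that as it stands; you would need to change the optimizer part or use Horovod. Alternatively, switch to PyTorch, where multi-GPU training is very convenient...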
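Hello, I trained the loan model with your data and the accuracy is 0.61. Is that within a reasonable range? It looks like every tag after the 10th has an F-score of 0.0. Does that mean those tags were never matched?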
'11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0
Log:
92%|█████████▏| 47/51 [01:11<00:05, 1.45s/it]
94%|█████████▍| 48/51 [01:13<00:04, 1.44s/it]
96%|█████████▌| 49/51 [01:14<00:02, 1.45s/it]
98%|█████████▊| 50/51 [01:15<00:01, 1.45s/it]
100%|██████████| 51/51 [01:17<00:00, 1.45s/it]
INFO:root:Model prediction finished
INFO:root:Per-class F-scores:
INFO:root:{'1': 0.87, '2': 0.81, '3': 0.8, '4': 0.76, '5': 0.8, '6': 0.6, '7': 0.86, '8': 0.96, '9': 0.82, '10': 0.89, '11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0}
INFO:root:Overall score: 0.6121632632937084
The code in evaluation.py is as follows:
# load_file, BERTModel and evaluate come from the rest of evaluation.py (not shown here).
if __name__ == '__main__':
    task = "loan"
    # Pass in the held-out test split here. Since this is tidied-up code for a
    # quick test, it simply loads the training data set.
    sentences, labels = load_file("data/loan/data_small_selected.json")
    # sentences, labels = load_file("my_test_data.json")
    logging.info("Loading the BERT model")
    model_1 = BERTModel(task=task, pb_model="pb/loan/model.pb",
                        tagDir="data/loan/tags.txt", threshold=[0.5] * 20,
                        vocab_file="chinese_L-12_H-768_A-12/vocab.txt")
    logging.info("BERT model loaded, starting prediction!!!\n")
    logging.info("Model prediction started\n")
    predicts_1 = model_1.getAllResult(sentences)
    print(predicts_1)
    logging.info("Results:\n")
    logging.info(predicts_1)
    logging.info("Model prediction finished\n")
    logging.info("Per-class F-scores:\n")
    score_1, f1_1 = evaluate(predict_labels=predicts_1, target_labels=labels,
                             tag_dir="data/loan/tags.txt")
    logging.info(f1_1)
    logging.info("Overall score: {}".format(score_1))
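To see how a tag can end up with an F-score of exactly 0.0, here is a minimal sketch of per-tag F1 for multi-label predictions. It is not the repository's `evaluate` function: `per_tag_f1` and the toy label sets are invented for illustration, and the remark about the overall score assumes it averages micro-F1 and macro-F1, which this thread does not confirm.

```python
def per_tag_f1(predicted, target, tags):
    """Per-tag F1, where predicted/target are lists of label sets.

    Hypothetical helper for illustration; not the repo's evaluate().
    """
    scores = {}
    for tag in tags:
        pairs = list(zip(predicted, target))
        tp = sum(1 for p, t in pairs if tag in p and tag in t)
        fp = sum(1 for p, t in pairs if tag in p and tag not in t)
        fn = sum(1 for p, t in pairs if tag not in p and tag in t)
        denom = 2 * tp + fp + fn
        # With no true positives the score is 0.0, whether the tag was
        # never predicted or never appears in the targets.
        scores[tag] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: tag "11" occurs in the targets but is never predicted.
target = [{"1"}, {"1", "11"}, {"11"}]
predicted = [{"1"}, {"1"}, set()]
scores = per_tag_f1(predicted, target, ["1", "11"])
print(scores)

# Macro average of the per-tag F values copied from the loan log above.
loan_f = [0.87, 0.81, 0.8, 0.76, 0.8, 0.6, 0.86, 0.96, 0.82, 0.89] + [0.0] * 10
macro_f1 = sum(loan_f) / len(loan_f)
print("macro F1: {:.4f}".format(macro_f1))
```

Ten tags at 0.0 pull the macro average down to about 0.41 even though tags 1 to 10 score well, which would explain an overall score near 0.61 if the score is a blend of micro and macro F1.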
My failure to reproduce your result was most likely because I was testing on the wrong data. After switching to data_small_selected.json, I ran it twice and got 0.71, which is very close, though I never reproduced 0.73.
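The test data is not even the same... My score is the one from the official site's test set, and I did not publish some of the trick code on GitHub; the README only describes it.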
One more thing I would like to ask about is the loan result above, where tags 11 to 20 all have an F-score of 0.0.
My QQ is 2648759823. What is yours?
Is this normal model output? I mean the 0.0 F-scores for tags 11 to 20 in the loan run.
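Hello, I am building a demo that extracts key elements from news and articles, and I am stuck at this point. Could we chat on QQ for a bit?
Hello, could we chat on QQ? I would like to ask you a few things. My QQ is 2648759823.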