continuous_evaluation
CE model alignment
Add multi-GPU support to the CE models; the Model CE multi-GPU speedup-ratio metric is still to be verified.
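The speedup ratio mentioned above could be computed roughly as in the sketch below. This is not the CE implementation: the function name `speedup_ratio` and the sample numbers are hypothetical, and the sketch assumes that speedup is defined as single-GPU duration divided by multi-GPU duration for the same amount of training work.

```python
def speedup_ratio(single_gpu_duration, multi_gpu_duration, num_gpus):
    """Return (speedup, scaling_efficiency) for one model.

    Assumption: both durations are wall-clock seconds for the same amount
    of training work (same data, same number of passes).
    """
    if single_gpu_duration <= 0 or multi_gpu_duration <= 0:
        raise ValueError("durations must be positive")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    speedup = single_gpu_duration / multi_gpu_duration
    # Efficiency of 1.0 would mean perfectly linear scaling.
    efficiency = speedup / num_gpus
    return speedup, efficiency


# Hypothetical numbers, for illustration only.
speedup, efficiency = speedup_ratio(120.0, 40.0, num_gpus=4)
print("speedup: %.2fx, efficiency: %.0f%%" % (speedup, efficiency * 100))
```

Reporting efficiency next to the raw speedup makes runs with different card counts comparable.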
The models in CE have been reviewed (see the table at the end).
The models are:
image_classification, vgg16, mnist, object_detection, resnet30, resnet50,
seq2seq, sequence_tagging_for_ner, text_classification, transformer, language_model, lstm
The items to add and align are:
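- Switch every model to run on multiple GPUs (4 cards). (Later I will move the card selection outside the model scripts, so that each model runs once on a single GPU and once on multiple GPUs.)
- Each model's evaluation metrics must include these four values: acc/ppl, cost, mem, and duration.
- Currently only the diff of the four metrics above is monitored. I have observed two unexpected cases: (1) the run is very short and acc is very low (0.1); (2) the run lasts many passes and acc is still very low (0.1, meaning the model itself has a problem). As a temporary measure, we lengthen the runs that have too few passes (to about 30 min) and tune acc to 0.5 or above for every model. (Later I will add an alert on an acc baseline threshold.)
- Use already-downloaded datasets everywhere (rather than downloading them on every run), stored in the default directory /root/.cache/paddle/dataset.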
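The checks described in the list above (all four metrics present, a diff against a baseline, and an accuracy-floor alert) might look like the sketch below. Everything in it is an assumption made for illustration: the names `check_kpis`, `ACC_FLOOR`, and `DIFF_TOLERANCE`, the 5% tolerance, and the dictionary layout are hypothetical and are not the CE framework's API. Only the 0.5 accuracy floor comes from the text.

```python
ACC_FLOOR = 0.5          # from the text: accuracy should be tuned to 0.5 or above
DIFF_TOLERANCE = 0.05    # hypothetical: allowed relative change versus baseline


def check_kpis(current, baseline, tolerance=DIFF_TOLERANCE, acc_floor=ACC_FLOOR):
    """Return a list of alert strings for one model's KPI record.

    `current` and `baseline` are plain dicts such as
    {"acc": 0.59, "cost": 0.71, "mem": 1024.0, "duration": 18.5};
    a language model would report "ppl" instead of "acc".
    """
    alerts = []

    # 1. Quality metric: exactly one of acc / ppl is expected.
    quality = [k for k in ("acc", "ppl") if k in current]
    if len(quality) != 1:
        alerts.append("expected exactly one of acc/ppl, got %s" % quality)

    # 2. The other three metrics must be present.
    for name in ("cost", "mem", "duration"):
        if name not in current:
            alerts.append("missing metric: %s" % name)

    # 3. Relative diff against the baseline for metrics present in both.
    for name, value in sorted(current.items()):
        base = baseline.get(name)
        if base is None or base == 0:
            continue
        rel = abs(value - base) / abs(base)
        if rel > tolerance:
            alerts.append("%s changed by %.1f%% (baseline %s, now %s)"
                          % (name, rel * 100, base, value))

    # 4. Absolute floor: a stable but very low accuracy passes the diff check
    #    unnoticed, which is the failure mode described in the text.
    if "acc" in current and current["acc"] < acc_floor:
        alerts.append("acc %.3f is below floor %.2f" % (current["acc"], acc_floor))

    return alerts


# Hypothetical record: accuracy is stable at 0.1, so only the floor check fires.
baseline = {"acc": 0.10, "cost": 2.30, "mem": 1024.0, "duration": 60.0}
current = {"acc": 0.10, "cost": 2.30, "mem": 1024.0, "duration": 60.0}
for alert in check_kpis(current, baseline):
    print(alert)
```

The floor check is kept separate from the diff check on purpose: a model whose accuracy has always been 0.1 shows no diff at all, so only an absolute threshold can flag it.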
Model | Dataset | Passes | Current status | Evaluation metrics | Parameters |
---|---|---|---|---|---|
Lstm, movie reviews, Layers: words DynamicRNN | paddle.dataset.imdb as imdb http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz | 1 | Pass = 0, Iter = 49, Loss = 0.713064, Accuracy = 0.593750; memory is sampled with: nvidia-smi --id=%s --query-compute-apps=used_memory --format=csv -lms 1 > memory.txt | imdb_32_train_speed, imdb_32_gpu_memory | batch_size: 32, device: GPU, emb_dim: 512, gpu_id: 0, hidden_dim: 512, iterations: 50, skip_batch_num: 5 |
object_detection | pascalvoc and coco; the datasets are expected under /data/ but are not there | 2 | IOError: [Errno 2] No such file or directory: '/data/pascalvoc/label_list'; the data needs to be placed under /data | train_cost_kpi, train_speed_kpi | batch_size: 64, is_toy: 0, iterations: 120, learning_rate: 0.001, num_passes: 2, parallel: True, use_gpu: True |
Resnet50 | Flowers, cifar http://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz | 29 (does not converge) | Pass:2, Loss:3.229035, Train Accuray:0.247656, Test Accuray:0.176471, Handle Images Duration: 63.949636; Pass:29, Loss:0.026319, Train Accuray:0.993359, Test Accuray:0.559400, Handle Images Duration: 22.501337 | cifar10_128_train_acc_kpi, cifar10_128_train_speed_kpi, cifar10_128_gpu_memory_kpi, flowers_64_train_speed_kpi, flowers_64_gpu_memory_kpi; a thread is started to collect mem info, but acc etc. are not evaluated | batch_size: 64, data_format: NCHW, data_set: flowers, device: GPU, infer_only: False, iterations: 80, model: resnet_imagenet, pass_num: 3, skip_batch_num: 5 |
language_model | /root/.cache/paddle/dataset/imikolov/simple-examples.tgz | | ppl:61.667 time_cost(s):18.544248 | | |
sequence_tagging_for_ner | http://cs224d.stanford.edu/assignment2/assignment2.zip | 22 | download data error! OK after adding the data directory. [TestSet] pass_id:2200 (pass num increases by 100 each time) pass_precision:[0.18181819] pass_recall:[0.125] pass_f1_score:[0.14814815] | train_acc_kpi, pass_duration_kpi | |
text_classification | Imdb http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz | 14 | avg_acc: 0.999800, avg_cost: 0.002255 | | |
Vgg16 | flowers/imagelabels.mat http://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat | 1 | cifar10 Pass: 1, Loss: 1.810090, Train Accuray: 0.234375; Pass: 49, Loss: 3.561218, Train Accuray: 0.093750 | cifar10_128_train_speed_kpi, cifar10_128_gpu_memory_kpi, flowers_32_train_speed_kpi, flowers_32_gpu_memory_kpi; a thread is started to collect mem info, but acc etc. are not evaluated | |
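
The nvidia-smi command in the table writes memory samples to memory.txt. One way to reduce that log to a single mem value is sketched below. This is an illustration, not the script CE uses: the helper names `peak_memory_mib` and `peak_memory_from_file` are hypothetical, and the parser assumes each sample line has the form "<integer> MiB", skipping header lines and anything else.

```python
import re

# Assumed sample format: "<integer> MiB"; header lines and anything else are skipped.
_SAMPLE = re.compile(r"^\s*(\d+)\s*MiB\s*$")


def peak_memory_mib(lines):
    """Return the largest memory sample in MiB, or None if there are no samples."""
    peak = None
    for line in lines:
        match = _SAMPLE.match(line)
        if match is None:
            continue  # header, blank line, or any non-numeric entry
        value = int(match.group(1))
        if peak is None or value > peak:
            peak = value
    return peak


def peak_memory_from_file(path="memory.txt"):
    """Read the log written by the nvidia-smi command and return the peak in MiB."""
    with open(path) as log:
        return peak_memory_mib(log)


# Hypothetical log content, for illustration only.
sample_log = ["used_gpu_memory [MiB]", "305 MiB", "1187 MiB", "", "1179 MiB"]
print(peak_memory_mib(sample_log))
```

Taking the peak rather than the mean is a design choice: the first samples are taken while the model is still loading and would pull an average down.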