PaddleScience
【Hackathon 8th No.13】Domino paper reproduction
PR types
New Features
PR changes
Others
Describe
support domino
Thanks for your contribution!
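There are two problems with reproducing Domino:
- So far only the model itself has been reproduced. The preprocessing and postprocessing parts need very high CPU and memory resources and cannot run on AI Studio.
- No official pretrained weights are provided.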
Please submit an RFC design document first.
Sorry, the task description here is wrong; it should be changed to 【inference】 and 【training】.
@wangguan1995 You can open a PR to fix the task description: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_8th/%E3%80%90Hackathon_8th%E3%80%91%E4%B8%AA%E4%BA%BA%E6%8C%91%E6%88%98%E8%B5%9B%E2%80%94%E5%A5%97%E4%BB%B6%E5%BC%80%E5%8F%91%E4%BB%BB%E5%8A%A1%E5%90%88%E9%9B%86.md#no13-domino-%E8%AE%BA%E6%96%87%E5%A4%8D%E7%8E%B0
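@wangguan1995 The model now trains normally, and its accuracy has been verified with PaDiff. The training code is nondeterministic: the per-step data preprocessing cannot be pinned with a random seed. Items that still need verification:
(1) No full dataset is available yet; training uses a single sample, and the loss decreases normally over 50 epochs.
(2) The inference code has been adapted, but preprocessing still requires point-cloud processing, which cannot run on AI Studio, so this part needs further verification.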
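When a seed cannot pin per-step sampling, one possible workaround is to draw the sampled point indices once, store them, and feed the same indices to both the torch and the paddle run. The sketch below only illustrates that idea under stated assumptions: `make_fixed_indices`, `save_indices`, `load_indices`, and the file name `sample_indices.json` are hypothetical and are not part of this PR or of the Domino code base.

```python
from __future__ import annotations

import json
import random
import tempfile
from pathlib import Path


def make_fixed_indices(num_points: int, num_samples: int, seed: int = 0) -> list[int]:
    """Draw one reproducible subset of point indices (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, independent of global RNG state
    return sorted(rng.sample(range(num_points), num_samples))


def save_indices(indices: list[int], path: Path) -> None:
    """Persist the indices so both frameworks read exactly the same subset."""
    path.write_text(json.dumps(indices))


def load_indices(path: Path) -> list[int]:
    """Load indices previously written by save_indices."""
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "sample_indices.json"  # assumed file name
    save_indices(make_fixed_indices(num_points=1000, num_samples=8, seed=42), index_file)
    torch_side = load_indices(index_file)   # would index the torch inputs
    paddle_side = load_indices(index_file)  # would index the paddle inputs
    print(torch_side == paddle_side)
```

Whether this is applicable depends on where the preprocessing draws its random numbers, which is not shown in this thread.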
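- The code repository currently contains a large number of relative paths
- The download script currently has some issues (a problem on the AWS side; note it in the documentation for now)
- run_1 is used as the data for validating training; the torch comparison log for 10 epochs is posted here
- What is needed is forward-loss alignment at the 1e-5 level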
Logs for the first 10 epochs:
torch:
Device cuda:0, batch processed: 1, loss volume: 0.23092692 , loss surface: 0.06465874, loss integral: 0.00000000, loss surface area: 0.00293629
Device cuda:0, batch: 1, loss norm: 0.26472443
Device cuda:0 LOSS train 0.26472443 valid 0.19338508 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 118.63378143310547
Device cuda:0, epoch 1:
Device cuda:0, batch processed: 1, loss volume: 0.17430124 , loss surface: 0.03637335, loss integral: 0.00000000, loss surface area: 0.00179438
Device cuda:0, batch: 1, loss norm: 0.19338511
Device cuda:0 LOSS train 0.19338511 valid 0.11204524 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 114.84539103507996
Device cuda:0, epoch 2:
Device cuda:0, batch processed: 1, loss volume: 0.09825166 , loss surface: 0.02525518, loss integral: 0.00000000, loss surface area: 0.00233155
Device cuda:0, batch: 1, loss norm: 0.11204502
Device cuda:0 LOSS train 0.11204502 valid 0.10328176 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 115.87338018417358
Device cuda:0, epoch 3:
Device cuda:0, batch processed: 1, loss volume: 0.09899043 , loss surface: 0.00795730, loss integral: 0.00000000, loss surface area: 0.00062173
Device cuda:0, batch: 1, loss norm: 0.10327994
Device cuda:0 LOSS train 0.10327994 valid 0.05417285 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 116.46289348602295
Device cuda:0, epoch 4:
Device cuda:0, batch processed: 1, loss volume: 0.04868028 , loss surface: 0.01017947, loss integral: 0.00000000, loss surface area: 0.00080577
Device cuda:0, batch: 1, loss norm: 0.05417290
Device cuda:0 LOSS train 0.05417290 valid 0.08227389 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.91964483261108
Device cuda:0, epoch 5:
Device cuda:0, batch processed: 1, loss volume: 0.07733711 , loss surface: 0.00911645, loss integral: 0.00000000, loss surface area: 0.00075788
Device cuda:0, batch: 1, loss norm: 0.08227427
Device cuda:0 LOSS train 0.08227427 valid 0.08577856 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.41759729385376
Device cuda:0, epoch 6:
Device cuda:0, batch processed: 1, loss volume: 0.08035985 , loss surface: 0.01013019, loss integral: 0.00000000, loss surface area: 0.00070755
Device cuda:0, batch: 1, loss norm: 0.08577872
Device cuda:0 LOSS train 0.08577872 valid 0.06831404 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.05477333068848
Device cuda:0, epoch 7:
Device cuda:0, batch processed: 1, loss volume: 0.06417362 , loss surface: 0.00766032, loss integral: 0.00000000, loss surface area: 0.00062155
Device cuda:0, batch: 1, loss norm: 0.06831456
Device cuda:0 LOSS train 0.06831456 valid 0.04400067 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 116.03830194473267
Device cuda:0, epoch 8:
Device cuda:0, batch processed: 1, loss volume: 0.04010706 , loss surface: 0.00724047, loss integral: 0.00000000, loss surface area: 0.00054595
Device cuda:0, batch: 1, loss norm: 0.04400026
Device cuda:0 LOSS train 0.04400026 valid 0.04530785 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 115.86109900474548
Device cuda:0, epoch 9:
Device cuda:0, batch processed: 1, loss volume: 0.04175998 , loss surface: 0.00659478, loss integral: 0.00000000, loss surface area: 0.00050289
Device cuda:0, batch: 1, loss norm: 0.04530882
Device cuda:0 LOSS train 0.04530882 valid 0.06185470 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 115.73935866355896
Device cuda:0, epoch 10:
Device cuda:0, batch processed: 1, loss volume: 0.05804047 , loss surface: 0.00707586, loss integral: 0.00000000, loss surface area: 0.00055142
Device cuda:0, batch: 1, loss norm: 0.06185411
Device cuda:0 LOSS train 0.06185411 valid 0.04006197 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.04006196931004524, Time taken 116.31600141525269
paddle:
Device gpu:0, batch processed: 1, loss volume: 0.23092692 , loss surface: 0.06465873, loss integral: 0.00000000, loss surface area: 0.00293629
Device gpu:0, batch: 1, loss norm: 0.26472443
Loss/train: 0.2647244334220886/1
Device gpu:0 LOSS train 0.26472443 valid 0.19338842 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 118.70474576950073
Device gpu:0, epoch 1:
Device gpu:0, batch processed: 1, loss volume: 0.17430471 , loss surface: 0.03637305, loss integral: 0.00000000, loss surface area: 0.00179433
Device gpu:0, batch: 1, loss norm: 0.19338840
Loss/train: 0.19338840246200562/2
Device gpu:0 LOSS train 0.19338840 valid 0.11218615 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 115.69161486625671
Device gpu:0, epoch 2:
Device gpu:0, batch processed: 1, loss volume: 0.09839579 , loss surface: 0.02524963, loss integral: 0.00000000, loss surface area: 0.00233097
Device gpu:0, batch: 1, loss norm: 0.11218609
Loss/train: 0.11218608915805817/3
Device gpu:0 LOSS train 0.11218609 valid 0.10281664 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 115.92429780960083
Device gpu:0, epoch 3:
Device gpu:0, batch processed: 1, loss volume: 0.09853126 , loss surface: 0.00795383, loss integral: 0.00000000, loss surface area: 0.00062137
Device gpu:0, batch: 1, loss norm: 0.10281885
Loss/train: 0.10281885415315628/4
Device gpu:0 LOSS train 0.10281885 valid 0.05420898 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 116.74323606491089
Device gpu:0, epoch 4:
Device gpu:0, batch processed: 1, loss volume: 0.04871760 , loss surface: 0.01017731, loss integral: 0.00000000, loss surface area: 0.00080539
Device gpu:0, batch: 1, loss norm: 0.05420895
Loss/train: 0.05420895293354988/5
Device gpu:0 LOSS train 0.05420895 valid 0.08210348 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 115.5925304889679
Device gpu:0, epoch 5:
Device gpu:0, batch processed: 1, loss volume: 0.07717736 , loss surface: 0.00909785, loss integral: 0.00000000, loss surface area: 0.00075586
Device gpu:0, batch: 1, loss norm: 0.08210421
Loss/train: 0.08210421353578568/6
Device gpu:0 LOSS train 0.08210421 valid 0.08545748 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 115.61996293067932
Device gpu:0, epoch 6:
Device gpu:0, batch processed: 1, loss volume: 0.08003174 , loss surface: 0.01014488, loss integral: 0.00000000, loss surface area: 0.00070565
Device gpu:0, batch: 1, loss norm: 0.08545700
Loss/train: 0.08545700460672379/7
Device gpu:0 LOSS train 0.08545700 valid 0.06784783 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 116.20984172821045
Device gpu:0, epoch 7:
Device gpu:0, batch processed: 1, loss volume: 0.06372426 , loss surface: 0.00762877, loss integral: 0.00000000, loss surface area: 0.00061806
Device gpu:0, batch: 1, loss norm: 0.06784768
Loss/train: 0.06784767657518387/8
Device gpu:0 LOSS train 0.06784768 valid 0.04360897 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 116.99345993995667
Device gpu:0, epoch 8:
Device gpu:0, batch processed: 1, loss volume: 0.03972780 , loss surface: 0.00721701, loss integral: 0.00000000, loss surface area: 0.00054254
Device gpu:0, batch: 1, loss norm: 0.04360757
Loss/train: 0.043607573956251144/9
Device gpu:0 LOSS train 0.04360757 valid 0.04554129 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 116.08153223991394
Device gpu:0, epoch 9:
Device gpu:0, batch processed: 1, loss volume: 0.04196901 , loss surface: 0.00663531, loss integral: 0.00000000, loss surface area: 0.00050686
Device gpu:0, batch: 1, loss norm: 0.04554009
Loss/train: 0.04554009437561035/10
Device gpu:0 LOSS train 0.04554009 valid 0.06155418 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 115.41889214515686
Device gpu:0, epoch 10:
Device gpu:0, batch processed: 1, loss volume: 0.05775697 , loss surface: 0.00704247, loss integral: 0.00000000, loss surface area: 0.00054853
Device gpu:0, batch: 1, loss norm: 0.06155247
Loss/train: 0.061552468687295914/11
Device gpu:0 LOSS train 0.06155247 valid 0.04058465 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04058464989066124, Time taken 116.95455741882324
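The two logs above can be compared step by step against the 1e-5 target instead of by eye. The following is a minimal sketch, assuming the `loss norm:` line format shown in the logs; the helper names `loss_norms` and `compare` are illustrative, and the embedded excerpts are the first three `loss norm` lines of each log.

```python
from __future__ import annotations

import re

# Matches the value in lines such as "Device cuda:0, batch: 1, loss norm: 0.26472443".
LOSS_NORM = re.compile(r"loss norm:\s*([0-9.]+)")


def loss_norms(log_text: str) -> list[float]:
    """Extract every 'loss norm' value from a log, in order of appearance."""
    return [float(value) for value in LOSS_NORM.findall(log_text)]


def compare(torch_text: str, paddle_text: str, atol: float = 1e-5):
    """Return (step, torch, paddle, abs_diff, within_atol) for each paired step."""
    rows = []
    pairs = zip(loss_norms(torch_text), loss_norms(paddle_text))
    for step, (t, p) in enumerate(pairs):
        diff = abs(t - p)
        rows.append((step, t, p, diff, diff <= atol))
    return rows


# Excerpts copied from the logs above.
torch_log = """\
Device cuda:0, batch: 1, loss norm: 0.26472443
Device cuda:0, batch: 1, loss norm: 0.19338511
Device cuda:0, batch: 1, loss norm: 0.11204502
"""
paddle_log = """\
Device gpu:0, batch: 1, loss norm: 0.26472443
Device gpu:0, batch: 1, loss norm: 0.19338840
Device gpu:0, batch: 1, loss norm: 0.11218609
"""

rows = compare(torch_log, paddle_log)
for step, t, p, diff, ok in rows:
    print(f"step {step}: torch={t:.8f} paddle={p:.8f} |diff|={diff:.2e} within_atol={ok}")
```

On these excerpts the first two steps agree within 1e-5 and the third does not. Only the first step compares a loss computed before any optimizer update, so it is the closest match to a forward-only check.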
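The current plan is to verify accuracy on the data:
- 5.19 ~ 5.21: data processing
- 5.22 ~ 5.23: debugging on a small dataset and parallel debugging
- 5.23 ~ 5.26: finish verification and merge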