PaddleScience

[Hackathon 8th No.13] Domino paper reproduction

Open xiaoyewww opened this issue 8 months ago • 6 comments

PR types

New Features

PR changes

Others

Describe

support domino

xiaoyewww avatar Mar 04 '25 17:03 xiaoyewww

Thanks for your contribution!

paddle-bot[bot] avatar Mar 04 '25 17:03 paddle-bot[bot]

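There are two problems with reproducing Domino:

  1. So far only the model itself has been reproduced. The preprocessing and postprocessing parts need very high CPU and memory resources and cannot be run on aistudio.
  2. No official pretrained weights are provided.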

xiaoyewww avatar Mar 04 '25 17:03 xiaoyewww

Please submit an RFC design document first.

luotao1 avatar Mar 05 '25 12:03 luotao1

Sorry, the task description here is wrong. It needs to be changed to [Inference] and [Training].

wangguan1995 avatar Mar 10 '25 11:03 wangguan1995

@wangguan1995 You can open a PR to fix the task description: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_8th/%E3%80%90Hackathon_8th%E3%80%91%E4%B8%AA%E4%BA%BA%E6%8C%91%E6%88%98%E8%B5%9B%E2%80%94%E5%A5%97%E4%BB%B6%E5%BC%80%E5%8F%91%E4%BB%BB%E5%8A%A1%E5%90%88%E9%9B%86.md#no13-domino-%E8%AE%BA%E6%96%87%E5%A4%8D%E7%8E%B0

luotao1 avatar Mar 11 '25 02:03 luotao1

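@wangguan1995 The model now trains normally, and its precision has been verified with Padiff. The training code is nondeterministic: the data preprocessing done at each step cannot be fixed with a random seed. The items that still need verification are:

  1. The dataset is currently missing. Training on a single sample only, the loss decreases normally over 50 epochs.
  2. The inference code has been adapted, but the preprocessing still requires point cloud processing, which cannot be done on aistudio. This part needs further verification.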
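The per-step randomness noted above makes a step-by-step comparison between frameworks harder. One generic way to make this kind of sampling repeatable, shown here as a sketch and not as the method used in this PR, is to draw each step's point subset from a private generator keyed by the seed and the step index. The function `sample_indices` and all of its parameters are hypothetical:

```python
import random

def sample_indices(num_points: int, k: int, seed: int, step: int) -> list[int]:
    """Pick k distinct point indices for one training step, reproducibly."""
    # A private generator keyed by (seed, step): the result does not depend
    # on global RNG state or on how many random calls happened earlier.
    rng = random.Random(seed * 1_000_003 + step)
    return rng.sample(range(num_points), k)

# The same (seed, step) always yields the same subset, so two runs
# (for example one per framework) can be fed identical points.
a = sample_indices(num_points=10_000, k=8, seed=42, step=3)
b = sample_indices(num_points=10_000, k=8, seed=42, step=3)
print(a == b)
```

Whether this applies depends on where the randomness in the actual preprocessing comes from, which the thread does not specify.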

xiaoyewww avatar Mar 15 '25 15:03 xiaoyewww

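  1. The code repository currently contains a large number of relative paths.
  2. The download script currently has some problems (an issue on the AWS side; note it in the documentation for now).
  3. Use run_1 as the data for validating training, and post the 10-epoch comparison logs against torch here.
  4. What is needed is alignment of the forward loss at the 1e-5 level.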
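The 1e-5 target in item 4 can be checked mechanically. A minimal sketch, assuming the forward losses from both frameworks are already available as plain floats; the function name `forward_loss_aligned` and the sample values are illustrative and not part of the repository:

```python
import math

def forward_loss_aligned(torch_loss: float, paddle_loss: float,
                         atol: float = 1e-5) -> bool:
    """Return True when the two loss values differ by at most `atol`."""
    # rel_tol=0.0 turns math.isclose into a pure absolute-difference check.
    return math.isclose(torch_loss, paddle_loss, rel_tol=0.0, abs_tol=atol)

# Hypothetical values, chosen only to show both outcomes.
print(forward_loss_aligned(0.5, 0.500004))  # difference of about 4e-6
print(forward_loss_aligned(0.5, 0.5002))    # difference of about 2e-4
```

An absolute tolerance is used here because the requirement is stated as an absolute level; a relative tolerance would be stricter for small losses and looser for large ones.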

wangguan1995 avatar Apr 03 '25 08:04 wangguan1995

Logs for the first 10 epochs:

torch

Device cuda:0, batch processed: 1, loss volume: 0.23092692             , loss surface: 0.06465874, loss integral: 0.00000000, loss surface area: 0.00293629
 Device cuda:0,  batch: 1, loss norm: 0.26472443
Device cuda:0 LOSS train 0.26472443 valid 0.19338508 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 118.63378143310547

Device cuda:0, epoch 1:
Device cuda:0, batch processed: 1, loss volume: 0.17430124             , loss surface: 0.03637335, loss integral: 0.00000000, loss surface area: 0.00179438
 Device cuda:0,  batch: 1, loss norm: 0.19338511
Device cuda:0 LOSS train 0.19338511 valid 0.11204524 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 114.84539103507996

Device cuda:0, epoch 2:
Device cuda:0, batch processed: 1, loss volume: 0.09825166             , loss surface: 0.02525518, loss integral: 0.00000000, loss surface area: 0.00233155
 Device cuda:0,  batch: 1, loss norm: 0.11204502
Device cuda:0 LOSS train 0.11204502 valid 0.10328176 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 115.87338018417358

Device cuda:0, epoch 3:
Device cuda:0, batch processed: 1, loss volume: 0.09899043             , loss surface: 0.00795730, loss integral: 0.00000000, loss surface area: 0.00062173
 Device cuda:0,  batch: 1, loss norm: 0.10327994
Device cuda:0 LOSS train 0.10327994 valid 0.05417285 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 116.46289348602295

Device cuda:0, epoch 4:
Device cuda:0, batch processed: 1, loss volume: 0.04868028             , loss surface: 0.01017947, loss integral: 0.00000000, loss surface area: 0.00080577
 Device cuda:0,  batch: 1, loss norm: 0.05417290
Device cuda:0 LOSS train 0.05417290 valid 0.08227389 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.91964483261108

Device cuda:0, epoch 5:
Device cuda:0, batch processed: 1, loss volume: 0.07733711             , loss surface: 0.00911645, loss integral: 0.00000000, loss surface area: 0.00075788
 Device cuda:0,  batch: 1, loss norm: 0.08227427
Device cuda:0 LOSS train 0.08227427 valid 0.08577856 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.41759729385376

Device cuda:0, epoch 6:
Device cuda:0, batch processed: 1, loss volume: 0.08035985             , loss surface: 0.01013019, loss integral: 0.00000000, loss surface area: 0.00070755
 Device cuda:0,  batch: 1, loss norm: 0.08577872
Device cuda:0 LOSS train 0.08577872 valid 0.06831404 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.05477333068848

Device cuda:0, epoch 7:
Device cuda:0, batch processed: 1, loss volume: 0.06417362             , loss surface: 0.00766032, loss integral: 0.00000000, loss surface area: 0.00062155
 Device cuda:0,  batch: 1, loss norm: 0.06831456
Device cuda:0 LOSS train 0.06831456 valid 0.04400067 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 116.03830194473267

Device cuda:0, epoch 8:
Device cuda:0, batch processed: 1, loss volume: 0.04010706             , loss surface: 0.00724047, loss integral: 0.00000000, loss surface area: 0.00054595
 Device cuda:0,  batch: 1, loss norm: 0.04400026
Device cuda:0 LOSS train 0.04400026 valid 0.04530785 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 115.86109900474548

Device cuda:0, epoch 9:
Device cuda:0, batch processed: 1, loss volume: 0.04175998             , loss surface: 0.00659478, loss integral: 0.00000000, loss surface area: 0.00050289
 Device cuda:0,  batch: 1, loss norm: 0.04530882
Device cuda:0 LOSS train 0.04530882 valid 0.06185470 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 115.73935866355896

Device cuda:0, epoch 10:
Device cuda:0, batch processed: 1, loss volume: 0.05804047             , loss surface: 0.00707586, loss integral: 0.00000000, loss surface area: 0.00055142
 Device cuda:0,  batch: 1, loss norm: 0.06185411
Device cuda:0 LOSS train 0.06185411 valid 0.04006197 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.04006196931004524, Time taken 116.31600141525269

paddle:

Device gpu:0, batch processed: 1, loss volume: 0.23092692             , loss surface: 0.06465873, loss integral: 0.00000000, loss surface area: 0.00293629
 Device gpu:0,  batch: 1, loss norm: 0.26472443
Loss/train: 0.2647244334220886/1
Device gpu:0 LOSS train 0.26472443 valid 0.19338842 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 118.70474576950073

Device gpu:0, epoch 1:
Device gpu:0, batch processed: 1, loss volume: 0.17430471             , loss surface: 0.03637305, loss integral: 0.00000000, loss surface area: 0.00179433
 Device gpu:0,  batch: 1, loss norm: 0.19338840
Loss/train: 0.19338840246200562/2
Device gpu:0 LOSS train 0.19338840 valid 0.11218615 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 115.69161486625671

Device gpu:0, epoch 2:
Device gpu:0, batch processed: 1, loss volume: 0.09839579             , loss surface: 0.02524963, loss integral: 0.00000000, loss surface area: 0.00233097
 Device gpu:0,  batch: 1, loss norm: 0.11218609
Loss/train: 0.11218608915805817/3
Device gpu:0 LOSS train 0.11218609 valid 0.10281664 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 115.92429780960083

Device gpu:0, epoch 3:
Device gpu:0, batch processed: 1, loss volume: 0.09853126             , loss surface: 0.00795383, loss integral: 0.00000000, loss surface area: 0.00062137
 Device gpu:0,  batch: 1, loss norm: 0.10281885
Loss/train: 0.10281885415315628/4
Device gpu:0 LOSS train 0.10281885 valid 0.05420898 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 116.74323606491089

Device gpu:0, epoch 4:
Device gpu:0, batch processed: 1, loss volume: 0.04871760             , loss surface: 0.01017731, loss integral: 0.00000000, loss surface area: 0.00080539
 Device gpu:0,  batch: 1, loss norm: 0.05420895
Loss/train: 0.05420895293354988/5
Device gpu:0 LOSS train 0.05420895 valid 0.08210348 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 115.5925304889679

Device gpu:0, epoch 5:
Device gpu:0, batch processed: 1, loss volume: 0.07717736             , loss surface: 0.00909785, loss integral: 0.00000000, loss surface area: 0.00075586
 Device gpu:0,  batch: 1, loss norm: 0.08210421
Loss/train: 0.08210421353578568/6
Device gpu:0 LOSS train 0.08210421 valid 0.08545748 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 115.61996293067932

Device gpu:0, epoch 6:
Device gpu:0, batch processed: 1, loss volume: 0.08003174             , loss surface: 0.01014488, loss integral: 0.00000000, loss surface area: 0.00070565
 Device gpu:0,  batch: 1, loss norm: 0.08545700
Loss/train: 0.08545700460672379/7
Device gpu:0 LOSS train 0.08545700 valid 0.06784783 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 116.20984172821045

Device gpu:0, epoch 7:
Device gpu:0, batch processed: 1, loss volume: 0.06372426             , loss surface: 0.00762877, loss integral: 0.00000000, loss surface area: 0.00061806
 Device gpu:0,  batch: 1, loss norm: 0.06784768
Loss/train: 0.06784767657518387/8
Device gpu:0 LOSS train 0.06784768 valid 0.04360897 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 116.99345993995667

Device gpu:0, epoch 8:
Device gpu:0, batch processed: 1, loss volume: 0.03972780             , loss surface: 0.00721701, loss integral: 0.00000000, loss surface area: 0.00054254
 Device gpu:0,  batch: 1, loss norm: 0.04360757
Loss/train: 0.043607573956251144/9
Device gpu:0 LOSS train 0.04360757 valid 0.04554129 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 116.08153223991394

Device gpu:0, epoch 9:
Device gpu:0, batch processed: 1, loss volume: 0.04196901             , loss surface: 0.00663531, loss integral: 0.00000000, loss surface area: 0.00050686
 Device gpu:0,  batch: 1, loss norm: 0.04554009
Loss/train: 0.04554009437561035/10
Device gpu:0 LOSS train 0.04554009 valid 0.06155418 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 115.41889214515686

Device gpu:0, epoch 10:
Device gpu:0, batch processed: 1, loss volume: 0.05775697             , loss surface: 0.00704247, loss integral: 0.00000000, loss surface area: 0.00054853
 Device gpu:0,  batch: 1, loss norm: 0.06155247
Loss/train: 0.061552468687295914/11
Device gpu:0 LOSS train 0.06155247 valid 0.04058465 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04058464989066124, Time taken 116.95455741882324
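The two logs can be compared epoch by epoch instead of by eye. A minimal sketch, assuming each log is available as a text string in the format shown above; the helper names `epoch_losses` and `max_abs_diffs` are hypothetical, and only the first two summary lines of each log are embedded to keep the example short:

```python
import re

# Matches the per-epoch summary lines, e.g.
# "Device gpu:0 LOSS train 0.26472443 valid 0.19338842 Current lr 0.001..."
SUMMARY = re.compile(r"LOSS train (\d+\.\d+) valid (\d+\.\d+)")

def epoch_losses(log_text: str) -> list[tuple[float, float]]:
    """Extract (train, valid) loss pairs, one per epoch, in log order."""
    return [(float(t), float(v)) for t, v in SUMMARY.findall(log_text)]

def max_abs_diffs(torch_log: str, paddle_log: str) -> list[float]:
    """Per epoch, the larger of the train and valid absolute differences."""
    pairs = zip(epoch_losses(torch_log), epoch_losses(paddle_log))
    return [max(abs(a - b) for a, b in zip(t, p)) for t, p in pairs]

# First two summary lines copied from each log above.
torch_log = """\
Device cuda:0 LOSS train 0.26472443 valid 0.19338508 Current lr 0.001Integral factor 0
Device cuda:0 LOSS train 0.19338511 valid 0.11204524 Current lr 0.001Integral factor 0
"""
paddle_log = """\
Device gpu:0 LOSS train 0.26472443 valid 0.19338842 Current lr 0.001Integral factor 0
Device gpu:0 LOSS train 0.19338840 valid 0.11218615 Current lr 0.001Integral factor 0
"""

for epoch, diff in enumerate(max_abs_diffs(torch_log, paddle_log)):
    print(f"epoch {epoch}: max abs diff {diff:.2e}, within 1e-5: {diff <= 1e-5}")
```

On these two epochs, the first-epoch losses agree to within 1e-5, while the second-epoch validation loss already differs by about 1.4e-4. That pattern is what one would expect if the forward pass matches and small differences accumulate through the weight updates.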

xiaoyewww avatar Apr 03 '25 15:04 xiaoyewww

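The current plan is to verify precision on the data:

  • 5.19 ~ 5.21: data processing
  • 5.22 ~ 5.23: debugging on a small dataset, parallel debugging
  • 5.23 ~ 5.26: finish verification and merge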

wangguan1995 avatar May 19 '25 02:05 wangguan1995