panshaowu comments

Results: 11 comments of panshaowu

Issue on running distributed training

@ThomasLimWZ Hello, thanks for your feedback. As far as I know, MindSpore's support for Windows is incomplete. Please consider switching to Linux. As to the problem...

Issue on running distributed training

@ThomasLimWZ As far as I know, MindSpore currently has no API for querying the required RAM or graphics memory. But I am afraid that the 4GB graphics memory of...

DBNet training is very slow on Ascend910B3

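@zx214 Hello, thanks for your feedback.
1. A network compilation time of more than 1 hour is not expected. We recommend upgrading MindSpore to r2.2.12 or r2.2.11 and installing the matching Ascend CANN package. Compared with r2.2.0, r2.2.12 fixes some bugs and improves performance. On the same hardware, training DBNet ResNet-50 with MindSpore r2.2.11, compilation takes about 1 to 2 minutes.
2. Training is slow. First, note that your number of training epochs is 1. At the start of training, compilation and loading have to run, so the first epoch is significantly slower than later ones. Second, we recommend setting `train.dataset_sink_mode` to `True`; enabling data sink mode can significantly improve training performance. Third, if your CPU has weak single-core performance, try setting `train.dataset.use_minddata` to `True`, or try moderately increasing `train.loader.num_workers` (setting it too high is counterproductive), which can improve data preprocessing efficiency. Finally, try setting `system.amp_level` to `O2` or `O3`, which switches some operations to fp16 mode, a better fit for Ascend hardware. However, some networks then become harder to converge (or may not converge at all), so the hyperparameters and options such as `train.clip_grad` and `train.clip_norm` need to be adjusted accordingly.
Below is the log from our training of DBNet ResNet-50 with MindSpore r2.2.11 on the same hardware, for your reference: ``` text (MindSpore) [ma-user mindocr-main]$DEVICE_ID=2 python tools/train.py -c configs/det/dbnet/db_r50_icdar15.yaml [2024-03-21 16:42:06] mindocr.train INFO...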
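The configuration changes suggested in this comment can be pictured as dotted-key overrides applied to a nested config. The following is only a minimal sketch: the helper names `set_by_dotted_key` and `apply_overrides`, the base values, and the `num_workers` value of 8 are illustrative assumptions, not part of MindOCR. Only the four option names come from the comment itself.

```python
from copy import deepcopy


def set_by_dotted_key(cfg, dotted_key, value):
    """Set a nested value, e.g. 'train.loader.num_workers' -> cfg['train']['loader']['num_workers']."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Create intermediate dicts when the base config lacks them.
        node = node.setdefault(name, {})
    node[leaf] = value


def apply_overrides(cfg, overrides):
    """Return a copy of cfg with every dotted-key override applied."""
    new_cfg = deepcopy(cfg)
    for key, value in overrides.items():
        set_by_dotted_key(new_cfg, key, value)
    return new_cfg


# Hypothetical starting values; a real config would be loaded from the YAML file.
base_cfg = {
    "system": {"amp_level": "O0"},
    "train": {"dataset_sink_mode": False, "loader": {"num_workers": 2}},
}

# Option names are taken from the comment; the chosen values are assumptions.
overrides = {
    "train.dataset_sink_mode": True,
    "train.dataset.use_minddata": True,
    "train.loader.num_workers": 8,
    "system.amp_level": "O2",
}

tuned_cfg = apply_overrides(base_cfg, overrides)
print(tuned_cfg)
```

Changing one option at a time and re-measuring is probably the safer way to use such overrides, since the comment warns that too many workers or a higher `amp_level` can hurt.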

DBNet training is very slow on Ascend910B3

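@zx214 Hello. From your description, it seems that the long delay is between launching the Python script and the start of the training computation, and is not necessarily caused by MindSpore's static-graph model compilation. We suggest the following approaches to isolate the code that is actually slow:
1. Separate the newly added Summary module and the data-loading code from the project, unit test them, and measure how long they take;
2. In the project code `tools/main.py`, use Python's `time` package to time each stage.
Based on your description, the possible causes of the delay are:
1. `Summary` writes too much redundant information, which takes too long;
2. Data loading is inefficient, for example because individual samples in the dataset have a very high resolution or the disk read/write speed is low;
3. In MindSpore Graph (static graph) mode, the first graph compilation takes relatively long, while later runs reuse the previously generated compilation cache. If you manually delete the compilation cache each time, compilation takes longer.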
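The stage-by-stage timing suggested in this comment could look roughly like the sketch below. It uses only Python's standard `time` and `contextlib` modules; the stage names and the `time.sleep` calls are placeholders (assumptions) standing in for the real dataset, summary, and model-building code.

```python
import time
from contextlib import contextmanager

# Maps a stage name to its elapsed wall-clock time in seconds.
stage_durations = {}


@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_durations[stage] = time.perf_counter() - start


# Placeholder stages: replace each sleep with the real code being measured.
with timed("build_dataset"):
    time.sleep(0.01)

with timed("build_summary"):
    time.sleep(0.01)

with timed("build_model"):
    time.sleep(0.02)

# Print the slowest stages first.
for stage, seconds in sorted(stage_durations.items(), key=lambda item: item[1], reverse=True):
    print(f"{stage}: {seconds:.3f} s")
```

Wrapping each phase separately makes it easier to tell whether the time goes into data loading, summary writing, or graph compilation.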

DBNet training is very slow on Ascend910B3

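@zx214 Hello. Regarding the error log you reported: because the MindSpore source code does not contain the relevant error message, the problem was probably not introduced by MindSpore. We suggest checking your Ascend environment configuration, including whether the hardware is installed properly and whether the CANN package is installed correctly (for example, whether the te and hccl whl packages were installed after installing CANN).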

The learning rate detected in the optimizer is not a Parameter type, so it is not recorded. Its type is '_IteratorLearningRate'.

@TanateT Hello, thanks for your feedback. It seems that the problem was caused by an improper definition of the learning rate. You can try to debug the code on PyNative...

The learning rate detected in the optimizer is not a Parameter type, so it is not recorded. Its type is '_IteratorLearningRate'.

@TanateT Hello, I tried modifying the Python script `tools/train.py` (refer to [Tutorials](https://www.mindspore.cn/docs/zh-CN/r2.2/api_python/mindspore/mindspore.SummaryCollector.html?highlight=summarycollector#mindspore.SummaryCollector)): ``` python def main(cfg): .... # training model = ms.Model(train_net) summary_collector = SummaryCollector(summary_dir='summary_dir', collect_freq=100) model.train( cfg.scheduler.num_epochs, loader_train,...

master decoding performance issue

Thanks for your feedback. We will arrange for development engineers to test the code you provided and then merge it.

RuntimeError: The 'getitem' operation does not support the type [None, Int64].

@sevennotmouse Hello, thanks for your feedback. Sorry for the late reply. Have you already resolved the issue above? > The supported types of overload function getitem is: [Tuple, Slice], [List, Slice], [Tensor, Ellipsis], [Tuple, Tensor], [List, Number], [Tensor, Slice], [Dictionary, String], [Tensor, Tensor], [String, Number], [Tensor,...

License plate detection and recognition example based on CCPD

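@Xv-M-S Please first fix the code style issues according to the CI test report. Once that is done, ask an administrator to re-trigger CI, and repeat until it passes.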