Resnet npu

Open ShawnXuan opened this issue 5 months ago • 1 comments

This branch fixes the dataset and removes randomness.

You need to prepare a loadable initial model at .../models/Vision/classification/image/resnet50/examples/checkpoints/init

For example, on the 910b you can run:

cd .../models/Vision/classification/image/resnet50/examples
cp -r /data1/home/xiexuan/git-repos/models/Vision/classification/image/resnet50/examples/checkpoints .

Then you can run ./npu_eager.sh or ./npu_graph.sh

Currently npu eager and cuda eager/graph are aligned, but npu graph is not yet aligned: every value in the output pred is 0.001. This needs deeper investigation.

loss
tensor(6.9073, placement=oneflow.placement(type="npu", ranks=[0]), sbp=(oneflow.sbp.partial_sum,),
       dtype=oneflow.float32)
pred
tensor([[0.0010, 0.0010, 0.0010,  ..., 0.0010, 0.0010, 0.0010],
        [0.0010, 0.0010, 0.0010,  ..., 0.0010, 0.0010, 0.0010],
        [0.0010, 0.0010, 0.0010,  ..., 0.0010, 0.0010, 0.0010],
        ...,
        [0.0010, 0.0010, 0.0010,  ..., 0.0010, 0.0010, 0.0010],
        [0.0010, 0.0010, 0.0010,  ..., 0.0010, 0.0010, 0.0010],
        [0.0010, 0.0010, 0.0010,  ..., 0.0010, 0.0010, 0.0010]],
       placement=oneflow.placement(type="npu", ranks=[0]), sbp=(oneflow.sbp.split(dim=0),), dtype=oneflow.float32)
label
tensor([582, 209, 272, 331, 768, 626, 838, 202, 333, 754, 435, 955, 853, 943,  40, 723,   3, 104,  51,  60, 118,
        762, 603, 353, 898,  69, 552, 824, 999, 217, 713, 334, 758, 818, 115,   1, 609, 238, 147, 446, 240, 455,
        442, 257, 206, 200, 911, 355, 684, 419], placement=oneflow.placement(type="npu", ranks=[0]),
       sbp=(oneflow.sbp.split(dim=0),), dtype=oneflow.int32)
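
A quick sanity check on the numbers above, as a minimal sketch rather than a diagnosis. It assumes a 1000-class classifier (the labels shown fall in 0..999, but the class count is not stated in this issue) and a standard softmax plus cross-entropy loss. Under those assumptions, a pred of 0.001 everywhere is what softmax returns for constant logits, and the matching loss is ln(1000), which is close to the reported 6.9073. That would suggest the npu graph run produces near-constant logits, though it does not say why.

```python
import math

# Assumption: 1000 classes (ImageNet-1k style); not stated in the issue.
NUM_CLASSES = 1000
REPORTED_LOSS = 6.9073  # value from the npu graph output above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_cross_entropy(num_classes):
    """Cross-entropy when every class gets probability 1/num_classes."""
    return -math.log(1.0 / num_classes)


# Constant logits (all zeros here) yield a uniform distribution.
probs = softmax([0.0] * NUM_CLASSES)
expected_loss = uniform_cross_entropy(NUM_CLASSES)

print(f"uniform prob : {probs[0]:.4f}")       # 0.0010
print(f"uniform loss : {expected_loss:.4f}")  # 6.9078
print(f"gap to report: {abs(expected_loss - REPORTED_LOSS):.4f}")
```

If the gap stays this small across steps, one thing worth checking is whether the final-layer logits in graph mode are constant or zero before softmax is applied.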

ShawnXuan, Sep 24 '24 08:09