models
ResNet NPU
This branch pins the dataset and removes randomness.
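How the branch removes randomness is not described here. As a rough illustration of the general idea only, the sketch below shows one way to get a sample order that is identical on every run. The helper name `fixed_order` and its `seed` parameter are hypothetical and are not taken from this repository.

```python
import random


def fixed_order(samples, seed=0):
    """Return the samples in an order that is the same on every run.

    Hypothetical helper for illustration; not part of this repository.
    """
    rng = random.Random(seed)   # private generator, independent of global state
    ordered = sorted(samples)   # start from a canonical order
    rng.shuffle(ordered)        # reproducible shuffle for a given seed
    return ordered
```

Because the input is sorted first and the generator is seeded locally, the result does not depend on the order in which the samples were listed.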
A loadable initial model must be prepared at .../models/Vision/classification/image/resnet50/examples/checkpoints/init
For example, on the 910b you can run:
cd .../models/Vision/classification/image/resnet50/examples
cp -r /data1/home/xiexuan/git-repos/models/Vision/classification/image/resnet50/examples/checkpoints .
Then you can run ./npu_eager.sh
or ./npu_graph.sh
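Since both scripts rely on the initial model being in place, a small pre-flight check can catch a missing checkpoint before launching. This is a minimal sketch: the function name `init_checkpoint_ready` is hypothetical, and it only assumes the `checkpoints/init` layout mentioned above, not any particular file names inside it.

```python
from pathlib import Path


def init_checkpoint_ready(examples_dir):
    """Return True if <examples_dir>/checkpoints/init exists and is not empty.

    Hypothetical helper for illustration; not part of this repository.
    """
    init_dir = Path(examples_dir) / "checkpoints" / "init"
    return init_dir.is_dir() and any(init_dir.iterdir())
```

Called as `init_checkpoint_ready(".")` from the examples directory, it returns False when the directory is missing or empty. It does not verify that the contents can actually be loaded.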
Currently NPU eager is aligned with CUDA eager/graph, but NPU graph is not yet aligned: every value in the output pred is 0.001, which needs deeper investigation.
loss
tensor(6.9073, placement=oneflow.placement(type="npu", ranks=[0]), sbp=(oneflow.sbp.partial_sum,),
dtype=oneflow.float32)
pred
tensor([[0.0010, 0.0010, 0.0010, ..., 0.0010, 0.0010, 0.0010],
[0.0010, 0.0010, 0.0010, ..., 0.0010, 0.0010, 0.0010],
[0.0010, 0.0010, 0.0010, ..., 0.0010, 0.0010, 0.0010],
...,
[0.0010, 0.0010, 0.0010, ..., 0.0010, 0.0010, 0.0010],
[0.0010, 0.0010, 0.0010, ..., 0.0010, 0.0010, 0.0010],
[0.0010, 0.0010, 0.0010, ..., 0.0010, 0.0010, 0.0010]],
placement=oneflow.placement(type="npu", ranks=[0]), sbp=(oneflow.sbp.split(dim=0),), dtype=oneflow.float32)
label
tensor([582, 209, 272, 331, 768, 626, 838, 202, 333, 754, 435, 955, 853, 943, 40, 723, 3, 104, 51, 60, 118,
762, 603, 353, 898, 69, 552, 824, 999, 217, 713, 334, 758, 818, 115, 1, 609, 238, 147, 446, 240, 455,
442, 257, 206, 200, 911, 355, 684, 419], placement=oneflow.placement(type="npu", ranks=[0]),
sbp=(oneflow.sbp.split(dim=0),), dtype=oneflow.int32)
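A quick sanity check on the numbers above: a pred of 0.001 everywhere is what softmax yields when all logits are equal across 1000 classes, and the cross-entropy of such a uniform prediction is ln(1000), about 6.9078, which is close to the printed loss of 6.9073. The sketch below reproduces that arithmetic in plain Python. The class count of 1000 is an assumption (labels in the dump go up to 999), and this only shows the output is consistent with near-constant logits; it does not identify the cause.

```python
import math

NUM_CLASSES = 1000  # assumption: 1000 classes, consistent with labels up to 999


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label."""
    return -math.log(probs[label])


# Equal logits (zero is used here, any constant behaves the same)
# produce a uniform prediction.
probs = softmax([0.0] * NUM_CLASSES)
uniform_loss = cross_entropy(probs, 582)  # 582 is the first label in the dump

print(round(probs[0], 4), round(uniform_loss, 4))  # prints: 0.001 6.9078
```

The small gap between 6.9078 and the logged 6.9073 suggests the predictions are nearly, but not exactly, uniform; the dump rounds pred to four decimals, so such differences would not be visible there.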