
[BUG] Check error when using oneflow.amp.autocast

Open clackhan opened this issue 3 years ago • 2 comments

Code to reproduce:

import oneflow as flow
x = flow.tensor([2.4,3.5], device="cuda", dtype=flow.float16)

with flow.amp.autocast("cuda", flow.float16):
    y = x.clone()
    y.fill_(2.36)
    print(y.dtype)

The code above runs without error in PyTorch, but OneFlow reports the following error:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    y.fill_(2.36)
  File "/home/hanbinbin/oneflow/python/oneflow/framework/tensor.py", line 286, in _fill
    return flow._C.fill_(self, value)
RuntimeError: Check failed: tensor_impl->tensor_meta()->dtype() == output_tensor_metas.at(i)->dtype()
  File "/home/hanbinbin/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 3337, in operator()
    OpInterpUtil::Dispatch(*op_, {in}, outputs.get(), attrs)
  File "/home/hanbinbin/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 96, in Apply
    internal_->Apply(op_expr, inputs, outputs, ctx)
  File "/home/hanbinbin/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 122, in NaiveInterpret

Error Type: oneflow.ErrorProto.check_failed_error
Aborted (core dumped)

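Here is what happens. Under autocast, a cast op is inserted before the fill_ op, so the dtype check between the fill_ op's input and output fails and the program aborts. Because fill_ is an inplace operation, once the op's meta information has been inferred, the interpreter checks that the inferred result matches the existing output tensor. The inserted cast op changes the interpreter's input, the fill_ op infers its output from that new input, and the result no longer matches the output.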
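The failure described above can be modeled with a small toy, independent of OneFlow. This is only a sketch of the reasoning, not OneFlow's implementation: `ToyTensor` and `toy_fill_inplace` are hypothetical names, and dtypes are plain strings.

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    dtype: str  # e.g. "float16" or "float32"


def toy_fill_inplace(out, cast_input_to=None):
    """Toy in-place fill_: infer the output dtype from the (possibly cast)
    input, then check it against the tensor that is written in place."""
    # Autocast may replace the op's input with a cast copy (hypothetical model).
    op_input = ToyTensor(cast_input_to) if cast_input_to else out
    # The op infers its output dtype from whatever input it receives.
    inferred_dtype = op_input.dtype
    # In-place check: the inferred meta must match the existing output tensor.
    if inferred_dtype != out.dtype:
        raise RuntimeError(f"Check failed: {out.dtype} == {inferred_dtype}")
    return out


y = ToyTensor("float16")
toy_fill_inplace(y)  # no cast inserted: the check passes

try:
    toy_fill_inplace(y, cast_input_to="float32")  # cast inserted before fill_
    failed = False
except RuntimeError as err:
    failed = True
    print(err)
```

In this model the second call fails because the op sees a float32 input while the tensor it must write into is still float16, which mirrors the check in the traceback.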

clackhan avatar Nov 03 '22 10:11 clackhan

It looks like this is caused by the fill op not having been added to the amp list. If fill is treated as a black op, its input is forcibly cast to float32.

hjchen2 avatar Nov 03 '22 13:11 hjchen2

Adding fill to the clear list should solve the problem here.
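As a rough illustration of why the list placement matters, the sketch below models a list-based autocast policy. It assumes (following the comment above) that ops not found in any list are treated as black ops and get a float32 input, and that ops in the clear list keep their input dtype. The function name and list contents are hypothetical, not OneFlow's actual lists.

```python
def dtype_seen_by_op(op_name, input_dtype, autocast_dtype, white_list, clear_list):
    """Return the dtype an op's input has after the toy autocast policy."""
    if op_name in clear_list:
        # Clear ops are dtype-agnostic here: no cast is inserted.
        return input_dtype
    if op_name in white_list:
        # White ops run in the autocast dtype.
        return autocast_dtype
    # Assumption: anything else is treated as a black op and cast to float32.
    return "float32"


white_list = {"matmul"}  # hypothetical contents
clear_list = set()       # fill_ not registered yet

before = dtype_seen_by_op("fill_", "float16", "float16", white_list, clear_list)

clear_list.add("fill_")  # the proposed fix, in toy form
after = dtype_seen_by_op("fill_", "float16", "float16", white_list, clear_list)

print(before, after)
```

Under these assumptions, the unlisted fill_ sees a float32 input, which cannot match a float16 in-place output, while the clear-listed fill_ keeps the float16 input and the in-place check can pass.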

hjchen2 avatar Nov 03 '22 13:11 hjchen2