
RuntimeError: Check failed: (nd_sbp.has_value()) == (this->has_nd_sbp_symbol_id()) (0 vs 1)

Open strint opened this issue 3 years ago • 1 comments

Summary

RuntimeError: Check failed: (nd_sbp.has_value()) == (this->has_nd_sbp_symbol_id()) (0 vs 1) 
  File "/home/xuxiaoyu/dev/oneflow/oneflow/core/functional/impl/global_cast.cpp", line 526, in operator()
    MetaInfoConsistencyCheck(parallel_desc, sbp_parallels, grad_sbp_parallels, 1, check_meta)
  File "/home/xuxiaoyu/dev/oneflow/oneflow/core/framework/consistency_check.cpp", line 253, in MetaInfoConsistencyCheck
    MetaInfoConsistencyCheck(placement, nd_sbp, grad_nd_sbp, debug_level, force_check)
  File "/home/xuxiaoyu/dev/oneflow/oneflow/core/framework/consistency_check.cpp", line 231, in MetaInfoConsistencyCheck
    MetaInfoConsistencyCheckUtil(placement, nd_sbp, grad_nd_sbp)
  File "/home/xuxiaoyu/dev/oneflow/oneflow/core/framework/consistency_check.cpp", line 201, in MetaInfoConsistencyCheckUtil
    ctx->Check()
  File "/home/xuxiaoyu/dev/oneflow/oneflow/core/framework/consistency_check.cpp", line 147, in Check
    flat_meta_info_consistency_->Check(placement_, nd_sbp_, grad_nd_sbp_)
  File "/home/xuxiaoyu/dev/oneflow/oneflow/core/framework/consistency_check.cpp", line 86, in Check
    
Error Type: oneflow.ErrorProto.check_failed_error

System Information

  • OneFlow version (run python3 -m oneflow --doctor): 0.8

strint avatar Sep 07 '22 12:09 strint

This check error is raised when running with a global tensor if the ranks disagree on the environment variable ONEFLOW_DEBUG_MODE, i.e. some ranks have ONEFLOW_DEBUG_MODE=1 and others have ONEFLOW_DEBUG_MODE=0.

Setting ONEFLOW_DEBUG_MODE to the same value on all ranks fixes this check error.
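A rough way to spot the mismatch before launching is to collect the ONEFLOW_DEBUG_MODE value seen by each rank and compare them. The sketch below is plain Python and does not call any OneFlow API. The helper names (`local_debug_mode`, `group_ranks_by_value`, `is_consistent`) are made up for illustration, and how the per-rank values get collected (launcher logs, ssh, etc.) is left open. It treats an unset variable as different from "0", which is a conservative assumption rather than a statement about how OneFlow reads the variable.

```python
import os
from collections import defaultdict

ENV_NAME = "ONEFLOW_DEBUG_MODE"


def local_debug_mode(environ=None):
    """Return this process's ONEFLOW_DEBUG_MODE value, or None when it is unset."""
    env = os.environ if environ is None else environ
    return env.get(ENV_NAME)


def group_ranks_by_value(values_by_rank):
    """Map each observed value to the sorted list of ranks that reported it."""
    groups = defaultdict(list)
    for rank, value in sorted(values_by_rank.items()):
        groups[value].append(rank)
    return dict(groups)


def is_consistent(values_by_rank):
    """True when every rank reported the same value (or no rank reported)."""
    return len(set(values_by_rank.values())) <= 1


if __name__ == "__main__":
    # Hypothetical values gathered from three ranks; rank 1 differs.
    observed = {0: "1", 1: "0", 2: "1"}
    if not is_consistent(observed):
        for value, ranks in group_ranks_by_value(observed).items():
            print(f"{ENV_NAME}={value!r} on ranks {ranks}")
```

On each rank, `local_debug_mode()` gives the value to report. If the groups differ, export the same ONEFLOW_DEBUG_MODE value in the launch environment of every rank before starting the job.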

strint avatar Sep 07 '22 12:09 strint