tvm icon indicating copy to clipboard operation
tvm copied to clipboard

[BugFix] Fix Crash Cases Caused by "__tvm_meta__ = None"

Open Johnson9009 opened this issue 1 year ago • 15 comments

Fix https://github.com/apache/tvm/issues/16633

The broken tests are below items of "tests/scripts/task_python_unittest.sh"

"tir-transform" "tir-usmp" The crashed cases: tests/python/tir-transform/test_tir_transform_inject_rolling_buffer.py tests/python/tir-usmp/test_tir_usmp_algo.py tests/python/tir-usmp/test_tir_usmp_analysis_extract_bufferinfo.py tests/python/tir-usmp/test_tir_usmp_transform_convert_pool_allocations_to_offsets.py tests/python/tir-usmp/test_tir_usmp_utils.py

other failed cases: tests/python/tir-transform/test_tir_transform_pointer_value_type_rewrite.py::TestRewriteToShuffle::test_compare tests/python/tir-transform/test_tir_transform_force_narrow_index_to_i32.py::test_thread_axis2 tests/python/tir-transform/test_tir_transform_hoist_if.py::test_hoisting_block_scope_4 tests/python/tir-transform/test_transform_default_gpu_schedule.py::test_add_on_metal

Johnson9009 avatar Feb 23 '24 11:02 Johnson9009

Run these test cases in local with this patch, below 3 test cases still will be failed because of TIR structure isn't equal, @Hzfengsy @tqchen can you help to see them? Thanks.

tir-transform: FAILED tests/python/tir-transform/test_tir_transform_force_narrow_index_to_i32.py::test_thread_axis2 FAILED tests/python/tir-transform/test_tir_transform_hoist_if.py::test_hoisting_block_scope_4 FAILED tests/python/tir-transform/test_transform_default_gpu_schedule.py::test_add_on_metal

Johnson9009 avatar Feb 23 '24 15:02 Johnson9009

The reason of why the crashed test cases won't be caught by CI is the pytest exist code is 5, it means no tests is collected, in file "tests/scripts/setup-pytest-env.sh" we haven't treat the exist code of 5 as error. image

I found another test item "tests/python/usmp" have the same exist code 5, because now there isn't the directory "tests/python/usmp", so pytest can't collect any test cases, I think "usmp" is a typo from https://github.com/apache/tvm/pull/16110, so I delete it.

Johnson9009 avatar Feb 26 '24 03:02 Johnson9009

cc @tqchen

Hzfengsy avatar Feb 27 '24 11:02 Hzfengsy

I had this issue come up in https://github.com/apache/tvm/pull/16569. I think these tests probably were not being executed before. I am also pretty sure it's safe to just remove those lines, as __tvm_meta__ is not actually read anywhere.

In my PR, I added a check in the parser to make sure that decl_function would only be called with function definitions, which raises a readable error message and prevents a segfault.

slyubomirsky avatar Feb 27 '24 21:02 slyubomirsky

Run these test cases in local with this patch, below 3 test cases still will be failed because of TIR structure isn't equal, @Hzfengsy @tqchen can you help to see them? Thanks.

tir-transform: FAILED tests/python/tir-transform/test_tir_transform_force_narrow_index_to_i32.py::test_thread_axis2 FAILED tests/python/tir-transform/test_tir_transform_hoist_if.py::test_hoisting_block_scope_4 FAILED tests/python/tir-transform/test_transform_default_gpu_schedule.py::test_add_on_metal

I am also having this error in my PR :joy: For test_tir_transform_force_narrow_index_to_i32, I think it's a real bug (the thread binding in the loop is not being updated). Don't know what's going on with the other two.

slyubomirsky avatar Feb 27 '24 21:02 slyubomirsky

I've figured out, thanks to @Lunderberg's comments in my PR, that the failure in tests/python/tir-transform/test_tir_transform_force_narrow_index_to_i32.py::test_thread_axis2 is due to line 354 in scr/script/ir_builder/tir/ir.cc. If you change

    IterVar iter_var(Range(nullptr), Var("iter", dtype), IterVarType::kThreadIndex, thread);

to

    IterVar iter_var(Range(nullptr), Var("iter", DataType::Int(32)), IterVarType::kThreadIndex, thread);

(what it was before the merge commit that Eric pointed out), the test will pass. However, I'm not sure we want to manually fix the data type like that.

This probably means that we should update the index narrowing pass to deal with the thread index itervar accordingly.

slyubomirsky avatar Mar 01 '24 20:03 slyubomirsky

I did a git bisect for test_tir_transform_hoist_if.py: eb15d04c3bff76062e26d5647fb8af0323de1bed is the first bad commit for tests/python/tir-transform/test_tir_transform_hoist_if.py::test_hoisting_block_scope_4. I will investigate to see if there's a fix hinted in there.

slyubomirsky avatar Mar 01 '24 22:03 slyubomirsky

Bisect for test_transform_default_gpu_schedule.py::test_add_on_metal: First failing commit is 1c35c392648e4336fc5e00ab91abb37af997cd59. It looks to have made changes to test cases, but I'm not sure what exactly is causing the failure. Now that there are commits isolated, it should make it easier to debug.

slyubomirsky avatar Mar 05 '24 22:03 slyubomirsky

Fix for the add_on_metal bug: You need to update IsScheduledOnGPU() in default_gpu_schedule.cc to also include Metal in its definition of GPU.

bool IsScheduledOnGPU(const BaseFunc& func) {
  // the target from context.
  tvm::Target target = tvm::Target::Current();
  // the Target in kTarget attribute of PrimFunc
  Optional<tvm::Target> func_target = func->attrs.GetAttr<tvm::Target>(tvm::attr::kTarget);
  if (func_target.defined()) {
    target = func_target.value();
  }

  if (target.defined()) {
    int dev_type = target->GetTargetDeviceType();
    // Change is here. I don't know if web GPU is quite correct
    if (!(dev_type == kDLCUDA || dev_type == kDLMetal || dev_type == kDLROCM ||
          dev_type == kDLWebGPU)) {
      return false;
    }
  }
  return true;
}

Note that this issue was mentioned in the code review for the PR. I will include this fix in my PR.

slyubomirsky avatar Mar 05 '24 23:03 slyubomirsky

You need to update IsScheduledOnGPU() in default_gpu_schedule.cc to also include Metal in its definition of GPU.

Should we instead check whether the target tags include "gpu"? That would avoid needing to list out each target kind individually.

Lunderberg avatar Mar 06 '24 16:03 Lunderberg

I'm not sure all of them actually contain the string "gpu." I will double check that.

Edit: Indeed, Metal is simply called "Metal"

slyubomirsky avatar Mar 12 '24 01:03 slyubomirsky

These bugs should all be fixed now that #16569 was merged. I recommend you close unless there's a change in here that wasn't covered there.

slyubomirsky avatar Mar 22 '24 01:03 slyubomirsky

https://github.com/apache/tvm/pull/16770 The sage continues...

slyubomirsky avatar Mar 22 '24 18:03 slyubomirsky

#16770 The sage continues...

Hi Steven, I'm little confused. Does these three cases been fixed?

I tried main with commit: 240326, b2204ae6 and update status as below. It seems still two cases failed:

  • still failed: tests/python/tir-transform/test_tir_transform_force_narrow_index_to_i32.py::test_thread_axis2
  • passed: tests/python/tir-transform/test_tir_transform_hoist_if.py::test_hoisting_block_scope_4
  • still failed: tests/python/tir-transform/test_transform_default_gpu_schedule.py::test_add_on_metal

ysh329 avatar Mar 26 '24 03:03 ysh329

#16770 The sage continues...

Hi Steven, I'm little confused. Does these three cases been fixed?

I tried main with commit: 240326, b2204ae and update status as below. It seems still two cases failed:

  • still failed: tests/python/tir-transform/test_tir_transform_force_narrow_index_to_i32.py::test_thread_axis2
  • passed: tests/python/tir-transform/test_tir_transform_hoist_if.py::test_hoisting_block_scope_4
  • still failed: tests/python/tir-transform/test_transform_default_gpu_schedule.py::test_add_on_metal

Test with tvm.main.20240407.81a850693, 3 tests all passed.

ysh329 avatar Apr 09 '24 07:04 ysh329