
[BUG] Guard check fails after deep-compiling a model that calls tensor.expand()

Open eternalNight opened this issue 4 months ago • 4 comments

Describe the bug Trying to deep-compile a model that calls tensor.expand() triggers the following guard error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/Playground/gist/deepcompile/extend.py", line 46, in <module>
[rank0]:     o = m(x)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/Projects/deepspeed/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/Projects/deepspeed/deepspeed/runtime/engine.py", line 2106, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
[rank0]:     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
[rank0]:     return self._torchdynamo_orig_callable(
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
[rank0]:     result = self._inner_convert(
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
[rank0]:     return _compile(
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
[rank0]:     guarded_code = compile_inner(code, one_graph, hooks, transform)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
[rank0]:     return _compile_inner(code, one_graph, hooks, transform)
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 906, in _compile_inner
[rank0]:     check_fn = CheckFunctionManager(
[rank0]:   File "/venv-3.10/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 2514, in __init__
[rank0]:     raise AssertionError(f"Guard check failed: {reasons}")
[rank0]: AssertionError: Guard check failed: 0/0: tensor 'self._parameters['cls_token']' rank mismatch. expected 3, actual 1. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.

To Reproduce Run this script: https://gist.github.com/eternalNight/89ad0639abba0d51ca7777a91d0b07a0
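For context, here is a minimal sketch of the kind of pattern involved. This is hypothetical and plain PyTorch only (the actual repro, including the DeepSpeed deep-compile setup, is the gist above); the module and names are assumed from the error message:

```python
import torch

class TokenModel(torch.nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        # A rank-3 parameter, as in a ViT-style cls_token
        self.cls_token = torch.nn.Parameter(torch.zeros(1, 1, dim))
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        b = x.shape[0]
        # expand() returns a broadcast view of the parameter; per the
        # traceback, the tensor-match guard later records a mismatched
        # rank (expected 3, actual 1) for it under deep-compile
        tok = self.cls_token.expand(b, -1, -1)
        return self.linear(torch.cat([tok, x], dim=1))

m = TokenModel()
x = torch.randn(2, 3, 8)
out = m(x)  # eager mode works fine; the guard error appears under deep-compile
print(tuple(out.shape))
```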

Expected behavior Forward graph is generated without triggering any guard check error.

System info (please complete the following information):

  • Python version: 3.10.13
  • Deepspeed commit: 43f00ba3 Remove additional unused tests (human-eval) (#7445) (i.e. v0.17.2-16-g43f00ba3)

eternalNight avatar Aug 04 '25 08:08 eternalNight

The "self._parameters['cls_token'] rank mismatch" reason given for the failure may be misleading. Even with a model that does compile (e.g., https://gist.github.com/eternalNight/89ad0639abba0d51ca7777a91d0b07a0 with lines 39-40 commented out), the guard check succeeds but check_verbose fails:

> /venv-3.10/lib/python3.10/site-packages/torch/_dynamo/guards.py(2509)__init__()
   2508             ipdb.set_trace()
-> 2509             if not self.guard_manager.check(output_graph.local_scope):
   2510                 reasons = get_guard_fail_reason_helper(

ipdb> self.guard_manager.check(output_graph.local_scope)
True
ipdb> self.guard_manager.check_verbose(output_graph.local_scope).result
False
ipdb> self.guard_manager.check_verbose(output_graph.local_scope).verbose_code_parts
["tensor 'L['self']._modules['linear']._parameters['bias']' size mismatch at index 0. expected 1024, actual 0. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters."]

eternalNight avatar Aug 04 '25 09:08 eternalNight

Hi there, has this issue been fixed, or is there any guidance on how to avoid it?

StackChan avatar Nov 06 '25 03:11 StackChan

same problem

zuoyanzhang avatar Nov 06 '25 03:11 zuoyanzhang

> Hi there, has this issue been fixed, or is there any guidance on how to avoid it?

I haven't had a chance to dig into the torch graph guard logic for the root cause yet. As a workaround, you can patch the logic so that such problematic guards are not inserted:

vanilla_add_tensor_match_guard = torch._C._dynamo.guards.GuardManager.add_tensor_match_guard

def add_tensor_match_guard(self, value, sizes, strides, tensor_name, verbose_code_parts):
    # Skip tensor-match guards on the parameters that trip the check;
    # "cls_token" and "reg_token" are the names from my assertion message,
    # so adjust them to match yours.
    if "cls_token" not in tensor_name and "reg_token" not in tensor_name:
        vanilla_add_tensor_match_guard(self, value, sizes, strides, tensor_name, verbose_code_parts)

torch._C._dynamo.guards.GuardManager.add_tensor_match_guard = add_tensor_match_guard

The tensor names can be found in the assertion failure message.

Kindly note that this is an ugly workaround. Make sure you have other guards (e.g., on the inputs) that can still detect shape changes.

eternalNight avatar Nov 06 '25 05:11 eternalNight