Pipeline parallelism example with PiPPy fails
System Info
- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
- `accelerate` bash location: redacted
- Python version: 3.10.14
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.59 GB
- GPU type: NVIDIA H100 PCIe
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the `examples/` folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
- Run the `llama.py` example in the `examples/inference/pippy` folder (the failing call is sketched below).
- The following error is raised:
torch._dynamo.exc.UserError: Dynamic control flow is not supported at the moment. Please use functorch.experimental.control_flow.cond to explicitly capture the control flow.
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#cond-operands
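For context, the call that fails boils down to roughly the following (a minimal sketch; the checkpoint name, prompts, and dtype are assumptions, not copied from `llama.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate.inference import prepare_pippy

# Assumed checkpoint; the actual example may pin a different one.
ckpt = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(["I would like to"], return_tensors="pt", padding=True)

# prepare_pippy traces the model with torch.export (via
# torch.distributed.pipelining.pipeline); this is where the
# dynamic-control-flow error above is raised.
model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
```

This was launched with something like `accelerate launch --num_processes 2 llama.py` (two ranks appear in the logs below).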
Expected behavior
I have tried different `accelerate launch` flags, such as `--dynamo_use_dynamic`, but I am not sure how to fix the above error.
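For what it's worth, the `cond` the error points to is the explicit control-flow op shown below. This is only an illustration of the API, not a fix for `llama.py`: the data-dependent branching happens inside `transformers` code, out of the user's hands.

```python
import torch
from functorch.experimental.control_flow import cond

# torch.export cannot trace a Python `if` on tensor data, but it can
# capture an explicit cond() whose branches are plain functions.
class Branchy(torch.nn.Module):
    def forward(self, x):
        def true_fn(x):
            return x * 2

        def false_fn(x):
            return x * -1

        # The predicate is a traced tensor bool; both branches must
        # return tensors of matching shape and dtype.
        return cond(x.sum() > 0, true_fn, false_fn, (x,))

ep = torch.export.export(Branchy(), (torch.randn(3),))
print(ep.module()(torch.randn(3)))
```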
@goelayu can you try upgrading your Python version? IIRC that can play a role (3.12 ideally).
@muellerzr still the same error.
- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/goelayus/miniforge3/envs/myenv/bin/accelerate
- Python version: 3.12.7
- Numpy version: 2.1.2
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.59 GB
- GPU type: NVIDIA H100 PCIe
- `Accelerate` default config: Not found
Here's the entire stack trace, in case that helps. It seems like some kind of versioning mismatch.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1006, in _trace_with_export
[rank0]: ep = torch.export.export(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/__init__.py", line 174, in export
[rank0]: return _export(
[rank0]: ^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 945, in wrapper
[rank0]: raise e
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 928, in wrapper
[rank0]: ep = fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 89, in wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 1533, in _export
[rank0]: exported_program = ExportedProgram(
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 246, in __init__
[rank0]: self.verifier().check(self)
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 154, in check
[rank0]: self._check_graph_module(ep.graph_module)
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 220, in _check_graph_module
[rank0]: _check_val(node)
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 62, in _check_val
[rank0]: raise SpecViolationError(f"Node.meta {node.name} is missing val field.")
[rank0]: torch._export.verifier.SpecViolationError: Node.meta _enter_autocast is missing val field.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/goelayus/Research/inference/LLMInfer/gpu-gpu/accelerate/examples/inference/pippy/llama.py", line 38, in <module>
[rank0]: model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 170, in prepare_pippy
[rank0]: stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 87, in build_pipeline
[rank0]: pipe = pipeline(
[rank0]: ^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1229, in pipeline
[rank0]: return Pipe.from_tracing(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1044, in from_tracing
[rank0]: exported_program = Pipe._trace_with_export(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1012, in _trace_with_export
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
[rank0]:[W1014 14:23:47.244784990 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank1]: (traceback identical to rank0's above, omitted)
W1014 14:23:48.161000 140104898389824 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3229053 closing signal SIGTERM
E1014 14:23:48.777000 140104898389824 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3229052) of binary: /home/goelayus/miniforge3/envs/myenv/bin/python3.12
Traceback (most recent call last):
File "/home/goelayus/miniforge3/envs/myenv/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
llama.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-14_14:23:48
host : syrax-41
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3229052)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
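The root cause is the `SpecViolationError` (`Node.meta _enter_autocast is missing val field`), which points at an autocast context captured in the model's graph that the export verifier cannot handle. The surrounding RuntimeError suggests the manual pipeline interfaces as a fallback, which skip tracing entirely. A minimal sketch of that route, with a toy stage module standing in for a real Llama slice (the stage layout, shapes, and `input_args` usage are assumptions against the PyTorch 2.4 API, not taken from the example):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

# Manual (tracer-free) pipelining: each rank builds the nn.Module slice
# it owns, so no torch.export capture is needed. Run under torchrun or
# `accelerate launch` so the process group can initialize.
dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank)

# Toy stage standing in for a slice of the decoder layers.
stage_module = nn.Linear(4096, 4096).to(device)
stage = PipelineStage(
    stage_module,
    stage_index=rank,
    num_stages=world,
    device=device,
    # Example input of one *microbatch* for shape inference.
    input_args=(torch.empty(1, 4096, device=device),),
)

# GPipe schedule over 2 microbatches; rank 0 feeds the full batch.
schedule = ScheduleGPipe(stage, n_microbatches=2)
if rank == 0:
    schedule.step(torch.randn(2, 4096, device=device))
else:
    out = schedule.step()  # last stage returns the pipeline output
dist.destroy_process_group()
```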
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.