Packed dataset errors
I tested torchtune 0.5 on my Linux PC with 4x A6000 GPUs. When I tried to use a packed dataset to speed up training, I hit the error below.
Error
Using flex attention for attention computation since a BlockMask was passed in.
Traceback (most recent call last):
File "/home/cine/miniconda3/envs/tune/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torchtune/_cli/run.py", line 214, in _run_cmd
self._run_single_device(args, is_builtin=is_builtin)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torchtune/_cli/run.py", line 108, in _run_single_device
runpy.run_path(str(args.recipe), run_name="__main__")
File "/home/cine/miniconda3/envs/tune/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/cine/miniconda3/envs/tune/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/cine/miniconda3/envs/tune/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/recipes/lora_finetune_single_device.py", line 803, in <module>
sys.exit(recipe_main())
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
sys.exit(recipe_main(conf))
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/recipes/lora_finetune_single_device.py", line 798, in recipe_main
recipe.train()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/recipes/lora_finetune_single_device.py", line 707, in train
current_loss.backward()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2048, in backward
out = call_compiled_backward()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1954, in call_compiled_backward
CompiledFunction.compiled_bw = aot_config.bw_compiler(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 51, in _wrapped_bw_compiler
return disable(disable(bw_compiler)(*args, **kwargs))
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1466, in bw_compiler
return inner_compile(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
compiled_graph = FxGraphCache.load(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1370, in load
compiled_graph = compile_fx_fn(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1913, in compile_to_fn
return self.compile_to_module().call
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1839, in compile_to_module
return self._compile_to_module()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1867, in _compile_to_module
mod = PyCodeCache.load_by_key_path(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2876, in load_by_key_path
mod = _reload_python_module(key, path)
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_cine/al/calfe7ti75mdabcy4jy6oe7kidirl3nyvoolgrkunuzqik4lzmdn.py", line 830, in <module>
async_compile.wait(globals())
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/async_compile.py", line 276, in wait
scope[key] = result.result()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 3344, in result
self.kernel.precompile()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 250, in precompile
raise e
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 244, in precompile
compiled_binary, launcher = self._precompile_config(
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 452, in _precompile_config
binary._init_handles()
File "/home/cine/miniconda3/envs/tune/lib/python3.10/site-packages/triton/compiler/compiler.py", line 374, in _init_handles
raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
triton.runtime.errors.OutOfResources:
out of resource: shared memory, Required: 131074, Hardware limit: 101376.
Reducing block sizes or `num_stages` may help.
Config
output_dir: ./lora_single_device_output/Llama-2-7b-hf/

# Model Arguments
model:
  _component_: torchtune.models.llama2.lora_llama2_7b
  lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 8  # higher increases accuracy and memory
  lora_alpha: 16  # usually alpha=2*rank
  lora_dropout: 0.0

tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: ./models/Llama-2-7b-hf/tokenizer.model
  max_seq_len: 1024

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ./models/Llama-2-7b-hf
  checkpoint_files: [
    pytorch_model-00001-of-00002.bin,
    pytorch_model-00002-of-00002.bin
  ]
  adapter_checkpoint: null
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  packed: True  # True increases speed
seed: null
shuffle: True
batch_size: 1

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  fused: True
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 8  # Use to increase effective batch size
compile: True  # torch.compile the model + loss, True increases speed + decreases memory

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True

# Environment
device: cuda
dtype: bf16

# Activations Memory
enable_activation_checkpointing: True  # True reduces memory
enable_activation_offloading: True  # True reduces memory

# Show case the usage of pytorch profiler
# Set enabled to False as it's only needed for debugging training
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 5
  active_steps: 2
  num_cycles: 1
I also tried modifying the batch size and related settings:
tokenizer.max_seq_len: 1024
dataset.packed: True
batch_size: 1
gradient_accumulation_steps: 16  # Use to increase effective batch size
You can check a blog post (in Chinese) for more details.
internLM support
By the way, I'd like to open a PR to add internLM support. Is there any work in progress?
Hi @chg0901, thanks for creating the issue. Our packed dataset implementation uses flex attention under the hood to support the necessary block-causal mask while still retaining good performance. Unfortunately there are some nuances here -- specifically, flex attention hardcodes some kernel configs depending on the type of hardware you're using, and these aren't currently optimized for the A6000.
I would check this comment (along with others in the same thread for more context) for one way to get around this in the short term. This is a known issue in PyTorch core (see https://github.com/pytorch/pytorch/issues/133254) and, longer term, the flex attention authors are working on fixing it -- see https://github.com/pytorch/pytorch/pull/137959 (I believe the compute_capability == (8, 6) case in that PR corresponds to the A6000).
So one suggestion is to try hardcoding the kernel options as a temporary fix. If this works, we can try to figure out a way to support this in the interim to make the process a bit less painful.
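In case it helps, here is a rough sketch of what hardcoding the kernel options could look like. This is illustrative only, not torchtune's actual code: the wrapper function, the dummy shapes, and the BLOCK_M/BLOCK_N values are assumptions and would need tuning for your GPU's shared-memory budget.

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Hypothetical wrapper: force smaller tile sizes so the generated Triton kernel
# stays under the A6000's ~100 KB shared-memory limit. The exact keys/values
# are assumptions; see the linked PyTorch issue/PR for hardware-specific configs.
def flex_attention_small_tiles(q, k, v, block_mask=None):
    return flex_attention(
        q, k, v,
        block_mask=block_mask,
        kernel_options={"BLOCK_M": 64, "BLOCK_N": 64},
    )

# Usage sketch with dummy tensors (batch=1, heads=8, seq=1024, head_dim=128)
q = k = v = torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
causal = create_block_mask(lambda b, h, q_idx, kv_idx: q_idx >= kv_idx, None, None, 1024, 1024)
out = torch.compile(flex_attention_small_tiles)(q, k, v, block_mask=causal)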
Regarding internLM, we aren't currently working on enabling it. Can you open a separate issue with a formal feature request? It'd be helpful to gauge community interest before opening a PR.
The following configuration encounters the same issue on L40s.
# Tokenizer
tokenizer:
  _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
  path: /model/Qwen2.5-7B-Instruct/vocab.json
  merges_file: /model/Qwen2.5-7B-Instruct/merges.txt
  max_seq_len: 1024

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  packed: True  # True increases speed
  source: /data/alpaca-cleaned
seed: null
shuffle: True

compile: False
It does work with Qwen2.5-0.5B-Instruct, though.
It likely works on 0.5B because the head_dim is smaller. As another temporary suggestion for anyone blocked, I'm pretty sure you can make it work on L40s if you change the way flex is compiled, letting it find a kernel that is compatible with the CUDA shared memory of the machine, i.e. change
https://github.com/pytorch/torchtune/blob/27fd3a14b04b5c3d428c723ef4a3a27e1595102b/torchtune/modules/attention_utils.py#L25
to:
flex_attention_compiled = torch.compile(flex_attention, dynamic=False, mode="max-autotune")
Alternatively, you can turn off flex attention by hardcoding _SUPPORTS_FLEX_ATTENTION = False, which will still allow packed=True.
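For anyone who wants to see it in context, here's a minimal sketch of the two alternatives in torchtune/modules/attention_utils.py (paraphrased, not the file's exact contents; only one of the two changes is needed):

import torch
from torch.nn.attention.flex_attention import flex_attention

# Option 1: compile flex attention with max-autotune so inductor searches for a
# kernel config that fits this GPU's shared memory (longer compile time).
flex_attention_compiled = torch.compile(flex_attention, dynamic=False, mode="max-autotune")

# Option 2 (instead of the above): disable flex attention entirely; packed=True
# then falls back to F.scaled_dot_product_attention with a dense block-causal mask.
# _SUPPORTS_FLEX_ATTENTION = False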
Does this setting or configuration work for other models, such as the Llama series?
max-autotune will likely work for any model/device, assuming the problem is as described (Triton kernel shared-memory OOM). And if it doesn't, setting _SUPPORTS_FLEX_ATTENTION = False should work, as you fall back to F.sdpa.
To be clear, you will need to uninstall whatever version of torchtune you have, clone the repo, make the change I described (either compile with max-autotune or turn off flex attention), and then pip install . from the local repo.
The following code can run normally now.
flex_attention_compiled = torch.compile(flex_attention, dynamic=False, mode="max-autotune")
With _SUPPORTS_FLEX_ATTENTION = False (falling back to F.sdpa), the training speed drops by 50%.
Finally, do you have any recommended materials for comparing the performance of flex_attention and F.sdpa?
Just to make sure I understand, it works normally with mode="max-autotune", right? Only one of max-autotune or _SUPPORTS_FLEX_ATTENTION = False is necessary. The performance drop, I assume, is for the second option only?
In the case of document masking (which we are doing with packed=True), flex will make use of block sparsity in the attention mask, so the performance gain might come from there. If you want to make F.sdpa faster, you could try different SDPA backends, as they might offer better performance (depending on which backend your current PyTorch version defaults to; try 2.5.1 or nightly). This means making another change to attention_utils.py in _sdpa_or_flex_attention. You can select a backend as follows:
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION]):
    output = nn.functional.scaled_dot_product_attention(...)
and try out SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, or SDPBackend.CUDNN_ATTENTION.
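As a self-contained illustration (dummy shapes, not torchtune's actual _sdpa_or_flex_attention signature), you can loop over the backends and see which ones your hardware/PyTorch build supports:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Dummy tensors roughly matching a 7B-style attention shape
q = k = v = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.bfloat16)

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION):
    try:
        # Restrict SDPA to a single backend for this call
        with sdpa_kernel([backend]):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        print(backend, "ok")
    except RuntimeError as err:  # e.g. "No available kernel"
        print(backend, "unavailable:", err)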
Yes, that's right.
In addition, SDPBackend.EFFICIENT_ATTENTION runs normally, but there is little change in training speed. With SDPBackend.FLASH_ATTENTION and SDPBackend.CUDNN_ATTENTION, a "No available kernel" error occurs.