Add Transformer Engine memory-efficient initialization to convert_model for large models
When converting large models (e.g., Llama-405B) to Transformer Engine layers via the convert_model function, I run into out-of-memory (OOM) errors. This appears to happen because the current implementation keeps both the original and the converted modules in memory while copying weights.
A mechanism to defer weight initialization until after convert_model completes would significantly improve memory efficiency when working with large-scale models.
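One possible mechanism, sketched below purely as an illustration and not what convert_model does today, is to build the converted modules on PyTorch's meta device (so their parameters have no backing storage) and only materialize and fill the weights after the module swap has finished:

```python
import torch
import torch.nn as nn

# Hypothetical deferred-initialization flow (not Accelerate's current API):
# create the converted module skeleton on the meta device, where parameters
# have no allocated storage, then materialize and fill weights afterwards.
with torch.device("meta"):
    # Even a 405B-parameter skeleton fits in memory, since nothing is allocated.
    converted_layer = nn.Linear(16384, 16384, bias=False)

# Later, e.g. right before loading a checkpoint shard, allocate real storage.
converted_layer = converted_layer.to_empty(device="cpu")
with torch.no_grad():
    converted_layer.weight.zero_()  # stand-in for copying the real weights in
```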
Sample accelerate config that OOMs while converting large models:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: "no"
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_process_ip: ****
main_process_port: 29603
main_training_function: main
mixed_precision: bf16
num_machines: 5
num_processes: 40
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
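For context, a rough sketch of the conversion path that hits the OOM, assuming the model is loaded with transformers and then converted with Accelerate's convert_model utility (the model id and exact call sites are illustrative, not the actual training script):

```python
import torch
from transformers import AutoModelForCausalLM
from accelerate.utils import convert_model

# Loading the full model in bf16 already uses a lot of host RAM at 405B scale.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-405B", torch_dtype=torch.bfloat16
)

# convert_model swaps nn.Linear / LayerNorm modules for their Transformer Engine
# equivalents and copies the weights over. While it runs, both the original and
# the converted parameters are resident, which is where the OOM shows up.
convert_model(model, to_transformer_engine=True)
```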
Hi, is #3646 what you're looking for? I'm not entirely familiar with TE.
OK, I don't think I understand your query or issue properly. You're talking about NVIDIA's transformer_engine, right? First, how do you convert an Accelerate LLM to transformer_engine? I searched Google but didn't find anything useful to help me understand this. Can you help me please? @S1ro1 @mayukh-stackav
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.