Megatron-LM
[BUG] Get an AttributeError when trying to convert the llama3-8B model from HF format to mcore format
Describe the bug
I get an AttributeError when trying to convert the llama3-8B model from HF format to mcore format. The error is:
AttributeError: 'Tokenizer' object has no attribute 'vocab_size'
To Reproduce
- I git clone https://github.com/meta-llama/llama3.git and run pip install -e . to install the llama3-0.0.1 wheel, following llama_mistral.md.
- I try to convert the llama3-8B model downloaded from https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main into mcore format, using the script below:
TP=8
PP=2
MODEL_SIZE=llama3-8B
HF_FORMAT_DIR=/workspace/model_weights/llama3-8b
MEGATRON_FORMAT_DIR=${HF_FORMAT_DIR}-tp${TP}-pp${PP}
TOKENIZER_MODEL=${HF_FORMAT_DIR}/original/tokenizer.model
python tools/checkpoint/convert.py \
--model-type GPT \
--loader llama_mistral \
--saver mcore \
--checkpoint-type hf \
--model-size ${MODEL_SIZE} \
--load-dir ${HF_FORMAT_DIR} \
--save-dir ${MEGATRON_FORMAT_DIR} \
--tokenizer-model ${TOKENIZER_MODEL} \
--target-tensor-parallel-size ${TP} \
--target-pipeline-parallel-size ${PP} \
--bf16
Stack trace/logs
Loaded loader_llama_mistral as the loader.
Loaded saver_mcore as the saver.
Starting saver...
Starting loader...
using world size: 1, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float32 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... False
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
calculate_per_token_loss ........................ False
check_for_nan_in_loss_and_grad .................. True
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
create_attention_mask_in_dataloader ............. True
cross_entropy_loss_fusion ....................... False
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... False
ddp_bucket_size ................................. None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
delay_grad_reduce ............................... True
delay_param_gather .............................. False
deprecated_use_mcore_models ..................... False
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_straggler_on_startup .................... False
dist_ckpt_format ................................ torch_dist
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_one_logger ............................... False
encoder_num_layers .............................. 32
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_model_parallel_size ...................... 1
ffn_hidden_size ................................. 14336
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 1024
gradient_accumulation_fusion .................... True
group_query_attention ........................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 4096
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
iteration ....................................... 1
kv_channels ..................................... 128
lazy_mpu_init ................................... None
load ............................................ /workspace/model_weights/llama3-8b
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
make_vocab_size_divisible_by .................... 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 8192
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_bin_files .................................. True
mock_data ....................................... False
moe_aux_loss_coeff .............................. 0.0
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_per_layer_logging ........................... False
moe_router_load_balancing_type .................. aux_loss
moe_router_topk ................................. 2
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_z_loss_coeff ................................ None
nccl_communicator_config_path ................... None
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_experts ..................................... None
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 8
num_workers ..................................... 2
one_logger_entity ............................... hwinf_dcm
one_logger_project .............................. e2e-tracking
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... 128256
params_dtype .................................... torch.float32
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
qk_layernorm .................................... False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rotary_base ..................................... 10000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
spec ............................................ None
split ........................................... None
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
swiglu .......................................... True
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. /workspace/model_weights/llama3-8b/original/tokenizer.model
tokenizer_type .................................. Llama3Tokenizer
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... True
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_legacy_models ............................... False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. True
use_tp_pp_dp_mapping ............................ False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... 128256
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 1
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1024
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
Traceback (most recent call last):
File "/workspace/megatron/tools/checkpoint/convert.py", line 154, in <module>
Loader exited, exiting saver
main()
File "/workspace/megatron/tools/checkpoint/convert.py", line 147, in main
loader.load_checkpoint(queue, args)
File "/workspace/megatron/tools/checkpoint/loader_llama_mistral.py", line 663, in load_checkpoint
_load_checkpoint(queue, args)
File "/workspace/megatron/tools/checkpoint/loader_llama_mistral.py", line 563, in _load_checkpoint
md.true_vocab_size = tokenizer.vocab_size
AttributeError: 'Tokenizer' object has no attribute 'vocab_size'
Environment (please complete the following information):
- Megatron-LM commit ID: 86850db
- Nvidia pytorch docker image: nvcr.io/nvidia/pytorch:24.06-py3
Additional findings:
It seems that tools/checkpoint/convert.py calls a function from tools/checkpoint/loader_llama_mistral.py to load the HF format checkpoint, and in the llama3 case it uses Llama3Tokenizer to get the true vocab size, as shown around lines 557~563 of tools/checkpoint/loader_llama_mistral.py.
But when I check the Llama3Tokenizer definition, which comes from https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py, I do not find a vocab_size attribute. It seems the same value is exposed by the following line (defined at line 86 of https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py):
self.n_words: int = self.model.n_vocab
So I change md.true_vocab_size = tokenizer.vocab_size to md.true_vocab_size = tokenizer.n_words (see the sketch below), and the model then converts to mcore format successfully.
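For reference, this is roughly the local change I made (a minimal sketch; only the original assignment is taken from the stack trace, the comments are my own description):

```python
# tools/checkpoint/loader_llama_mistral.py, around line 563 (inside _load_checkpoint).
# The llama3 Tokenizer from meta-llama/llama3 exposes n_words (set from
# self.model.n_vocab in llama/tokenizer.py) rather than vocab_size:
md.true_vocab_size = tokenizer.n_words  # was: md.true_vocab_size = tokenizer.vocab_size
```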
But I'm still not sure whether this is a bug or whether something I did wrong caused the llama3-8B conversion to fail.
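If it is indeed a bug, maybe the loader could read the vocab size defensively, so it works both for tokenizer classes that expose vocab_size and for the llama3 tiktoken-based Tokenizer that exposes n_words. This is just a sketch of the idea, only tested against my llama3-8B case, and the helper name is made up:

```python
def _get_true_vocab_size(tokenizer):
    """Return the tokenizer's vocabulary size, whichever attribute exposes it.

    Some tokenizer classes handled by loader_llama_mistral.py expose vocab_size,
    while the llama3 Tokenizer from meta-llama/llama3 exposes n_words
    (set from self.model.n_vocab in llama/tokenizer.py).
    """
    for attr in ("vocab_size", "n_words"):
        if hasattr(tokenizer, attr):
            return getattr(tokenizer, attr)
    raise AttributeError(
        f"{type(tokenizer).__name__} exposes neither 'vocab_size' nor 'n_words'"
    )


# Hypothetical usage in _load_checkpoint:
#     md.true_vocab_size = _get_true_vocab_size(tokenizer)
```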