Error loading Llama pretrained checkpoint for NeVa (LLaVA)
When I train the NeVa model, I get the following error:
[NeMo I 2024-04-12 03:38:58 neva_model:252] Loading LLM weights from checkpoint /home/nemo/llama_weights/vicuna-2-7b.nemo
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.val_check_interval=1000', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=1000', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=2', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=16384', 'model.num_attention_heads=32', 'model.normalization=layernorm1p', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.activation=squared-relu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=0.5', 'model.num_query_groups=null', 'model.data.num_workers=0', 'model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo', 'model.mm_cfg.llm.model_type=nvgpt', 'model.data.conv_template=nvgpt', 'model.mm_cfg.vision_encoder.from_pretrained=/home/nemo/openai_weights/clip-vit-large-patch14-336', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.data.image_token_len=256', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', 'exp_manager.wandb_logger_kwargs.project=neva_demo']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/tensorstore.py", line 123, in open_ts_array
    arr = ts.open(ts.Spec(spec), open=True).result()
ValueError: NOT_FOUND: Error opening "zarr" driver: Metadata at local file "/tmp/tmpe2_bw_kv/model_weights/model.decoder.layers.self_attention.linear_qkv.layer_norm_bias/.zarray" does not exist [source locations='tensorstore/driver/kvs_backed_chunk_driver.cc:1255\ntensorstore/driver/driver.cc:114'] [tensorstore_spec='{"context":{"cache_pool":{},"data_copy_concurrency":{},"file_io_concurrency":{},"f
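For debugging, the sketch below lists the tensor keys stored under model_weights/ inside the converted checkpoint, so one can check whether the key named in the traceback, model.decoder.layers.self_attention.linear_qkv.layer_norm_bias, exists there at all. It only assumes the .nemo file is a plain tar archive (which is how NeMo packages checkpoints); the path is from my setup.

# Hedged debugging sketch: enumerate sharded-tensor keys inside a .nemo archive.
import tarfile

NEMO_PATH = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # path from my setup

with tarfile.open(NEMO_PATH) as tar:
    keys = set()
    for member in tar.getnames():
        # member names look like "model_weights/<tensor_key>/...", possibly prefixed with "./"
        parts = member.lstrip("./").split("/")
        if len(parts) >= 2 and parts[0] == "model_weights":
            keys.add(parts[1])
    for key in sorted(keys):
        print(key)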
Steps/Code to reproduce bug
First, I used the following script to convert the Llama HF checkpoint to a NeMo checkpoint (I tried both Vicuna and Llama and got the same error):
python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /data1/weight/llama_weights/models--lmsys--vicuna-7b-v1.5 --output_path /home/nemo/llama_weights/vicuna-2-7b.nemo
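As a sanity check on the conversion output, a small sketch like the one below can print a few fields from the model_config.yaml packaged inside the .nemo archive. This assumes the usual .nemo tar layout; the path and the chosen field names are from my setup and may differ between NeMo versions.

# Hedged sketch: print selected fields from model_config.yaml inside a .nemo archive.
import tarfile
import yaml

NEMO_PATH = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # path from my setup

with tarfile.open(NEMO_PATH) as tar:
    for member in tar.getmembers():
        if member.name.lstrip("./") == "model_config.yaml":
            cfg = yaml.safe_load(tar.extractfile(member))
            # fields of interest for this issue; missing fields print as None
            for key in ("normalization", "activation", "num_layers", "hidden_size"):
                print(key, cfg.get(key))
            break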
Then I launched the training process (I tried both 1 GPU and 8 GPUs and got the same error):
CUDA_VISIBLE_DEVICES=2 NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=1 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
trainer.precision=bf16 \
trainer.num_nodes=1 \
trainer.devices=1 \
trainer.val_check_interval=1000 \
trainer.limit_val_batches=5 \
trainer.log_every_n_steps=1 \
trainer.max_steps=1000 \
model.megatron_amp_O2=True \
model.micro_batch_size=1 \
model.global_batch_size=2 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.mcore_gpt=True \
model.transformer_engine=True \
model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain \
model.tokenizer.library=sentencepiece \
model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model \
model.encoder_seq_length=4096 \
model.num_layers=32 \
model.hidden_size=4096 \
model.ffn_hidden_size=16384 \
model.num_attention_heads=32 \
model.normalization=layernorm1p \
model.do_layer_norm_weight_decay=False \
model.apply_query_key_layer_scaling=True \
model.activation=squared-relu \
model.headscale=False \
model.position_embedding_type=rope \
model.rotary_percentage=0.5 \
model.num_query_groups=null \
model.data.num_workers=0 \
model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo \
model.mm_cfg.llm.model_type=nvgpt \
model.data.conv_template=nvgpt \
model.mm_cfg.vision_encoder.from_pretrained='/home/nemo/openai_weights/clip-vit-large-patch14-336' \
model.mm_cfg.vision_encoder.from_hf=True \
model.data.image_token_len=256 \
model.optim.name="fused_adam" \
exp_manager.create_checkpoint_callback=True \
exp_manager.create_wandb_logger=False \
exp_manager.wandb_logger_kwargs.project=neva_demo
Expected behavior
The training should start.
Environment overview (please complete the following information)
I am on the main branch, and I use the following Docker command:
sudo docker run --runtime=nvidia --gpus all -it --rm -v ~/project/NeMo:/opt/NeMo \
-v /home/nemo:/home/nemo \
-v /data1:/data1 \
--shm-size=8g -p 8888:8888 \
--ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.01.speech
Environment details
I tried to build NeMo from source inside the container; however, it does not work.
Additional context
8x H800 GPUs.
I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.
Has this error been solved? It seems that _load_state_dict_from_disk expects a model.ckpt file, but untarring model.nemo produces a model_weights folder instead.
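For what it's worth, here is a quick way to check which layout a given .nemo file actually has (assuming it is a plain tar archive; the path is taken from the report above):

# Hedged sketch: a distributed (sharded) checkpoint shows a model_weights directory,
# while an older-style checkpoint ships a single weights file instead
# (the comment above mentions model.ckpt).
import tarfile

with tarfile.open("/home/nemo/llama_weights/vicuna-2-7b.nemo") as tar:
    top_level = sorted({name.lstrip("./").split("/")[0] for name in tar.getnames()})
    print(top_level)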
Same issue here. I managed to run the pretraining script by setting model.mm_cfg.llm.from_pretrained=null, and it works, but then it seems to pretrain the LLM from scratch (?)