
pretrain error

Open paulpaul91 opened this issue 1 year ago • 29 comments

[screenshot of the error message]

paulpaul91 avatar Apr 20 '23 14:04 paulpaul91

Hi, thank you for your interest in our work.

Can you provide the log that contains the actual error message? It should be somewhere above this error message. You may wrap the error log with ``` to make it easier to read in GitHub. Thanks.

haotian-liu avatar Apr 20 '23 16:04 haotian-liu

Hi, I encountered the same error during fine-tuning; it seems to run out of RAM.

yix-chen avatar Apr 21 '23 03:04 yix-chen

I have a similar issue.

I can run it with 4 GPUs, but not with 8 GPUs (A100, 80 GB memory each, 1.96 TB CPU RAM).

My command is:

LOGLEVEL=INFO TORCHELASTIC_ENABLE_FILE_TIMER=1 torchrun --nnodes=1 --nproc_per_node=4 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path /mnt/bd/xxxx/weights/llama-dl-main/vicuna_13B \
    --data_path /mnt/bd/xxxx/experiment/LLaVA/LLaVA-Instruct-150K/chat.json \
    --image_folder /mnt/bd/xxxx/experiment/LLaVA/LLaVA-Instruct-150K/images \
    --vision_tower openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir ./checkpoints/llava-13b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --debug underflow_overflow \
    --report_to none

Logs:

LOGLEVEL=INFO TORCHELASTIC_ENABLE_FILE_TIMER=1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
>     llava/train/train_mem.py \
>     --model_name_or_path /mnt/bd/data-tns-algo-masp-llm/weights/llama-dl-main/vicuna_13B \
>     --data_path /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/LLaVA-Instruct-150K/chat.json \
>     --image_folder /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/LLaVA-Instruct-150K/images \
>     --vision_tower openai/clip-vit-large-patch14 \
>     --tune_mm_mlp_adapter True \
>     --mm_vision_select_layer -2 \
>     --mm_use_im_start_end \
>     --bf16 True \
>     --output_dir ./checkpoints/llava-13b-pretrain \
>     --num_train_epochs 1 \
>     --per_device_train_batch_size 16 \
>     --per_device_eval_batch_size 4 \
>     --gradient_accumulation_steps 2 \
>     --evaluation_strategy "no" \
>     --save_strategy "steps" \
>     --save_steps 2400 \
>     --save_total_limit 1 \
>     --learning_rate 2e-3 \
>     --weight_decay 0. \
>     --warmup_ratio 0.03 \
>     --lr_scheduler_type "cosine" \
>     --logging_steps 1 \
>     --tf32 True \
>     --model_max_length 2048 \
>     --gradient_checkpointing True \
>     --lazy_preprocess True \
>     --debug underflow_overflow \
>     --report_to none

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : llava/train/train_mem.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:25001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic__5qd21x9/none_xfx__n23
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=25001
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Starting a FileTimerServer with /tmp/watchdog_timer_01ab64a1-b00c-4ad2-ae61-4607b03008e7 ...
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:FileTimerServer started
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic__5qd21x9/none_xfx__n23/attempt_0/7/error.json
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39409 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39410 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39411 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39414 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39415 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39416 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 4 (pid: 39413) of binary: /mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00801992416381836 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 4 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
llava/train/train_mem.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-01_11:41:43
  host      : n193-016-074.byted.org
  rank      : 4 (local_rank: 4)
  exitcode  : -9 (pid: 39413)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 39413
======================================================

tensorboy avatar May 01 '23 03:05 tensorboy

@tensorboy, thank you for your interest in our work.

I noticed that you are using the instruction tuning dataset in the pretraining stage. This is likely the reason for the OOM, as the instruction tuning dataset has much longer responses compared to the short captions in CC3M.

Is there any specific reason you want to use the instruction tuning data in the pretraining stage? If so, you may try FSDP. Note that this is experimental, as PyTorch has not officially supported PEFT for FSDP yet. I have locally verified with very few iterations (400 iters x 8 GPUs x 16 per_gpu_batch_size) that the model behavior is reasonable and similar to that without FSDP. Make sure to use the latest code base, as an issue was fixed this afternoon.

Thanks.

haotian-liu avatar May 01 '23 03:05 haotian-liu

@yix-chen Sorry just saw this comment. Are you still facing this issue, and if so, can you share your hardware setting (GPU type and count), and the command, as well as the error logs? Thanks.

haotian-liu avatar May 01 '23 03:05 haotian-liu

Thanks for the quick reply. I think I've made a mistake with the dataset, but where can I download the CC3M dataset?

tensorboy avatar May 01 '23 03:05 tensorboy

@tensorboy Please check out the download instructions here, thanks.

haotian-liu avatar May 01 '23 03:05 haotian-liu

@tensorboy Please check out the download instructions here, thanks.

Thanks for the quick feedback. I'm still confused about the pretraining command:

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path ./checkpoints/llama-vicuna-13b \
    --data_path /path/to/cc3m_595k.json \
    --image_folder /path/to/cc3m_595k \
    --vision_tower openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir ./checkpoints/llava-13b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

I cannot find cc3m_595k.json; the most relevant file is chat.json here: https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/tree/main

tensorboy avatar May 01 '23 04:05 tensorboy

Hi @tensorboy, chat.json in this repo is the correct one to go with. You may also download and see if len(json.load(...)) is roughly 595K.
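
For example, a quick sanity check could look like this (the path below is just a placeholder for wherever the file was downloaded):

```python
import json

# Count the entries in the pretraining annotation file; the path is an example.
with open("/path/to/LLaVA-CC3M-Pretrain-595K/chat.json") as f:
    data = json.load(f)
print(len(data))  # should be roughly 595K for the CC3M pretraining subset
```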

haotian-liu avatar May 01 '23 04:05 haotian-liu

Hi @tensorboy, chat.json in this repo is the correct one to go with. You may also download and see if len(json.load(...)) is roughly 595K.

It's 595375.

tensorboy avatar May 01 '23 04:05 tensorboy

Yep, that's correct :)

haotian-liu avatar May 01 '23 04:05 haotian-liu

Thank you. What are your suggestions now?

tensorboy avatar May 01 '23 04:05 tensorboy

I am confused. Do you mean that although the file is named LLaVA-Instruct-150K/chat.json, it actually comes from this CC3M dataset instead? Because we have another dataset, LLaVA-Instruct-150K.

haotian-liu avatar May 01 '23 04:05 haotian-liu

I am confused. Do you mean that although the file is named LLaVA-Instruct-150K/chat.json, it actually comes from this CC3M dataset instead?

Yes, I think I've put all the data in that directory (LLaVA-Instruct-150K).

tensorboy avatar May 01 '23 04:05 tensorboy

This is strange. Can you try seeing whether fine-tuning works? You can download ./checkpoints/mm_projector/llava-13b-pretrain.bin here. Since you are using 8x A100-80GB, this should not cause OOM at all.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path /path/to/llama-vicuna-13b \
    --data_path /path/to/llava_instruct_150k.json \
    --image_folder /Data/haotian/coco/train2014 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/mm_projector/llava-13b-pretrain.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

haotian-liu avatar May 01 '23 04:05 haotian-liu

@haotian-liu let me try it now

tensorboy avatar May 01 '23 04:05 tensorboy

Also, can you monitor both the GPU memory and CPU RAM usage while you are running the code? Are they changing before the error is thrown? Your CPU RAM / GPU memory should be more than sufficient, but we can see what is happening. The peak CPU memory usage may be ~500 GB in your case.
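
A minimal monitoring sketch that could be run in a separate terminal while training (this assumes psutil is installed and nvidia-smi is on the PATH; it is just one way to watch both numbers):

```python
import subprocess
import time

import psutil  # assumption: installed via `pip install psutil`

# Poll system RAM and per-GPU memory every few seconds, so we can see which
# one climbs right before the crash.
while True:
    ram = psutil.virtual_memory()
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip().replace("\n", " | ")
    print(f"CPU RAM: {ram.used / 1e9:.0f} / {ram.total / 1e9:.0f} GB | GPU used: {gpu}")
    time.sleep(5)
```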

haotian-liu avatar May 01 '23 04:05 haotian-liu

Also, can you monitor both the GPU memory and CPU RAM usage while you are running the code? Are they changing before the error is thrown? Your CPU RAM / GPU memory should be more than sufficient, but we can see what is happening. The peak CPU memory usage may be ~500 GB in your case.

CPU usage is around 400 GB at peak, but GPU memory stays at 3 MB / 81251 MB until it crashes.

tensorboy avatar May 01 '23 04:05 tensorboy

It seems that it does not even start loading the checkpoints, and it crashes while just initializing the model. Honestly, I am not sure what the cause is, as I have not encountered such an error before.

INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 4 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html

Do you think this @record is useful for this case? It seems to be a similar issue to https://github.com/lm-sys/FastChat/issues/627. Unfortunately, there is no solution provided there either.
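
Since exit code -9 means the worker received SIGKILL, one more thing that may be worth checking is whether the Linux OOM killer was involved; the kernel log usually records it. A small sketch (assuming dmesg is readable on the host):

```python
import subprocess

# Scan the kernel ring buffer for OOM-killer entries; a line mentioning the
# failed worker PID would confirm the process was killed for running out of memory.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "Out of memory" in line or "oom-kill" in line or "Killed process" in line:
        print(line)
```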

haotian-liu avatar May 01 '23 04:05 haotian-liu

lm-sys/FastChat#627

It's exactly the same issue as in that FastChat thread. I'm not sure what @record is or how to use it in your code either.

tensorboy avatar May 01 '23 04:05 tensorboy

Same errors here. Log:

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
> llava/train/train_mem.py \
> --model_name_or_path /mnt/bd/data-tns-algo-masp-llm/weights/llama-dl-main/vicuna_13B  \
> --data_path /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/LLaVA-Instruct-150K/llava_instruct_150k.json \
> --image_folder /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/data/coco/train2014 \
> --vision_tower openai/clip-vit-large-patch14 \
> --pretrain_mm_mlp_adapter /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/checkpoints/llava-13b-pretrain/mm_projector.bin \
> --mm_vision_select_layer -2 \
> --mm_use_im_start_end True \
> --bf16 True \
> --output_dir ./checkpoints \
> --num_train_epochs 3 \
> --per_device_train_batch_size 4 \
> --per_device_eval_batch_size 4 \
> --gradient_accumulation_steps 1 \
> --evaluation_strategy "no" \
> --save_strategy "steps" \
> --save_steps 5000 \
> --save_total_limit 3 \
> --learning_rate 2e-5 \
> --weight_decay 0. \
> --warmup_ratio 0.03 \
> --lr_scheduler_type "cosine" \
> --logging_steps 1 \
> --tf32 True \
> --fsdp "full_shard auto_wrap" \
> --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
> --model_max_length 2048 \
> --gradient_checkpointing True \
> --lazy_preprocess True \
> --report_to wandb


/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144298 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144299 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144300 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144301 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144302 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144303 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 144304 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 144305) of binary: /mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/bin/python
Traceback (most recent call last):
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
llava/train/train_mem.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-01_13:00:38
  host      : n193-016-074.byted.org
  rank      : 7 (local_rank: 7)
  exitcode  : -9 (pid: 144305)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 144305
=======================================================

tensorboy avatar May 01 '23 05:05 tensorboy

Hi, a random thought: can this be related to ulimit?

Specifically, what's ulimit -u, ulimit -v, ulimit -m on your machine?

And what's your OS type/version, and CUDA version (just to find the disparity between our machines)?
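
If it is easier, these can also be printed from inside the Python environment with the standard resource module (a small sketch; RLIM_INFINITY corresponds to "unlimited"):

```python
import resource

# Limits corresponding to ulimit -u, ulimit -v, and ulimit -m.
limits = {
    "ulimit -u (max user processes)": resource.RLIMIT_NPROC,
    "ulimit -v (virtual memory)": resource.RLIMIT_AS,
    "ulimit -m (max resident set size)": resource.RLIMIT_RSS,
}
for name, rlim in limits.items():
    soft, hard = resource.getrlimit(rlim)
    show = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
    print(f"{name}: soft={show(soft)}, hard={show(hard)}")
```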

haotian-liu avatar May 01 '23 05:05 haotian-liu

Hi, a random thought: can this be related to ulimit?

Specifically, what's ulimit -u, ulimit -v, ulimit -m on your machine?

And what's your OS type/version, and CUDA version (just to find the disparity between our machines)?

All of them (ulimit -u, ulimit -v, ulimit -m) output:

unlimited

tensorboy avatar May 01 '23 05:05 tensorboy

Sad, then that's not the cause either. I am not sure what else we can do to debug this. One last attempt is to try what ChatGPT suggests about @record:

from torch.distributed.elastic.multiprocessing.errors import record  
  
@record  
def main():  
    # Your code here  

So maybe we can add this to llava/train/train.py?

@record
def train():
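
For reference, a fuller sketch of what that could look like in llava/train/train.py (assuming the module's entry point is a train() function invoked under __main__, as in the current code base):

```python
from torch.distributed.elastic.multiprocessing.errors import record


@record  # on failure, writes the worker's traceback to the per-rank error file
def train():
    ...  # existing training logic in llava/train/train.py


if __name__ == "__main__":
    train()
```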

haotian-liu avatar May 01 '23 05:05 haotian-liu

from torch.distributed.elastic.multiprocessing.errors import record  
  
@record 

Same errors:

LOGLEVEL=INFO TORCHELASTIC_ENABLE_FILE_TIMER=1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001     llava/train/train_mem.py     --model_name_or_path /mnt/bd/data-tns-algo-masp-llm/weights/llama-dl-main/vicuna_13B     --data_path /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/LLaVA-Instruct-150K/chat.json     --image_folder /mnt/bd/data-tns-algo-masp-llm/experiment/LLaVA/LLaVA-Instruct-150K/images     --vision_tower openai/clip-vit-large-patch14     --tune_mm_mlp_adapter True     --mm_vision_select_layer -2     --mm_use_im_start_end     --bf16 True     --output_dir ./checkpoints/llava-13b-pretrain     --num_train_epochs 1     --per_device_train_batch_size 16     --per_device_eval_batch_size 4     --gradient_accumulation_steps 1     --evaluation_strategy "no"     --save_strategy "steps"     --save_steps 2400     --save_total_limit 1     --learning_rate 2e-3     --weight_decay 0.     --warmup_ratio 0.03     --lr_scheduler_type "cosine"     --logging_steps 1     --tf32 True     --model_max_length 2048     --gradient_checkpointing True     --lazy_preprocess True     --debug underflow_overflow     --report_to none
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : llava/train/train_mem.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:25001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=25001
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Starting a FileTimerServer with /tmp/watchdog_timer_f83bf4ec-c486-4774-9b56-c1eda1f9192d ...
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:FileTimerServer started
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/7/error.json
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216958 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216959 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216960 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216961 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216962 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216963 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 216964 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 216957) of binary: /mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.008567571640014648 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 0 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/bd/data-tns-algo-masp-llm/environment/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
llava/train/train_mem.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-01_13:32:26
  host      : n193-016-074.byted.org
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 216957)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 216957
=======================================================

tensorboy avatar May 01 '23 05:05 tensorboy

INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/7/error.json

Anything in these files?

haotian-liu avatar May 01 '23 05:05 haotian-liu

INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_zg87tjfx/none_hh2up_uh/attempt_0/7/error.json

Anything in these files?

All of them are empty: [screenshot: Screen Shot 2023-04-30 at 10:41:01 PM]

tensorboy avatar May 01 '23 05:05 tensorboy

😮‍💨 I have no more ideas now. If you somehow solve this issue, please let me know and maybe also share the solution here. Thanks!

haotian-liu avatar May 01 '23 05:05 haotian-liu

I will for sure come back to let you know if I can solve it.

tensorboy avatar May 01 '23 05:05 tensorboy

I will for sure come back to let you know if I can solve it.

Hello, how's it going?

LetsGoFir avatar Jun 06 '23 02:06 LetsGoFir