Training fails with tons of missing imports in llava_trainer.py

Open peterwisu opened this issue 2 months ago • 5 comments

I'm trying to train a model, but llava/train/llava_trainer.py has broken imports everywhere.

I followed the installation steps in the README.md:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"

But when I run LLaVA-NeXT/scripts/train/pretrain_siglip.sh, which calls llava/train/train_mem.py -> train.py -> llava_trainer.py, I get errors like:

NameError: name 'get_model_param_count' is not defined
NameError: name 'is_torch_xla_available' is not defined
NameError: name 'DistributedType' is not defined
NameError: name 'DebugOption' is not defined

This happens because the trainer file is missing imports for functions that are actually used in the code: they are called but never imported anywhere. I had to manually add these imports just to get past the first few errors:

from transformers.debug_utils import DebugOption, DebugUnderflowOverflow
from transformers.integrations import deepspeed_init
from transformers import TrainerState
from transformers.trainer_pt_utils import get_model_param_count
from transformers.utils import is_torch_xla_available
from accelerate.utils import DistributedType
import math
import time
import numpy as np
import sys
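
Adding imports one NameError at a time is slow, because each missing name only surfaces when training reaches the line that uses it. As a rough sketch (not part of the repository), the standard-library ast module can list names that a file reads but never binds, without running anything. The helper name find_unbound_names and the sample source are hypothetical, and the check deliberately ignores scoping, so it can miss some cases.

```python
import ast
import builtins


def find_unbound_names(source: str) -> list[str]:
    """Return names that are read somewhere in `source` but never bound in it.

    Coarse on purpose: scoping is ignored, so a name bound anywhere in the
    file counts as defined everywhere.
    """
    tree = ast.parse(source)
    bound = set(dir(builtins))
    loaded = set()

    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a as b` binds `b`.
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
            else:  # Store or Del context binds (or unbinds) the name
                bound.add(node.id)

    return sorted(loaded - bound)


# Hypothetical sample that mimics the situation: `math` is imported,
# but `time` and `get_model_param_count` are used without an import.
sample = '''
import math

def inner_loop(steps):
    started = time.time()
    total = get_model_param_count(steps)
    return math.ceil(total / steps), started
'''

print(find_unbound_names(sample))  # ['get_model_param_count', 'time']
```

Running the same function on the contents of llava/train/llava_trainer.py should, under these assumptions, list the remaining missing names in one pass. A linter such as pyflakes reports undefined names more precisely.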

Is this a bug in the code or am I missing something in my installation/setup?

peterwisu avatar Sep 15 '25 00:09 peterwisu

Update: Found the root cause

I found the reason for these import errors. They were introduced by PR #469, which added MeZO support to the LLaVA trainer. Supporting MeZO required overriding the _inner_training_loop function of the HuggingFace Trainer (as seen in this commit). However, the names used inside the overridden _inner_training_loop are never imported in llava_trainer.py. The imports that the original HF Trainer relies on need to be added to fix this issue.
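
The mechanism can be reproduced in isolation. In Python, a function looks up global names in the module where it is defined, so a method body copied from one module into a subclass in another module loses access to the first module's imports. The sketch below uses two throwaway in-memory modules; fake_library, fake_trainer, Base, Child, and loop are all hypothetical stand-ins, not the actual HF or LLaVA code.

```python
import types

# Hypothetical stand-in for the library module that owns the original method.
fake_library = types.ModuleType("fake_library")
exec(
    "import math\n"
    "class Base:\n"
    "    def loop(self, n):\n"
    "        return math.ceil(n / 2)\n",
    fake_library.__dict__,
)

# Hypothetical stand-in for the module that copies the method into a subclass.
# It can see Base, but it never imports math itself.
fake_trainer = types.ModuleType("fake_trainer")
fake_trainer.Base = fake_library.Base
exec(
    "class Child(Base):\n"
    "    def loop(self, n):\n"
    "        return math.ceil(n / 2)  # copied body, no `import math` here\n",
    fake_trainer.__dict__,
)

# The original method resolves `math` through its own module's globals.
original_result = fake_library.Base().loop(3)
print(original_result)  # 2

# The copied method resolves `math` through fake_trainer's globals and fails,
# but only when the method is actually called.
try:
    fake_trainer.Child().loop(3)
    copied_error = None
except NameError as exc:
    copied_error = exc

print(type(copied_error).__name__, copied_error)
```

The class definition itself succeeds. The failure appears only at call time, which matches the errors showing up once training starts instead of at import.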

@tatarinovst2 Could you add the missing imports from your MeZO implementation? @Luodian This is blocking users from training - might need a quick fix.

peterwisu avatar Sep 15 '25 02:09 peterwisu

Did you encounter any other errors? When I ran LLaVA-NeXT/scripts/train/pretrain_siglip.sh, I got the following errors:

NameError: name 'TrainOutput' is not defined
NameError: name 'plot_graphs_based_on_log_history' is not defined
NameError: name 'speed_metrics' is not defined

Sun-9923 avatar Sep 18 '25 13:09 Sun-9923

The same problem.

LunaAndEndymion avatar Sep 22 '25 06:09 LunaAndEndymion

I think this is probably the same problem. You can work around it by checking out the repository as it was before PR https://github.com/LLaVA-VL/LLaVA-NeXT/pull/469 was merged.

peterwisu avatar Sep 23 '25 00:09 peterwisu

I’m running into the same problem. I opened PR #493 that adds the missing imports and applies minor formatting updates in llava/train/llava_trainer.py. After this change, scripts/train/pretrain_clip.sh runs without import errors on my side.

I’d be grateful for any feedback or suggestions for improvement. Thanks!

naufalso avatar Oct 21 '25 14:10 naufalso