Chinese-CLIP
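This problem is driving me crazy and I can't find a fix. Could someone knowledgeable take a look?
Running the sh script always fails with an unrecognized argument, main.py: error: unrecognized arguments: --accum-freq=1, even though the script is identical to the example.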
`usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
[--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
[--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
[--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
[--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
[--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: unrecognized arguments: --accum-freq=1
[2024-04-11 23:52:11,183] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 5808) of binary: /home/amax/.conda/envs/lxl/bin/python3
Traceback (most recent call last):
File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 816, in <module>
main()
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-11_23:52:11
host : amax
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 5808)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
`
Could you try the command below and see what happens?
First cd sdb1/lxl2/Chinese-CLIP-master/
python cn_clip/training/main.py \
    --train-data=${train_data} \
    --val-data=${val_data} \
    --resume=${resume} \
    ${reset_data_offset} \
    ${reset_optimizer} \
    --logs=${output_base_dir} \
    --name=${name} \
    --save-step-frequency=${save_step_frequency} \
    --save-epoch-frequency=${save_epoch_frequency} \
    --log-interval=${log_interval} \
    ${report_training_batch_acc} \
    --context-length=${context_length} \
    --warmup=${warmup} \
    --batch-size=${batch_size} \
    --valid-batch-size=${valid_batch_size} \
    --valid-step-interval=${valid_step_interval} \
    --valid-epoch-interval=${valid_epoch_interval} \
    --lr=${lr} \
    --accum_freq=${accum_freq} \
    --wd=${wd} \
    --max-epochs=${max_epochs} \
    --vision-model=${vision_model} \
    ${use_augment} \
    --text-model=${text_model} \
    --grad-checkpointing
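You can also open cn_clip/training/params.py and search for accum-freq to check whether that argument exists.
If you want to run distributed, you can also check the processes with ps -ef | grep main.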
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ python cn_clip/training/main.py
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
[--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
[--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
[--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
[--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
[--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: the following arguments are required: --train-data
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --train-data=${train_data}
--train-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --val-data=${val_data}
--val-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --resume=${resume}
--resume=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_data_offset}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_optimizer}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --logs=${output_base_dir}
--logs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --name=${name}
--name=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-step-frequency=${save_step_frequency}
--save-step-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-epoch-frequency=${save_epoch_frequency}
--save-epoch-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --log-interval=${log_interval}
--log-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${report_training_batch_acc}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --context-length=${context_length}
--context-length=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --warmup=${warmup}
--warmup=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --batch-size=${batch_size}
--batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-batch-size=${valid_batch_size}
--valid-batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-step-interval=${valid_step_interval}
--valid-step-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-epoch-interval=${valid_epoch_interval}
--valid-epoch-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --lr=${lr}
--lr=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --accum_freq=${accum_freq}
--accum_freq=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --wd=${wd}
--wd=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --max-epochs=${max_epochs}
--max-epochs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --vision-model=${vision_model}
--vision-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${use_augment}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --text-model=${text_model}
--text-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --grad-checkpointing
--grad-checkpointing: command not found
Hello, the output is shown above. Also, params.py does contain the accum-freq argument.
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ps -ef | grep main
amax 7490 4067 0 12:41 pts/0 00:00:00 grep --color=auto main
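Put this command into your sh script in place of the original torchrun command and then run the script; it is not meant to be typed directly into the terminal like this. For example, remove the part of the script shown in green below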
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
Traceback (most recent call last):
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 16, in <module>
from cn_clip.clip import load
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/__init__.py", line 4, in <module>
from .model import convert_state_dict
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/model.py", line 16, in <module>
FlashMHA = importlib.import_module('flash_attn.flash_attention').FlashMHA
File "/home/amax/.conda/envs/lxl/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attention.py", line 7, in <module>
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 5, in <module>
import flash_attn_cuda
ImportError: /home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
This is the result after doing what you said: cd first, then replace the command line in the script.
/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalI
Your linux-gnu.so has a dependency problem. Please see https://github.com/open-mmlab/mmdetection3d/issues/1152 for a fix to a similar issue.
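I tried the fix from issue 1152, but it still doesn't work. That issue appears to be about mmcv, whereas mine is about flash-attn. I also tried fixes from flash-attn's own issues, with no luck. It seems flash-attn supports torch 1.12 and above, while mine is 1.10. I don't intend to use flash-attn anyway, so how can I disable or ignore the flash-attn-related parts of the code?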
pip uninstall flash_attn
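For anyone who cannot uninstall the package, another option might be to guard the import that fails at cn_clip/clip/model.py line 16 in the traceback above. The following is only a sketch under that assumption: `optional_attr` is a hypothetical helper that does not exist in Chinese-CLIP, and whether the rest of the code copes with `FlashMHA` being `None` has not been checked here.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module_name.attr_name, or None if the import or lookup fails.

    Hypothetical helper for illustration; not part of Chinese-CLIP.
    An "undefined symbol" failure in a compiled extension surfaces as
    ImportError, so it is caught here as well.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except (ImportError, AttributeError) as exc:
        print(f"{module_name}.{attr_name} unavailable, continuing without it: {exc}")
        return None


# Same module and class name as in the traceback; None when flash_attn is
# missing or was built against a different torch version.
FlashMHA = optional_attr("flash_attn.flash_attention", "FlashMHA")
```

Code that uses `FlashMHA` would then need to check for `None` before instantiating it.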
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--valid-num-workers VALID_NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc]
[--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL]
[--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
[--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
[--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
[--grad-checkpointing] [--use-flash-attention] [--gather-with-grad] [--skip-aggregate] [--debug] [--seed SEED] [--distllation] [--teacher-model-name TEACHER_MODEL_NAME] [--kd_loss_weight KD_LOSS_WEIGHT]
[--accum-freq ACCUM_FREQ]
main.py: error: unrecognized arguments: --accum_freq=1
Well, after doing what you said ("first cd sdb1/lxl2/Chinese-CLIP-master/ ..."), I got the error above. I'm truly back where I started.
accum_freq
In the shell script, change --accum_freq=xxx to --accum-freq=xxx
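The reason the spelling matters is that argparse matches option strings literally: an option registered as `--accum-freq` is stored on the namespace as `accum_freq`, but `--accum_freq` on the command line is a different, unknown option. A minimal sketch, assuming a parser with just this one option (`build_parser` is a hypothetical stand-in for the real definitions in cn_clip/training/params.py):

```python
import argparse


def build_parser():
    """Hypothetical one-option parser mimicking the --accum-freq declaration."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--accum-freq", type=int, default=1)
    return parser


parser = build_parser()

# Hyphenated spelling: recognized, and exposed as args.accum_freq.
args, unknown = parser.parse_known_args(["--accum-freq=2"])
print(args.accum_freq, unknown)

# Underscore spelling: not recognized, so it lands in `unknown` and the
# default value is kept. parse_args() would instead exit with status 2.
args, unknown = parser.parse_known_args(["--accum_freq=2"])
print(args.accum_freq, unknown)
```

`parse_known_args` is used here only so the script keeps running and the leftover argument can be inspected.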
Traceback (most recent call last):
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
main()
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 51, in main
args.local_device_rank = int(os.environ['LOCAL_RANK'])
File "/home/amax/.conda/envs/lxl/lib/python3.9/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
A new parameter problem has come up... Could you please take another look?
Fix 1: in the shell script, add
Fix 2: in main.py
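The details of both fixes are not preserved in the thread. As one possible reading of the main.py route, the lookup that raised `KeyError: 'LOCAL_RANK'` could fall back to a default when the script is started with plain `python` instead of a launcher that sets the variable. This is a sketch under that assumption, not the fix that was actually posted; `get_local_rank` is a hypothetical helper.

```python
import os


def get_local_rank(env=os.environ, default=0):
    """Read LOCAL_RANK as an int, using `default` when it is not set.

    Hypothetical helper: launchers such as torchrun export LOCAL_RANK,
    while a plain `python main.py` run does not.
    """
    return int(env.get("LOCAL_RANK", default))


# Would stand in for: args.local_device_rank = int(os.environ['LOCAL_RANK'])
local_device_rank = get_local_rank()
print(local_device_rank)
```

Defaulting to 0 only makes sense for a single-process, single-GPU run.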
Traceback (most recent call last):
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
main()
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 55, in main
dist.init_process_group(backend="nccl")
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 224, in _env_rendezvous_handler
world_size = int(_get_env_or_raise("WORLD_SIZE"))
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
The problems just keep coming...
The error says an environment variable is not set. Environment variables can be configured like this: add export WORLD_SIZE=xx and it will work.
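WORLD_SIZE is usually not the only variable involved: the env:// rendezvous named in the error also reads RANK, MASTER_ADDR and MASTER_PORT, all of which a launcher would normally export. A rough sketch of filling them in for a single-process run, assuming one GPU on one machine (the helper name and the concrete address and port values are illustrative choices, not taken from the thread):

```python
import os

# Illustrative values for one process on one machine.
SINGLE_PROCESS_DEFAULTS = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "1",
    "RANK": "0",
}


def apply_single_process_defaults(env=os.environ):
    """Set each rendezvous variable only if it is not already defined."""
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        env.setdefault(key, value)
    return env


apply_single_process_defaults()
print({key: os.environ[key] for key in SINGLE_PROCESS_DEFAULTS})
```

Because `setdefault` leaves existing values alone, anything exported in the shell script or by a launcher still takes precedence. This would have to run before `dist.init_process_group` is called.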
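The main problems are now basically solved and I can start training. Thank you for your patient guidance over these past days; I can't express my gratitude enough~ [salute]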