Chinese-CLIP: this problem is driving me crazy and I can't find a fix. Could someone take a look?

Running the sh script always fails with an unrecognized argument, main.py: error: unrecognized arguments: --accum-freq=1, even though my script is identical to the example.

`usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
               [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
               [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: unrecognized arguments: --accum-freq=1
[2024-04-11 23:52:11,183] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 5808) of binary: /home/amax/.conda/envs/lxl/bin/python3
Traceback (most recent call last):
  File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 816, in <module>
    main()
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-11_23:52:11
  host      : amax
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
`

Apr 11 '24 16:04 iWangTing

Could you try the command below?

First, cd sdb1/lxl2/Chinese-CLIP-master/

python cn_clip/training/main.py
--train-data=${train_data}
--val-data=${val_data}
--resume=${resume}
${reset_data_offset}
${reset_optimizer}
--logs=${output_base_dir}
--name=${name}
--save-step-frequency=${save_step_frequency}
--save-epoch-frequency=${save_epoch_frequency}
--log-interval=${log_interval}
${report_training_batch_acc}
--context-length=${context_length}
--warmup=${warmup}
--batch-size=${batch_size}
--valid-batch-size=${valid_batch_size}
--valid-step-interval=${valid_step_interval}
--valid-epoch-interval=${valid_epoch_interval}
--lr=${lr}
--accum_freq=${accum_freq}
--wd=${wd}
--max-epochs=${max_epochs}
--vision-model=${vision_model}
${use_augment}
--text-model=${text_model}
--grad-checkpointing

You can open cn_clip/training/params.py and search for accum-freq to see whether that argument is defined.
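The same check can be scripted instead of done by hand. The following is a minimal sketch: `find_flag_lines` is a hypothetical helper, and the sample text is invented for illustration rather than copied from `params.py`. Searching for the shorter string `accum` catches both the hyphen and the underscore spelling.

```python
def find_flag_lines(source_text, needle):
    """Return (line_number, stripped_line) pairs whose text contains `needle`."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if needle in line
    ]


# Invented sample text. In practice, read the real file instead, e.g.
# open("cn_clip/training/params.py", encoding="utf-8").read()
sample = '''
parser.add_argument("--lr", type=float, default=None)
parser.add_argument("--accum-freq", type=int, default=1)
'''

matches = find_flag_lines(sample, "accum")
print(matches)
```

If the list comes back empty for the real file, the checkout probably predates the flag; if it shows a hyphenated definition, the script has to pass the hyphenated spelling.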

If you want to use distributed training, you can also run ps -ef | grep main to check the processes.

Apr 12 '24 03:04 ChesonHuang

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ python cn_clip/training/main.py
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY] [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}] [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH] [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: the following arguments are required: --train-data
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --train-data=${train_data}
--train-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --val-data=${val_data}
--val-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --resume=${resume}
--resume=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_data_offset}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_optimizer}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --logs=${output_base_dir}
--logs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --name=${name}
--name=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-step-frequency=${save_step_frequency}
--save-step-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-epoch-frequency=${save_epoch_frequency}
--save-epoch-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --log-interval=${log_interval}
--log-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${report_training_batch_acc}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --context-length=${context_length}
--context-length=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --warmup=${warmup}
--warmup=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --batch-size=${batch_size}
--batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-batch-size=${valid_batch_size}
--valid-batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-step-interval=${valid_step_interval}
--valid-step-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-epoch-interval=${valid_epoch_interval}
--valid-epoch-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --lr=${lr}
--lr=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --accum_freq=${accum_freq}
--accum_freq=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --wd=${wd}
--wd=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --max-epochs=${max_epochs}
--max-epochs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --vision-model=${vision_model}
--vision-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${use_augment}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --text-model=${text_model}
--text-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --grad-checkpointing
--grad-checkpointing: command not found

Hello, the output is shown above. Also, params.py does contain the accum-freq argument.

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ps -ef | grep main
amax 7490 4067 0 12:41 pts/0 00:00:00 grep --color=auto main

Apr 12 '24 04:04 iWangTing

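Replace the original torchrun command in your sh script with this command and run the script; don't type it directly into the terminal like that. For example, remove the part of the script highlighted in green below: [image: clip]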

Apr 12 '24 07:04 ChesonHuang

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 16, in <module>
    from cn_clip.clip import load
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/__init__.py", line 4, in <module>
    from .model import convert_state_dict
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/model.py", line 16, in <module>
    FlashMHA = importlib.import_module('flash_attn.flash_attention').FlashMHA
  File "/home/amax/.conda/envs/lxl/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attention.py", line 7, in <module>
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 5, in <module>
    import flash_attn_cuda
ImportError: /home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
This is the result after doing the cd first and then replacing the command line in the script, as you suggested.

Apr 12 '24 07:04 iWangTing

/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalI

Your linux-gnu.so has a dependency problem. Please see https://github.com/open-mmlab/mmdetection3d/issues/1152 for a similar fix.

Apr 12 '24 08:04 ChesonHuang

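I tried the fix from 1152, but it still doesn't work. That issue appears to be about mmcv, whereas mine is about flash-attn. I also looked through the flash-attn issues for a fix, with no luck. It seems flash-attn supports torch 1.12 and above, while mine is 1.10. I also have no need for flash-attn at all. How can I disable or ignore the flash-attn parts in the code?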

Apr 13 '24 13:04 iWangTing

pip uninstall flash_attn
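Uninstalling the package is the fix given here. For the question of ignoring flash-attn from within code, a common general pattern is a guarded import that falls back to `None`. The following is a minimal sketch of that pattern, not the repository's actual implementation; `optional_attr` is a hypothetical helper name.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module_name.attr_name, or None if the import or lookup fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as a compiled extension that fails
        # to load (the undefined-symbol failure above is raised as ImportError).
        return None
    return getattr(module, attr_name, None)


# Hypothetical usage mirroring the line that failed in model.py.
FlashMHA = optional_attr("flash_attn.flash_attention", "FlashMHA")
print("flash attention available:", FlashMHA is not None)
```

Code that later uses `FlashMHA` would then need to check for `None` before calling it.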

Apr 13 '24 13:04 ChesonHuang

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--valid-num-workers VALID_NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc]
               [--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL]
               [--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--use-flash-attention] [--gather-with-grad] [--skip-aggregate] [--debug] [--seed SEED] [--distllation] [--teacher-model-name TEACHER_MODEL_NAME] [--kd_loss_weight KD_LOSS_WEIGHT]
               [--accum-freq ACCUM_FREQ]
main.py: error: unrecognized arguments: --accum_freq=1

Uh, after doing what you described ("first cd sdb1/lxl2/Chinese-CLIP-master/..............."), I got the error above. I'm honestly back where I started.

Apr 13 '24 13:04 iWangTing

accum_freq

In the shell script, change --accum_freq=xxx to --accum-freq=xxx
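The reason the spelling matters: argparse matches the flag exactly as it was declared on the command line, and only the attribute name on the parsed namespace uses an underscore. A minimal sketch with a stand-in parser (not the project's `params.py`) shows both spellings:

```python
import argparse

# Stand-in parser with a single flag, declared with a hyphen as in the usage message.
parser = argparse.ArgumentParser(prog="main.py")
parser.add_argument("--accum-freq", type=int, default=1)

# Hyphenated spelling: recognized, and stored as args.accum_freq.
args, unknown = parser.parse_known_args(["--accum-freq=4"])
print(args.accum_freq, unknown)

# Underscore spelling: not matched, so it lands in the unknown list.
# parse_args() would reject the same input with "unrecognized arguments".
args_bad, unknown_bad = parser.parse_known_args(["--accum_freq=4"])
print(args_bad.accum_freq, unknown_bad)
```

`parse_known_args` is used here only so the sketch can show the leftover argument instead of exiting.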

Apr 13 '24 13:04 ChesonHuang

Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
    main()
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 51, in main
    args.local_device_rank = int(os.environ['LOCAL_RANK'])
  File "/home/amax/.conda/envs/lxl/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'

A new argument problem has come up. Could you take another look?

Apr 13 '24 13:04 iWangTing

Fix 1: in the shell script, add [snippet not preserved]

Fix 2: in main.py [snippet not preserved]
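The snippets for both fixes did not survive in this thread. As background, the traceback shows `main.py` reading `os.environ['LOCAL_RANK']` directly; torchrun sets that variable, while a plain `python` launch does not. The following is only a guess at what a `main.py`-side change could look like, with `read_local_rank` as a hypothetical helper and a default of 0 as an assumption for a single-GPU run:

```python
import os


def read_local_rank(env=os.environ, default=0):
    """Return LOCAL_RANK as an int, falling back to `default` when it is unset."""
    return int(env.get("LOCAL_RANK", default))


# Plain dicts stand in for os.environ so the behaviour is easy to see.
print(read_local_rank({}))
print(read_local_rank({"LOCAL_RANK": "3"}))
```

The shell-side equivalent would be exporting `LOCAL_RANK` before the `python` command, which keeps `main.py` unmodified.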

Apr 13 '24 13:04 ChesonHuang

Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
    main()
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 55, in main
    dist.init_process_group(backend="nccl")
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 224, in _env_rendezvous_handler
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

The problems just keep coming...

Apr 13 '24 15:04 iWangTing

The error says an environment variable is not set. You can configure it like this: add export WORLD_SIZE=xx and it should work.

Apr 14 '24 04:04 ChesonHuang

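The main problems are basically solved now and I can start training. Thank you for your patient guidance over the past several days; I can't thank you enough~ [salute]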

Apr 15 '24 03:04 iWangTing
