Chinese-CLIP
image_b64 is empty
Traceback (most recent call last):
File "/root/Chinese-CLIP/cn_clip/training/main.py", line 350, in
Exception in thread [2024-04-08 00:26:44,250] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 114557) of binary: /root/miniconda3/envs/ML/bin/python3
Traceback (most recent call last):
File "/root/miniconda3/envs/ML/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/ML/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 198, in
main()
File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
cn_clip/training/main.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure): [0]: time : 2024-04-08_00:26:44 host : localhost rank : 0 (local_rank: 0) exitcode : 1 (pid: 114557) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
However, I can retrieve the base64 encoding for my image_id, and the encoding is valid.
The parameters are probably incorrect:
GPUS_PER_NODE=1 # Number of GPUs on each machine
WORKER_CNT=1 # Number of machines used for training (number of GPU workers); for single-worker training, please set to 1
export RANK=0 # The rank of this worker, should be in {0, ..., WORKER_CNT-1}; for single-worker training, please set to 0
Check the RANK and WORKER_CNT parameters, and make sure the value of RANK is less than WORKER_CNT.
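The RANK / WORKER_CNT check can be sketched as a small Python helper. This is only an illustration: `validate_launch_settings` is a hypothetical name, not part of Chinese-CLIP, and it assumes both values are exported as environment variables before the training script is launched.

```python
import os


def validate_launch_settings(env):
    """Return a list of problems found in the RANK / WORKER_CNT settings.

    `env` is any mapping of variable names to strings, e.g. os.environ.
    Hypothetical helper for illustration; not part of Chinese-CLIP.
    """
    problems = []
    try:
        # Defaults mirror the single-worker values quoted in the thread.
        worker_cnt = int(env.get("WORKER_CNT", "1"))
        rank = int(env.get("RANK", "0"))
    except ValueError:
        return ["RANK and WORKER_CNT must be integers"]
    if worker_cnt < 1:
        problems.append("WORKER_CNT must be at least 1")
    # RANK must lie in {0, ..., WORKER_CNT-1}.
    if not 0 <= rank < worker_cnt:
        problems.append(f"RANK={rank} must be in 0..{worker_cnt - 1}")
    return problems


if __name__ == "__main__":
    for problem in validate_launch_settings(os.environ):
        print(problem)
```

An empty result only means these two values are consistent with each other; as the replies below show, the same traceback can have a different cause.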
@ChesonHuang I get this error as well, but I have already set these parameters.
It was mainly a problem with how the dataset was split. Solved.
@erlan-11 Could you say where you fixed it? I am still running into this problem.
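@erlan-11 I think it is a tab-separator problem, but I checked the data: each key maps to exactly one data entry, and there are no blank values that would produce a NoneType. So I don't know where it goes wrong. If possible, could you tell me how you handled it? Thanks!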
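In my case, the problem was that after splitting the dataset, some image_id values had no matching description in the text data. You could write a small script to check whether the same thing happens in your data.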
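A minimal sketch of such a check, in Python. It rests on assumptions about the file layout rather than anything stated in this thread: the image file is taken to be a TSV with one `image_id<TAB>base64` pair per line, and the text file a JSONL whose records carry an `image_ids` list. The function name `find_unmatched_ids` is hypothetical; adjust the field names to your own data.

```python
import json
import sys


def find_unmatched_ids(tsv_lines, jsonl_lines):
    """Compare image IDs in the image TSV with those referenced by the texts.

    Returns three sets of ID strings:
      missing_images      - referenced by a text record but absent from the TSV
      unreferenced_images - present in the TSV but referenced by no text record
      empty_payloads      - TSV lines whose base64 field is empty or missing
    """
    image_ids = set()
    empty_payloads = set()
    for line in tsv_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # A line without a tab yields an empty payload and is flagged below.
        image_id, _, payload = line.partition("\t")
        image_id = image_id.strip()
        image_ids.add(image_id)
        if not payload.strip():
            empty_payloads.add(image_id)

    referenced = set()
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            # Assumed field name; IDs are compared as strings.
            referenced.update(str(i) for i in record.get("image_ids", []))

    return referenced - image_ids, image_ids - referenced, empty_payloads


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_ids.py <images.tsv> <texts.jsonl>
    with open(sys.argv[1], encoding="utf-8") as tsv, \
            open(sys.argv[2], encoding="utf-8") as jsonl:
        missing, unreferenced, empty = find_unmatched_ids(tsv, jsonl)
    print("referenced in texts but missing from TSV:", sorted(missing))
    print("in TSV but not referenced by any text:", sorted(unreferenced))
    print("TSV lines with an empty base64 field:", sorted(empty))
```

Any non-empty set is worth inspecting. The script checks both directions because an ID that is referenced but has no image data is a plausible source of an empty `image_b64`, while an image with no description matches the case described above. Lines separated by spaces instead of a tab show up in the third set.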