Chinese-CLIP

image_b64 is empty

Open erlan-11 opened this issue 10 months ago • 7 comments

Traceback (most recent call last):
  File "/root/Chinese-CLIP/cn_clip/training/main.py", line 350, in <module>
    main()
  File "/root/Chinese-CLIP/cn_clip/training/main.py", line 298, in main
    num_steps_this_epoch = train(model, data, epoch, optimizer, scaler, scheduler, args, steps)
  File "/root/Chinese-CLIP/cn_clip/training/train.py", line 165, in train
    batch = next(data_iter)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/Taidi/Chinese-CLIP/cn_clip/training/data.py", line 109, in __getitem__
    image_b64 = self.txn_imgs.get("{}".format(image_id).encode('utf-8')).tobytes()
AttributeError: 'NoneType' object has no attribute 'tobytes'

Exception in thread [2024-04-08 00:26:44,250] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 114557) of binary: /root/miniconda3/envs/ML/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/ML/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/ML/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

cn_clip/training/main.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-04-08_00:26:44
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 114557)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

However, I can retrieve the base64 encoding for my image_id, and the encoding looks normal.

erlan-11 (Apr 07 '24 16:04)
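For readers hitting the same traceback: the failing line calls `.tobytes()` directly on the result of `self.txn_imgs.get(...)`, and an LMDB transaction's `get` returns `None` when the key is absent. A minimal sketch of a guarded lookup is shown below. The helper name `fetch_image_b64` and the `FakeTxn` stand-in are hypothetical and are not part of Chinese-CLIP; the only assumption about the real transaction object is that it exposes `get(key)`.

```python
def fetch_image_b64(txn, image_id):
    """Look up an image by id and fail with a readable message if it is missing.

    `txn` is assumed to be any object with a `get(key)` method that returns
    None for a missing key, which is how an LMDB read transaction behaves.
    """
    key = "{}".format(image_id).encode("utf-8")
    value = txn.get(key)
    if value is None:
        # Without this check the caller would see:
        # AttributeError: 'NoneType' object has no attribute 'tobytes'
        raise KeyError(
            "image_id {!r} (key {!r}) is not present in the image database".format(
                image_id, key
            )
        )
    # bytes() accepts both bytes and memoryview values.
    return bytes(value)


class FakeTxn:
    """Hypothetical in-memory stand-in for a transaction, for demonstration only."""

    def __init__(self, data):
        self._data = data

    def get(self, key):
        return self._data.get(key)
```

With a check like this, the worker reports which image_id is missing instead of failing on `None`, which narrows the search to the data files rather than the training code.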

The parameters are probably set incorrectly:

GPUS_PER_NODE=1 # Number of GPUs on each machine
WORKER_CNT=1 # Number of training machines. Number of GPU workers, for single-worker training, please set to 1

export RANK=0 # The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0

Check the RANK and WORKER_CNT parameters and make sure that the value of RANK is smaller than WORKER_CNT.

ChesonHuang (Apr 09 '24 08:04)
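The constraint described above (RANK must lie in {0, ..., WORKER_CNT-1}) can be checked before launching. The following is only an illustrative sketch: the function `check_launch_config` is hypothetical and not part of the training scripts, and it simply restates the rule from the comment as code.

```python
def check_launch_config(gpus_per_node, worker_cnt, rank):
    """Return a list of human-readable problems; an empty list means the values look consistent."""
    problems = []
    if gpus_per_node < 1:
        problems.append("GPUS_PER_NODE must be at least 1, got {}".format(gpus_per_node))
    if worker_cnt < 1:
        problems.append("WORKER_CNT must be at least 1, got {}".format(worker_cnt))
    elif not 0 <= rank < worker_cnt:
        # RANK should be in {0, ..., WORKER_CNT-1}
        problems.append(
            "RANK must be between 0 and {}, got {}".format(worker_cnt - 1, rank)
        )
    return problems


# Single-machine, single-GPU values from the comment above.
print(check_launch_config(gpus_per_node=1, worker_cnt=1, rank=0))
```

As the later replies show, these settings can be correct and the error can still occur, so this check only rules out one possible cause.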

@ChesonHuang I get this error too, but I have already set those parameters.

EasonTuT (Apr 14 '24 18:04)

It was mainly a problem with how the dataset was split. Solved.

erlan-11 (Apr 14 '24 19:04)

@erlan-11 Could you say where you fixed it? I am still running into this problem.

EasonTuT (Apr 14 '24 19:04)

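@erlan-11 I think it is a tab-separation problem, but I checked the data: each key corresponds to exactly one data entry, and there are no blanks that would produce a NoneType, so I don't know where the problem is. If possible, could you tell me how you handled it? Thanks!

EasonTuT (Apr 14 '24 19:04)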

[Image: 微信图片_20240415035101] @erlan-11

EasonTuT (Apr 14 '24 19:04)

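The problem I ran into was that, after splitting the dataset, some image_ids had no corresponding description in the texts. You could write a small script to check whether the same thing is happening in your data.

erlan-11 (Apr 14 '24 23:04)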
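A small script along the lines suggested above might look like the sketch below. It assumes the split consists of an image file with one `image_id<TAB>base64` pair per line and a text file in JSON Lines form where each record carries an `image_ids` list; these formats and all function names here are assumptions for illustration, so adjust them to match your actual files. The check runs in both directions, since a mismatch either way means the image and text files of a split are out of sync.

```python
import json


def image_ids_from_tsv(lines):
    """Collect ids from lines shaped like '<image_id>\\t<base64>' (assumed format)."""
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        image_id, _, _ = line.partition("\t")
        # Strip stray whitespace so '123 ' and '123' compare equal.
        ids.add(image_id.strip())
    return ids


def image_ids_from_jsonl(lines):
    """Collect ids from JSON records that carry an 'image_ids' list (assumed format)."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for image_id in record.get("image_ids", []):
            ids.add(str(image_id))
    return ids


def compare_split(tsv_lines, jsonl_lines):
    """Report ids that appear on only one side of an image/text split."""
    in_images = image_ids_from_tsv(tsv_lines)
    in_texts = image_ids_from_jsonl(jsonl_lines)
    return {
        "referenced_but_missing_image": sorted(in_texts - in_images),
        "image_without_text": sorted(in_images - in_texts),
    }


def compare_files(tsv_path, jsonl_path):
    """Convenience wrapper that reads both files from disk."""
    with open(tsv_path, encoding="utf-8") as tsv, open(jsonl_path, encoding="utf-8") as jsonl:
        return compare_split(tsv, jsonl)
```

Ids listed under `referenced_but_missing_image` are the ones that would make the image lookup return `None` during training, so a non-empty list there is the first thing to look at.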