
Train: 0%

Open 760677482 opened this issue 2 years ago • 10 comments

At first I noticed that I didn't have "pytorch3d", so I ran "pip install pytorch3d", but it showed an error. I then ran "pip uninstall pytorch3d" and downloaded it from https://anaconda.org/pytorch3d/pytorch3d/files. But now, when training, it always shows "Epoch: 1/25. Loop: Train: 0% 0/11579 [03:37<?, ?it/s]". I found that the program stops at this line: loss.backward().

What could be the problem? I am using CUDA 9.2 because my driver version is outdated. Looking forward to your help, thanks!

760677482 avatar Dec 27 '21 13:12 760677482

You can try unsetting the args.distributed flag.

eugenelyj avatar Jan 06 '22 06:01 eugenelyj

Please refer to instructions provided here to install pytorch3d.

If you can't install pytorch3d for your driver version, you may also try pytorch3d-nightly.

As @eugenelyj pointed out, try unsetting the distributed flag. You may get a better traceback.

shariqfarooq123 avatar Jan 17 '22 22:01 shariqfarooq123
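For anyone whose run stalls without raising an error, one generic way to get a traceback is Python's standard `faulthandler` module, which can dump the stack of every thread once a timeout expires. The sketch below is not part of AdaBins; the `hang_watchdog` helper name and the 120-second timeout in the usage comment are assumptions made for illustration.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump the Python stack of all threads if the block runs longer than timeout_s.

    `file` must be a real file object (it needs a file descriptor).
    The dump does not stop the program; it only reports where it is stuck.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes in time.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside a training loop:
#
#     with hang_watchdog(120):
#         loss.backward()
```

The dump only shows Python-level frames, so a stall inside native or CUDA code will appear as the Python line that called into it. That is still enough to confirm which call never returns.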

Hello, I had the same problem. When I ran 'python train.py args_train_nyu.txt', the program stopped here. Can you help me? [image]

libetter0913 avatar May 13 '22 07:05 libetter0913

Hello, I had the same problem. When I ran 'python train.py args_train_nyu.txt', the program stopped here. Can you help me? [image]

9796l avatar Dec 22 '22 12:12 9796l

Excuse me, did you solve this problem?

9796l avatar Dec 22 '22 12:12 9796l

Excuse me, did you solve this problem?

Did you solve the problem?

zhangbaijin avatar Mar 24 '23 00:03 zhangbaijin

I think I switched to a different server. On a server with 4 GPUs the error no longer occurred.

9796l avatar Mar 24 '23 02:03 9796l

I am using my own dataset, with raw images and depth images. Which file is the input.txt the author mentions, and what does the trailing 857.47 mean?

```
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/root/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/UDepth-master/train.py", line 109, in main_worker
    experiment_name=args.name, optimizer_state_dict=None)
  File "/root/autodl-tmp/UDepth-master/train.py", line 178, in train
    args) else enumerate(train_loader):
  File "/root/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/root/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/autodl-tmp/UDepth-master/dataloader.py", line 87, in __getitem__
    focal = float(sample_path.split()[2])
IndexError: list index out of range
```

zhangbaijin avatar Mar 24 '23 03:03 zhangbaijin

Would it be possible to add you on WeChat to ask about this? My ID is SemiMobile.

zhangbaijin avatar Mar 24 '23 03:03 zhangbaijin

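It's a focal problem. The split file you are using must be missing the focal length in pixels at the end of each line, so the value can't be read and the error is raised.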

jasdkfj avatar Aug 03 '23 13:08 jasdkfj
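The traceback earlier in the thread fails at `focal = float(sample_path.split()[2])`, which implies each line of the split file is expected to hold at least three whitespace-separated fields, with the third one a number. A minimal sketch for checking a split file before training follows. The `find_bad_lines` helper and the file name in the usage comment are hypothetical, and the three-field layout is inferred only from that one line of the traceback.

```python
def find_bad_lines(split_file, min_fields=3, focal_index=2):
    """Return (line_number, reason) for each line the loader could not parse.

    Mirrors the failing expression float(line.split()[focal_index]).
    """
    problems = []
    with open(split_file, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:
                problems.append((lineno, "empty line"))
            elif len(fields) < min_fields:
                problems.append(
                    (lineno, f"only {len(fields)} field(s), expected at least {min_fields}")
                )
            else:
                try:
                    float(fields[focal_index])
                except ValueError:
                    problems.append(
                        (lineno, f"focal value {fields[focal_index]!r} is not a number")
                    )
    return problems


# Hypothetical usage:
#
#     for lineno, reason in find_bad_lines("my_train_split.txt"):
#         print(f"line {lineno}: {reason}")
```

A line such as `rgb/0001.png depth/0001.png 857.47` (paths made up for illustration) would pass this check, while a line with only the two paths would be reported, matching the `IndexError` in the traceback.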