
Training does not succeed

Open skycat88 opened this issue 2 years ago • 16 comments

size mismatch for image_encoder.blocks.23.mlp.lin1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
size mismatch for image_encoder.blocks.23.mlp.lin1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for image_encoder.blocks.23.mlp.lin2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
size mismatch for image_encoder.blocks.23.mlp.lin2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1280]).
size mismatch for image_encoder.neck.0.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 1280, 1, 1]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3207998) of binary: /home/syy/anaconda3/envs/SAM_Adapter/bin/python
Traceback (most recent call last):
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
train.py FAILED

Failures:
[1]:
  time : 2023-04-24_19:02:47
  host : vip
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 3208003)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2023-04-24_19:02:47
  host : vip
  rank : 2 (local_rank: 2)
  exitcode : 1 (pid: 3208005)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2023-04-24_19:02:47
  host : vip
  rank : 3 (local_rank: 3)
  exitcode : 1 (pid: 3208011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2023-04-24_19:02:47
  host : vip
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 3207998)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(SAM_Adapter) syy@vip:~/code/data_auto/SAM-Adapter-PyTorch$ python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config configs/demo.yaml

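1. I configured the environment versions as required. The loadddptrain.py mentioned in the README does not exist, so I used train.py.
2. The data I downloaded is cmos. Are there any other data-processing requirements? The only data used for the training experiment is the camouflaged object detection data below, from which I prepared 1500 samples: CAMO-COCO-V.1.0-CVIU2019\Camouflage\Images GT image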
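The size-mismatch entries in the log are easier to compare once they are pulled into (parameter, checkpoint shape, model shape) triples. A minimal sketch, assuming the message wording matches the log in this issue; `summarize_mismatches` and `parse_shape` are hypothetical helper names, not part of PyTorch or this repository:

```python
import re

# Matches one "size mismatch" entry as it is worded in the log above.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, the shape in "
    r"current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def parse_shape(text):
    """Turn a string such as '4096, 1024' into the tuple (4096, 1024)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize_mismatches(log_text):
    """Return (parameter name, checkpoint shape, model shape) per entry."""
    return [
        (m.group("name"), parse_shape(m.group("ckpt")), parse_shape(m.group("model")))
        for m in MISMATCH_RE.finditer(log_text)
    ]


# Two entries copied from the log in this issue.
log = (
    "size mismatch for image_encoder.blocks.23.mlp.lin1.bias: copying a param "
    "with shape torch.Size([4096]) from checkpoint, the shape in current model "
    "is torch.Size([5120]). "
    "size mismatch for image_encoder.neck.0.weight: copying a param with shape "
    "torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model "
    "is torch.Size([256, 1280, 1, 1])."
)

for name, ckpt_shape, model_shape in summarize_mismatches(log):
    print(name, ckpt_shape, "->", model_shape)
```

In every entry the checkpoint side uses a width of 1024 where the model expects 1280, which suggests the checkpoint and the configured model are two different encoder sizes rather than a corrupted file.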

skycat88 avatar Apr 25 '23 06:04 skycat88

I have the same issue. image

feijifei avatar Apr 26 '23 06:04 feijifei

Same question

laiyingxin2 avatar Apr 27 '23 13:04 laiyingxin2

I have the same issue. image

Has this been solved?

Chukuanren avatar Apr 28 '23 07:04 Chukuanren

same question!

huizhang0110 avatar May 06 '23 12:05 huizhang0110

same question!

85zhanghao avatar May 10 '23 13:05 85zhanghao

You can try: CUDA_VISIBLE_DEVICES=8,9,10,11 python -m torch.distributed.launch train.py

Darren759 avatar May 12 '23 07:05 Darren759

@skycat88 You should change the config .yaml file in ./configs/. Specifically, the pretrained weight should be h rather than l.
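One way to check which size a downloaded checkpoint is before editing the config: the released SAM image encoders use an embedding width of 768 (ViT-B), 1024 (ViT-L), or 1280 (ViT-H), and the log above shows that width as the second dimension of `image_encoder.neck.0.weight`. A minimal sketch under those assumptions; `sam_variant_from_shapes` and `shapes_from_checkpoint` are hypothetical helpers, and the checkpoint is assumed to be a flat state dict:

```python
# Embedding width of the SAM image encoder for each released size.
SAM_EMBED_DIM = {768: "vit_b", 1024: "vit_l", 1280: "vit_h"}

# In the log above this parameter has shape [256, embed_dim, 1, 1].
NECK_KEY = "image_encoder.neck.0.weight"


def sam_variant_from_shapes(shapes):
    """Guess the SAM variant from a {parameter name: shape} mapping."""
    embed_dim = shapes[NECK_KEY][1]
    return SAM_EMBED_DIM.get(embed_dim, "unknown")


def shapes_from_checkpoint(path):
    """Load a checkpoint on CPU and return {name: shape}. Requires PyTorch.

    Assumption: the file holds a flat state dict of tensors.
    """
    import torch  # imported here so the rest of the sketch runs without it

    state = torch.load(path, map_location="cpu")
    return {name: tuple(t.shape) for name, t in state.items() if hasattr(t, "shape")}


# Shapes taken from the error message in this issue.
checkpoint_shapes = {NECK_KEY: (256, 1024, 1, 1)}
model_shapes = {NECK_KEY: (256, 1280, 1, 1)}

print("checkpoint:", sam_variant_from_shapes(checkpoint_shapes))
print("model:", sam_variant_from_shapes(model_shapes))
```

With the shapes from this issue, the checkpoint comes out as `vit_l` and the model as `vit_h`, which is consistent with the advice to use the h weights. To inspect a real file, pass the result of `shapes_from_checkpoint(...)` with your own checkpoint path to `sam_variant_from_shapes`.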

yPanStupidog avatar May 16 '23 17:05 yPanStupidog

I think the train command should be: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config [CONFIG_PATH]

Chenjuanwen avatar May 25 '23 09:05 Chenjuanwen

How did you solve it? Does it require four GPUs? If I only have one, how should I change the command? Thank you all.
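On the GPU count: in the commands quoted in this thread, `--nproc_per_node` is the number of processes to start and `CUDA_VISIBLE_DEVICES` lists the GPUs they may use, so the two should agree. A minimal sketch that builds the command for any number of GPUs; `build_launch_command` is a hypothetical helper, and whether train.py trains correctly with a single process has not been verified here:

```python
import os
import shlex


def build_launch_command(gpu_ids, config_path):
    """Return (env, argv) for launching train.py on the given GPU ids."""
    # Restrict the visible GPUs to the ones requested.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    argv = [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", "1",
        "--nproc_per_node", str(len(gpu_ids)),  # one process per visible GPU
        "train.py", "--config", config_path,
    ]
    return env, argv


# Single-GPU case; configs/demo.yaml is the config used in the original post.
env, argv = build_launch_command([0], "configs/demo.yaml")
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))
```

The pair can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root. Note that the GPU count is separate from the size-mismatch error above, which comes from the checkpoint and config disagreeing.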

lzn12345 avatar Jun 04 '23 17:06 lzn12345

same question!

yanre-hyd avatar Jun 08 '23 08:06 yanre-hyd

Please check whether you have changed the config .yaml file in ./configs/ to point to the right SAM checkpoint file.

tianrun-chen avatar Jun 10 '23 14:06 tianrun-chen

I solved the same exception by installing the packages strictly according to the requirements and modifying the config.

lydmom avatar Jun 12 '23 06:06 lydmom

@skycat88 You should change the config .yaml file in ./configs/. Specifically, the pretrained weight should be h rather than l.

That's right. After I made that change, it ran.

theneao avatar Jul 06 '23 11:07 theneao

I have the same issue. image

Has this been solved?

Same question. Has this been solved?

almighty79251 avatar Sep 22 '23 01:09 almighty79251

How did you solve it? Does it require four GPUs? If I only have one, how should I change the command? Thank you all.

May I ask if you have resolved the problem?

guokeqianhg avatar Apr 17 '24 14:04 guokeqianhg

I have the same issue. image

Has this been solved?

Same question. Has this been solved?

Maybe? I can use DDP now. I resolved the problem by resetting my Ubuntu system, but I am no longer working on this project.

yanre-hyd avatar Apr 18 '24 10:04 yanre-hyd