
Bug: ValueError: matrix contains invalid numeric entries

Open quangkmhd opened this issue 3 months ago • 12 comments

```
INFO 2025-12-04 07:54:24,909 train_utils.py: 268: Train Epoch: [70][200/532] | Batch Time: 0.80 (0.44) | Data Time: 0.00 (0.07) | Mem (GB): 16.00 (16.33/19.00) | Time Elapsed: 00d 04h 24m | Losses/train_all_loss: 2.72e+01 (6.98e+01) | Losses/train_default_loss: 0.00e+00 (0.00e+00)
INFO 2025-12-04 07:54:28,391 train_utils.py: 268: Train Epoch: [70][210/532] | Batch Time: 0.34 (0.44) | Data Time: 0.00 (0.06) | Mem (GB): 16.00 (16.34/19.00) | Time Elapsed: 00d 04h 24m | Losses/train_all_loss: 1.08e+01 (7.20e+01) | Losses/train_default_loss: 0.00e+00 (0.00e+00)
INFO 2025-12-04 07:54:31,771 train_utils.py: 268: Train Epoch: [70][220/532] | Batch Time: 0.34 (0.43) | Data Time: 0.00 (0.06) | Mem (GB): 17.00 (16.33/19.00) | Time Elapsed: 00d 04h 24m | Losses/train_all_loss: 1.47e+02 (7.06e+01) | Losses/train_default_loss: 0.00e+00 (0.00e+00)
INFO 2025-12-04 07:54:35,136 train_utils.py: 268: Train Epoch: [70][230/532] | Batch Time: 0.35 (0.43) | Data Time: 0.00 (0.06) | Mem (GB): 16.00 (16.34/19.00) | Time Elapsed: 00d 04h 25m | Losses/train_all_loss: 5.90e+01 (7.04e+01) | Losses/train_default_loss: 0.00e+00 (0.00e+00)
INFO 2025-12-04 07:54:38,600 train_utils.py: 268: Train Epoch: [70][240/532] | Batch Time: 0.40 (0.43) | Data Time: 0.00 (0.05) | Mem (GB): 16.00 (16.34/19.00) | Time Elapsed: 00d 04h 25m | Losses/train_all_loss: 6.55e+01 (7.00e+01) | Losses/train_default_loss: 0.00e+00 (0.00e+00)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/train.py", line 339, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/train.py", line 310, in main
[rank0]:     single_node_runner(cfg, main_port)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/train.py", line 71, in single_node_runner
[rank0]:     single_proc_run(local_rank=0, main_port=main_port, cfg=cfg, world_size=num_proc)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/train.py", line 58, in single_proc_run
[rank0]:     trainer.run()
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/trainer.py", line 567, in run
[rank0]:     self.run_train()
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/trainer.py", line 588, in run_train
[rank0]:     outs = self.train_epoch(dataloader)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/trainer.py", line 809, in train_epoch
[rank0]:     self._run_step(batch, phase, loss_mts, extra_loss_mts)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/trainer.py", line 946, in _run_step
[rank0]:     loss_dict, batch_size, extra_losses = self._step(
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/trainer.py", line 501, in _step
[rank0]:     find_stages = model(batch)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1661, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1487, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/model/sam3_image.py", line 567, in forward
[rank0]:     out = self.forward_grounding(
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/model/sam3_image.py", line 492, in forward_grounding
[rank0]:     self._compute_matching(out, self.back_convert(find_target))
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/model/sam3_image.py", line 579, in _compute_matching
[rank0]:     out["indices"] = self.matcher(out, targets)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/matcher.py", line 643, in forward
[rank0]:     indices = [
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/matcher.py", line 644, in <listcomp>
[rank0]:     _do_matching(c, repeats=repeats, do_filtering=do_filtering)
[rank0]:   File "/home/qmask_quangnh58/detect/sam3/sam3/train/matcher.py", line 19, in _do_matching
[rank0]:     i, j = linear_sum_assignment(cost)
[rank0]: ValueError: matrix contains invalid numeric entries
```

quangkmhd avatar Dec 04 '25 00:12 quangkmhd

Same error. Have you solved it?

```
ERROR 2025-12-03 18:50:44,092 train.py: 120: ValueError: matrix contains invalid numeric entries
Traceback:
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 116, in __call__
    self.run_trainer()
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 110, in run_trainer
    trainer.run()
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 567, in run
    self.run_train()
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 588, in run_train
    outs = self.train_epoch(dataloader)
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 809, in train_epoch
    self._run_step(batch, phase, loss_mts, extra_loss_mts)
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 946, in _run_step
    loss_dict, batch_size, extra_losses = self._step(
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 501, in _step
    find_stages = model(batch)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/model/sam3_image.py", line 567, in forward
    out = self.forward_grounding(
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/model/sam3_image.py", line 492, in forward_grounding
    self._compute_matching(out, self.back_convert(find_target))
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/model/sam3_image.py", line 579, in _compute_matching
    out["indices"] = self.matcher(out, targets)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qqlee/anaconda3/envs/sam3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/matcher.py", line 644, in forward
    _do_matching(c, repeats=repeats, do_filtering=do_filtering)
```

csqqlee avatar Dec 04 '25 01:12 csqqlee

The same error occurs here. Could you share what your val loss values look like? This is my output:

```
INFO 2025-12-04 11:07:15,046 trainer.py: 788: Meters: {
    'Meters_train/val_custom/detection/coco_eval_bbox_AP': 0.9835415892779844,
    'Meters_train/val_custom/detection/coco_eval_bbox_AP_50': 0.98940060460127,
    'Meters_train/val_custom/detection/coco_eval_bbox_AP_75': 0.987876679005721,
    'Meters_train/val_custom/detection/coco_eval_bbox_AP_small': -1.0,
    'Meters_train/val_custom/detection/coco_eval_bbox_AP_medium': 0.7965575702714814,
    'Meters_train/val_custom/detection/coco_eval_bbox_AP_large': 0.9846369241610899,
    'Meters_train/val_custom/detection/coco_eval_bbox_AR_maxDets@1': 0.05720164609053498,
    'Meters_train/val_custom/detection/coco_eval_bbox_AR_maxDets@10': 0.550960219478738,
    'Meters_train/val_custom/detection/coco_eval_bbox_AR_maxDets@100': 0.9932784636488341,
    'Meters_train/val_custom/detection/coco_eval_bbox_AR_small': -1.0,
    'Meters_train/val_custom/detection/coco_eval_bbox_AR_medium': 0.8750000000000002,
    'Meters_train/val_custom/detection/coco_eval_bbox_AR_large': 0.994923504867872,
    'Losses/val_all_loss': 0,
    'Losses/val_default_loss': 0,
    'Losses/val_custom_core_loss': 0.0,
    'Trainer/where': 0.44985074626865673,
    'Trainer/epoch': 8,
    'Trainer/steps_val': 672
}
```

MinGiSa avatar Dec 04 '25 03:12 MinGiSa

@MinGiSa Is your task an object detection task? What dataset is it? Have you trained it for semantic segmentation tasks?

csqqlee avatar Dec 04 '25 05:12 csqqlee

> @MinGiSa Is your task an object detection task? What dataset is it? Have you trained it for semantic segmentation tasks?

Regarding your questions: I built an instance-segmentation dataset from custom data of document table layouts, with polygon annotations. I trained by following the guidance in this issue: How to fine-tune on my own image/video datasets? · Issue #163 · facebookresearch/sam3.

MinGiSa avatar Dec 04 '25 05:12 MinGiSa

@csqqlee I'm using my own custom segmentation data. There was no error until epoch 70.

Image

I commented out the default loss and enabled the segmentation loss.

Image Image

quangkmhd avatar Dec 04 '25 06:12 quangkmhd

@quangkmhd Hello, I encountered an error while training semantic segmentation on a custom dataset. The error is as follows:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 340, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 310, in main
[rank0]:     single_node_runner(cfg, main_port)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 71, in single_node_runner
[rank0]:     single_proc_run(local_rank=0, main_port=main_port, cfg=cfg, world_size=num_proc)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 58, in single_proc_run
[rank0]:     trainer.run()
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 567, in run
[rank0]:     self.run_train()
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 588, in run_train
[rank0]:     outs = self.train_epoch(dataloader)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 809, in train_epoch
[rank0]:     self._run_step(batch, phase, loss_mts, extra_loss_mts)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 923, in _run_step
[rank0]:     assert isinstance(
[rank0]: AssertionError: Expected a list of batches, got <class 'dict'>
[rank0]:[W1204 17:31:48.157364025 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```

The format of my JSON file is as follows:
```json
{
  "info": { "description": "car_parts" },
  "images": [
    { "id": 1, "file_name": "te44_jpg.rf.ef911ea20f5873ff8733afab40c81817.jpg", "width": 1024, "height": 1024 }
  ],
  "annotations": [
    {
      "id": 1, "image_id": 1, "category_id": 15,
      "bbox": [716.0, 389.0, 33.0, 73.0],
      "segmentation": [[734.9328, 457.6, 748.2672, 430.88, 746.0272, 407.68, 723.84, 387.68, 714.9872, 460.96, 734.9328, 457.6]],
      "area": 1565.2760320000234, "iscrowd": 0, "noun_phrase": "right_mirror"
    },
    {
      "id": 2, "image_id": 1, "category_id": 11,
      "bbox": [458.0, 252.0, 313.0, 440.0],
      "segmentation": [[472.7472, 467.68, 481.6, 597.6, 483.84, 690.88, 550.5072, 687.68, 734.9328, 687.68, 750.5072, 654.24, 770.4528, 544.32, 748.2672, 424.32, 728.32, 370.88, 639.36, 304.32, 599.36, 277.6, 559.36, 260.96, 512.7472, 250.88, 457.1728, 250.88, 463.7872, 340.96, 466.0272, 410.88, 472.7472, 467.68]],
      "area": 110100.685696, "iscrowd": 0, "noun_phrase": "front_right_door"
    },
    {
      "id": 3, "image_id": 1, "category_id": 5,
      "bbox": [220.0, 245.0, 263.0, 447.0],
      "segmentation": [[414.9328, 247.68, 294.9328, 264.32, 272.7472, 287.68, 221.6528, 377.6, 219.4128, 417.6, 226.0272, 477.6, 234.9872, 510.88, 288.32, 584.32, 326.08, 647.68, 352.7472, 684.32, 386.0272, 684.32, 481.6, 690.88, 481.6, 624.32, 457.1728, 244.32, 414.9328, 247.68]],
      "area": 90374.42688000007, "iscrowd": 0, "noun_phrase": "back_right_door"
    },
    {
      "id": 4, "image_id": 1, "category_id": 12,
      "bbox": [876.0, 479.0, 98.0, 60.0],
      "segmentation": [[972.6928, 537.6, 959.36, 504.32, 946.0272, 487.68, 926.08, 477.6, 874.9872, 484.32, 934.9328, 520.96, 972.6928, 537.6]],
      "area": 2356.8351999999722, "iscrowd": 0, "noun_phrase": "front_right_light"
    },
    {
      "id": 5, "image_id": 1, "category_id": 6,
      "bbox": [105.0, 395.0, 73.0, 67.0],
      "segmentation": [[157.12, 427.68, 177.1728, 404.32, 139.4128, 394.24, 117.12, 397.6, 103.7872, 460.96, 157.12, 427.68]],
      "area": 2412.311936000002, "iscrowd": 0, "noun_phrase": "back_right_light"
    },
    {
      "id": 6, "image_id": 1, "category_id": 18,
      "bbox": [140.0, 585.0, 783.0, 217.0],
      "segmentation": [[288.32, 687.68, 517.6272, 689.328, 788.2672, 750.88, 817.1728, 777.6, 850.4528, 784.32, 881.6, 770.88, 910.5072, 734.24, 919.36, 700.96, 921.6, 650.88, 906.0272, 614.24, 888.32, 594.24, 859.4128, 584.32, 834.9872, 590.88, 806.08, 617.6, 790.5072, 650.88, 781.6528, 684.32, 512.32, 688.1216, 278.8272, 635.0176, 274.9152, 638.7472, 273.52, 633.8112, 268.2128, 632.6048, 266.0272, 627.6, 262.9072, 631.3968, 257.6, 630.1904, 261.5824, 622.0272, 252.6928, 610.88, 223.7872, 597.6, 194.9872, 597.6, 178.2672, 607.56, 178.0, 612.0864, 172.6928, 610.88, 148.2672, 634.24, 142.9552, 658.24, 139.4128, 674.24, 139.4128, 727.68, 159.36, 770.88, 190.5072, 797.6, 234.9872, 800.96, 272.7472, 774.24, 288.32, 737.6, 288.32, 687.68]],
      "area": 61066.46982656002, "iscrowd": 0, "noun_phrase": "wheel"
    }
  ],
  "categories": [
    { "id": 1, "name": "back_bumper" },
    { "id": 2, "name": "back_glass" },
    { "id": 3, "name": "back_left_door" },
    { "id": 4, "name": "back_left_light" },
    { "id": 5, "name": "back_right_door" },
    { "id": 6, "name": "back_right_light" },
    { "id": 7, "name": "front_bumper" },
    { "id": 8, "name": "front_glass" },
    { "id": 9, "name": "front_left_door" },
    { "id": 10, "name": "front_left_light" },
    { "id": 11, "name": "front_right_door" },
    { "id": 12, "name": "front_right_light" },
    { "id": 13, "name": "hood" },
    { "id": 14, "name": "left_mirror" },
    { "id": 15, "name": "right_mirror" },
    { "id": 16, "name": "tailgate" },
    { "id": 17, "name": "trunk" },
    { "id": 18, "name": "wheel" }
  ]
}
```

Is the dataset JSON format incorrect?

csqqlee avatar Dec 04 '25 09:12 csqqlee

@quangkmhd @csqqlee The error "ValueError: matrix contains invalid numeric entries" means your training is diverging. You can try the usual mitigations: bigger batch size, lower learning rate.
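If you want training to fail loudly (or survive long enough to dump the offending batch) at the exact step where the cost matrix goes bad, one option is to guard the Hungarian matching against non-finite entries before calling `scipy.optimize.linear_sum_assignment`. A minimal standalone sketch; the helper name `safe_matching` and the `1e8` sentinel cost are my own choices, not SAM3 code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def safe_matching(cost: np.ndarray):
    """Run Hungarian matching after replacing non-finite entries.

    linear_sum_assignment raises "matrix contains invalid numeric
    entries" when the cost matrix contains NaN or +/-inf, which is
    exactly what a diverged loss produces.
    """
    if not np.isfinite(cost).all():
        # Log, then sanitize: NaN/inf become a huge finite cost so those
        # pairs are effectively never chosen by the assignment.
        n_bad = int(np.sum(~np.isfinite(cost)))
        print(f"warning: {n_bad} non-finite entries in cost matrix")
        cost = np.nan_to_num(cost, nan=1e8, posinf=1e8, neginf=-1e8)
    return linear_sum_assignment(cost)


# NaN and inf on the diagonal force the off-diagonal assignment.
rows, cols = safe_matching(np.array([[np.nan, 1.0], [0.5, np.inf]]))
```

Sanitizing like this only masks the symptom: the NaNs originate upstream in the loss, and that root cause still needs fixing.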

@MinGiSa It looks like you have near perfect AP already at this point in training. Perhaps you can just get the last checkpoint before divergence and call it early stopping :p

alcinos avatar Dec 04 '25 10:12 alcinos

@quangkmhd I'm training on a GPU with 24GB of memory. Setting gradient_accumulation_steps to 1 results in an out-of-memory error, but setting it to 4 results in another error:

```
Raw dataset length = 9
Raw dataset length = 38
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 340, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 310, in main
[rank0]:     single_node_runner(cfg, main_port)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 71, in single_node_runner
[rank0]:     single_proc_run(local_rank=0, main_port=main_port, cfg=cfg, world_size=num_proc)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/train.py", line 58, in single_proc_run
[rank0]:     trainer.run()
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 567, in run
[rank0]:     self.run_train()
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 588, in run_train
[rank0]:     outs = self.train_epoch(dataloader)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 809, in train_epoch
[rank0]:     self._run_step(batch, phase, loss_mts, extra_loss_mts)
[rank0]:   File "/home/qqlee/Desktop/sam3/sam3-onnxruntime/sam3/train/trainer.py", line 923, in _run_step
[rank0]:     assert isinstance(
[rank0]: AssertionError: Expected a list of batches, got <class 'dict'>
[rank0]:[W1204 18:06:00.068970740 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```

csqqlee avatar Dec 04 '25 10:12 csqqlee

> @quangkmhd Hello, I encountered an error while training semantic segmentation on a custom dataset: `AssertionError: Expected a list of batches, got <class 'dict'>` (full traceback and dataset JSON in my comment above). Is the dataset JSON format incorrect?

I think it's correct.

quangkmhd avatar Dec 04 '25 14:12 quangkmhd

> @quangkmhd I'm training on a GPU with 24GB of memory. Setting gradient_accumulation_steps to 1 results in an out-of-memory error, but setting it to 4 results in `AssertionError: Expected a list of batches, got <class 'dict'>` (full traceback in my comment above).

You can refer to this issue. https://github.com/facebookresearch/sam3/issues/200

rishi1134 avatar Dec 04 '25 19:12 rishi1134

> @quangkmhd @csqqlee The error "ValueError: matrix contains invalid numeric entries" means your training is diverging. You can try the usual mitigations: bigger batch size, lower learning rate.

I'm facing a similar error, but the training loss curves (and the val AP curves) look healthy and don't appear to be diverging. I'm only fine-tuning the detector head, with a batch size of 32 and a learning rate of 1e-5, so divergence seems an unlikely cause.

Image

rishi1134 avatar Dec 04 '25 19:12 rishi1134

@rishi1134 Hello, I'm training on a 4090 GPU. Training the detection model works fine, but training the segmentation model runs out of GPU memory. Increasing gradient_accumulation_steps to reduce memory usage then triggers the AssertionError above.

csqqlee avatar Dec 05 '25 02:12 csqqlee

This problem is caused by the IABCEMdetr loss function, whose presence_gamma defaults to 0. If the model predicts all presences correctly, the loss becomes inf and the backward pass produces NaN. You can set presence_gamma to 2 for this function, or modify `_inner_focal_loss_bwd` in sigmoid_focal_loss.py to be numerically stable.
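The general failure mode here is a classic one: cross-entropy-style losses computed from probabilities, rather than directly from logits, overflow to inf once a prediction saturates, and any inf in the forward pass turns the backward pass into NaN. A standalone illustration in pure Python (not SAM3 code) of why the logit-space formulation is the stable one:

```python
import math


def bce_naive(logit: float, target: float) -> float:
    """BCE from probabilities: breaks down once the logit saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    # For logit ~ 40, p rounds to exactly 1.0, so log(1 - p) = log(0).
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def bce_stable(logit: float, target: float) -> float:
    """BCE in logit space: max(x, 0) - x*t + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits the two formulations agree...
assert abs(bce_naive(5.0, 1.0) - bce_stable(5.0, 1.0)) < 1e-9

# ...but a saturated wrong prediction blows up the naive form; in a
# framework like torch this surfaces as an inf loss and a NaN gradient.
try:
    bce_naive(40.0, 0.0)
except ValueError:  # math.log(0.0) raises in pure Python
    pass

finite_loss = bce_stable(40.0, 0.0)  # large but finite (~40.0)
```

This is the same trick PyTorch's `binary_cross_entropy_with_logits` uses internally, and it is one reasonable direction for making a custom focal-loss backward more stable.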

yhy258 avatar Jan 07 '26 02:01 yhy258

> This problem is caused by the IABCEMdetr loss function, whose presence_gamma defaults to 0. If the model predicts all presences correctly, the loss becomes inf and the backward pass produces NaN. You can set presence_gamma to 2, or modify `_inner_focal_loss_bwd` in sigmoid_focal_loss.py to be numerically stable.

Hi. The gamma value is set to 2.0 by default in their example training configuration, yet I still encountered this error during fine-tuning, especially when unfreezing more weights. Could you expand on how to modify the `_inner_focal_loss_bwd` function to be more stable? Thanks.

Image

summelon avatar Feb 04 '26 02:02 summelon

Hi @summelon, as the class initialization below shows, it is presence_gamma that defaults to 0.0 (that is the parameter I meant, not gamma). You can add presence_gamma: 2.0 to your config.

class IABCEMdetr(LossWithWeights):
    def __init__(
        self,
        pos_weight,
        weight_dict=None,
        compute_aux=True,
        gamma=0,
        weak_loss=True,
        alpha=0.25,
        pad_n_queries=None,
        pad_scale_pos=1.0,
        use_separate_loss_for_det_and_trk=False,
        num_det_queries=None,
        det_exhaustive_loss_scale_pos=1.0,
        det_exhaustive_loss_scale_neg=1.0,
        det_non_exhaustive_loss_scale_pos=1.0,
        det_non_exhaustive_loss_scale_neg=1.0,
        trk_loss_scale_pos=1.0,
        trk_loss_scale_neg=1.0,
        no_loss_for_fp_propagation=False,
        apply_loss_to_det_queries_in_video_grounding=True,
        use_presence=False,
        use_presence_semgseg=False,  # If True, use presence scores from the semgseg head.
        presence_alpha=0.5,
        presence_gamma=0.0,
        pos_focal: bool = False,  # for box scores, use focal loss for positives as well
    ):
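In a YAML-style training config that override would look something like the following; the surrounding key name is illustrative, so match it to wherever your config instantiates IABCEMdetr:

```yaml
# Wherever your config constructs the IABCEMdetr loss (key path illustrative):
loss:
  presence_gamma: 2.0   # default is 0.0; non-zero avoids the inf loss / NaN backward
```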

yhy258 avatar Feb 04 '26 02:02 yhy258

Thank you very much @yhy258 for the correction. I mistook gamma for presence_gamma because of the subtle name difference; I'll try this change later.

summelon avatar Feb 04 '26 02:02 summelon