Add the data serialization factor in COCO-JSON-LOADER

Open yhy258 opened this issue 1 week ago • 2 comments

This addresses issue #307. (Closes #307)

When I fine-tuned SAM3 on a large dataset (more than 1M 2D images), I ran into a memory leak in the dataloader.

I then followed this post: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/, and the torch-serialization approach it describes fixed the problem.
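For context, here is a minimal sketch of the idea behind that approach, not the code in this PR. With fork-based dataloader workers, merely reading a large list of Python objects updates their reference counts, which forces copy-on-write pages to be duplicated in every worker and looks like a leak. Pickling every record into one contiguous buffer and indexing it by byte offsets avoids that. The class name `SerializedList` and the sample records are hypothetical, and this version uses only the standard library (`bytes` plus `array`), whereas the linked post, as I understand it, keeps the buffer in a NumPy array or torch tensor.

```python
import pickle
from array import array


class SerializedList:
    """Read-only list that keeps all items in one contiguous bytes buffer.

    Hypothetical illustration: a single buffer and a flat offset array
    replace millions of small Python objects, so workers that only read
    items do not touch per-object reference counts.
    """

    def __init__(self, items):
        # Pickle every item once, up front.
        blobs = [pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL) for item in items]
        # offsets[i] is the start of item i; offsets[i + 1] is its end.
        self._offsets = array("q", [0])
        for blob in blobs:
            self._offsets.append(self._offsets[-1] + len(blob))
        self._buffer = b"".join(blobs)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start, end = self._offsets[idx], self._offsets[idx + 1]
        # Deserialize only the requested record, on demand.
        return pickle.loads(memoryview(self._buffer)[start:end])


# Hypothetical COCO-like records, just to exercise the class.
annotations = [{"image_id": i, "bbox": [i, i, 10, 10]} for i in range(3)]
records = SerializedList(annotations)
```

Each lookup pays a small unpickling cost, which is usually negligible next to image decoding; in exchange, the annotation store stays in memory once instead of being gradually copied into every worker.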

To use the torch-serialization method, adjust the coco_json_loader component in the YAML config file by adding grouped_serialzation: true, like this:

data:
    train:
      _target_: sam3.train.data.torch_dataset.TorchDataset
      dataset:
        _target_: sam3.train.data.sam3_image_dataset.Sam3ImageDataset
        limit_ids: ${biomedseg2d_train.num_images}
        transforms: ${biomedseg2d_train.train_transforms}
        load_segmentation: ${scratch.enable_segmentation}
        coco_json_loader:
          _target_: sam3.train.data.coco_json_loaders.COCO_FROM_JSON
          category_chunk_size: 2
          grouped_serialzation: true # !!!
          _partial_: true

Dec 18 '25 02:12 yhy258