
[DetectionDataset] - enable lazy dataset loading

Open hardikdava opened this issue 2 years ago • 10 comments

Search before asking

  • [X] I have searched the Supervision issues and found no similar bug report.

Bug

sv.DetectionDataset loads all images eagerly, even when they are not needed. It should only load an image when it is actually accessed. This would make it possible to work with large datasets without holding every image in memory.

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • [X] Yes I'd like to help by submitting a PR!

hardikdava avatar Aug 24 '23 08:08 hardikdava

Hi @hardikdava 👋🏻!

Here is my idea. Let's create a set of separate methods sv.DetectionDataset.generate_from_*. Unlike sv.DetectionDataset.from_*, it would return a Python generator. What do you think?

  • Typing
sv.DetectionDataset.generate_from_yolo(
    images_directory_path: str,
    annotations_directory_path: str
) -> Generator[Tuple[str, np.ndarray, sv.Detections], None, None]:
    pass
  • Usage example
for path, image, detections in sv.DetectionDataset.generate_from_yolo(...):
    pass
  • Method names
- sv.DetectionDataset.generate_from_yolo(...)
- sv.DetectionDataset.generate_from_coco(...)
- sv.DetectionDataset.generate_from_pascal_voc(...)

SkalskiP avatar Aug 25 '23 13:08 SkalskiP

@SkalskiP Is there any way to modify the current APIs instead? Otherwise users will be confused between the sv.DetectionDataset.from_* and sv.DetectionDataset.generate_from_* methods.

hardikdava avatar Aug 25 '23 13:08 hardikdava

@SkalskiP Would it be possible to use a callback system for loading images? Then we would not have to worry about so many things.

hardikdava avatar Aug 28 '23 16:08 hardikdava

@hardikdava didn't you tell me a few weeks ago that callbacks make everything more complicated?

SkalskiP avatar Aug 29 '23 12:08 SkalskiP

I also ran into this issue: https://github.com/autodistill/autodistill/issues/45

In this case the problem was with the ClassificationDataset. I would suggest keeping track of image paths instead of images and loading them only when an image is accessed. A relatively easy way to implement this would be to replace the "images" dict that maps from str to ndarray with a lazy-loading dict whose setter stores file paths while its getter loads the image.

I'm not sure where these classes are used and whether this is performance-critical, e.g. during training of an image classification model. I assume it isn't, but if it were I'd probably resort to more efficient solutions like PyTorch datasets + dataloaders.

tfriedel avatar Aug 31 '23 18:08 tfriedel

Example:

from __future__ import annotations

from collections.abc import MutableMapping
from dataclasses import dataclass
from typing import Dict, List, Tuple

import cv2
import numpy as np

# Import paths may vary between supervision versions:
from supervision.classification.core import Classifications
from supervision.dataset.core import BaseDataset
from supervision.dataset.utils import train_test_split

class LazyLoadDict(MutableMapping):
    def __init__(self, initial_data: Dict[str, str]):
        self._data = initial_data

    def __getitem__(self, key: str) -> np.ndarray:
        return cv2.imread(self._data[key])

    def __setitem__(self, key: str, value: str) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

@dataclass
class ClassificationDataset(BaseDataset):
    classes: List[str]
    images: LazyLoadDict
    annotations: Dict[str, Classifications]

    def __len__(self) -> int:
        return len(self.images)

    def split(self, split_ratio=0.8, random_state=None, shuffle: bool = True) -> Tuple[ClassificationDataset, ClassificationDataset]:
        image_names = list(self.images.keys())
        train_names, test_names = train_test_split(
            data=image_names,
            train_ratio=split_ratio,
            random_state=random_state,
            shuffle=shuffle,
        )
        train_dataset = ClassificationDataset(
            classes=self.classes,
            images=LazyLoadDict({name: self.images._data[name] for name in train_names}),
            annotations={name: self.annotations[name] for name in train_names},
        )
        test_dataset = ClassificationDataset(
            classes=self.classes,
            images=LazyLoadDict({name: self.images._data[name] for name in test_names}),
            annotations={name: self.annotations[name] for name in test_names},
        )
        return train_dataset, test_dataset

    # ... (rest of the methods, adjusted to use LazyLoadDict when needed)
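
The load-on-access behaviour is easy to verify if the loader is made pluggable instead of hard-coding cv2.imread. This simplified variant is my own sketch (not the class above verbatim); the fake loader records its calls so the laziness is observable without any image files on disk:

```python
from collections.abc import MutableMapping
from typing import Callable, Dict

class LazyDict(MutableMapping):
    """Stores lightweight sources (e.g. file paths) and materializes
    values only when an item is accessed."""

    def __init__(self, initial_data: Dict[str, str], loader: Callable[[str], object]):
        self._data = dict(initial_data)
        self._loader = loader

    def __getitem__(self, key):
        return self._loader(self._data[key])  # load on every access

    def __setitem__(self, key, value):
        self._data[key] = value  # the setter stores the path, not the payload

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

loaded = []

def fake_loader(path):
    loaded.append(path)
    return f"image<{path}>"

images = LazyDict({"cat.jpg": "/data/cat.jpg"}, fake_loader)
assert loaded == []  # construction loads nothing
assert images["cat.jpg"] == "image</data/cat.jpg>"
assert loaded == ["/data/cat.jpg"]  # loaded only on access
```

Note that this variant re-loads on every access; if repeated accesses are common, an optional cache (e.g. `functools.lru_cache` around the loader) would trade some memory back for speed.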

tfriedel avatar Aug 31 '23 18:08 tfriedel

Thanks @tfriedel for the suggestions. We will take a look at it soon. This might be the solution to our current issue.

hardikdava avatar Sep 04 '23 16:09 hardikdava

I implemented this to solve the issue of training on a 10,000+ image dataset on my machine. I did it both for ClassificationDataset and DetectionDataset. Additionally, I had to swap out the detections_map dict and replace it with a shelve (basically a dict that is stored on disk): the values were essentially segmentation maps, and those also consumed too much memory. The modifications were made both to the supervision package and the autodistill base models. I'm not sure if this is enough, but I could make a PR for those two bits; you will probably want to extend it. I also don't think the shelve solution is the most elegant, but it solved my urgent need in the quickest way.
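
As a rough illustration of the shelve idea (standard library only; the keys and values below are made up), a disk-backed mapping can stand in for an in-memory detections dict like this:

```python
import os
import shelve
import tempfile

# A shelve behaves like a dict but persists entries to disk,
# so large per-image annotations don't accumulate in RAM.
store_dir = tempfile.mkdtemp()
store_path = os.path.join(store_dir, "detections_map")

with shelve.open(store_path) as detections_map:
    # Values just need to be picklable; here a plain dict stands in
    # for per-image detection data such as boxes and masks.
    detections_map["frame_0001.jpg"] = {
        "xyxy": [[10, 20, 110, 220]],
        "class_id": [0],
    }

# Reopening the shelve restores entries from disk on demand.
with shelve.open(store_path) as detections_map:
    restored = detections_map["frame_0001.jpg"]
```

One caveat of shelve is that mutating a retrieved value does not write it back unless the shelve is opened with `writeback=True`, which in turn caches entries in memory again.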

tfriedel avatar Sep 04 '23 16:09 tfriedel

@tfriedel feel free to open a PR. Please review the contribution guide before opening one.

hardikdava avatar Sep 04 '23 16:09 hardikdava

I added two PRs: https://github.com/roboflow/supervision/pull/353 https://github.com/autodistill/autodistill/pull/48

Please feel free to make further changes to those.

tfriedel avatar Sep 04 '23 20:09 tfriedel