
Lazy dataset loading

Open tfriedel opened this issue 1 year ago • 28 comments

Description

Implements #316. The `images` dict in DetectionDataset and ClassificationDataset, which maps str to ndarray, was replaced by LazyLoadDict, whose setter stores only filenames while the getter loads the image on access. Because it tracks filenames instead of image contents, memory usage stays low and larger datasets no longer need to fit in memory.
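
For illustration, the core idea looks roughly like this (a minimal sketch; the PR's actual implementation may differ in details such as caching and error handling):

from collections.abc import MutableMapping

import cv2
import numpy as np

class LazyLoadDict(MutableMapping):
    """Maps image name -> ndarray while storing only file paths.

    The image is read from disk on every access, so memory stays
    proportional to the number of paths, not the pixel data.
    """

    def __init__(self, paths=None):
        self._paths = dict(paths or {})

    def __setitem__(self, key: str, path: str) -> None:
        self._paths[key] = path  # remember the filename only

    def __getitem__(self, key: str) -> np.ndarray:
        return cv2.imread(self._paths[key])  # decode lazily on access

    def __delitem__(self, key: str) -> None:
        del self._paths[key]

    def __iter__(self):
        return iter(self._paths)

    def __len__(self) -> int:
        return len(self._paths)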

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (breaking change which fixes an issue)

(It is a breaking change though, as the interface changes)

How has this change been tested, please provide a testcase or example of how you tested the change?

Some unit tests were modified, and it was tested with a modified version of autodistill. I'm not claiming that nothing breaks, since I'm not aware of all use cases, nor have I tested them all. I only tested autodistill with groundedsam for object detection using masks and yolov8. You are encouraged to do further testing. Also, the issue of high memory consumption is still present with the annotations attribute of DetectionDataset. It could be addressed in a similar way, or with a shelve instead of a dict.
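
For the annotations, a shelve-based variant might look like this (a sketch only; the placeholder dict stands in for whatever Detections-like object is stored per image):

import os
import shelve
import tempfile

# An on-disk dict: values are pickled to a file instead of held in RAM.
db_path = os.path.join(tempfile.mkdtemp(), "annotations")
annotations = shelve.open(db_path)

annotations["image_0001.jpg"] = {"xyxy": [[10, 20, 30, 40]]}  # placeholder value
annotations.sync()  # flush to disk
print(annotations["image_0001.jpg"])
annotations.close()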

Docs

  • [ ] Docs updated? What were the changes: Docs have not been updated, but since the interface changed (images now needs to be a LazyLoadDict instead of a regular dict), any documentation covering this still needs to be updated.

tfriedel avatar Sep 04 '23 19:09 tfriedel

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 04 '23 19:09 CLAassistant

@tfriedel Thanks for the PR. We will take a look; it might take a bit. @capjamesg Can you also review this, as it might affect autodistill? @onuralpszr Do you want to take care of the review?

hardikdava avatar Sep 05 '23 17:09 hardikdava

@hardikdava taking the review

onuralpszr avatar Sep 05 '23 17:09 onuralpszr

I am excited about this change! I have run into Out of Memory errors in Colab when working with large datasets in supervision. The only Autodistill usage I can find is:

dataset = sv.DetectionDataset(
    self.ontology.classes(), images_map, detections_map
)

dataset.as_yolo(
    output_folder + "/images",
    output_folder + "/annotations",
    min_image_area_percentage=0.01,
    data_yaml_path=output_folder + "/data.yaml",
)

As long as the as_yolo method has been tested and will continue to work as normal, autodistill should not be impacted.

capjamesg avatar Sep 06 '23 11:09 capjamesg

Sounds fine. @capjamesg can you share a Colab to test the PR, if possible?

hardikdava avatar Sep 06 '23 11:09 hardikdava

@onuralpszr have you tested the PR with a large dataset? I'm afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.

hardikdava avatar Sep 06 '23 15:09 hardikdava

> @onuralpszr have you tested the PR with a large dataset? I'm afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.

I tested with a medium-sized dataset. I will test a bit more and post my results today, after some IRL work I have to finish.

onuralpszr avatar Sep 06 '23 15:09 onuralpszr

Initial memory usage results for loading images

Before

Images size: 18520 bytes
Images size: 0.01766204833984375 MB
Images size: 1.7248094081878662e-05 GB

After

Images size: 48 bytes
Images size: 4.57763671875e-05 MB
Images size: 4.470348358154297e-08 GB

onuralpszr avatar Sep 06 '23 19:09 onuralpszr

I also did a memray test with the graphical memory profiler to see heap size and resident size. Basically, we run a Python script that loads the dataset and exits.
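
A run like this can be reproduced with memray's Python tracking API (an illustrative sketch; the actual script and dataset paths differ):

import gc

import memray
import supervision as sv

with memray.Tracker("dataset_load.bin"):  # records allocations to this file
    ds = sv.DetectionDataset.from_yolo(
        images_directory_path="dataset/train/images",
        annotations_directory_path="dataset/train/labels",
        data_yaml_path="dataset/data.yaml",
    )
    gc.collect()  # force a collection so the snapshot reflects live memory

# Inspect heap and resident size with: memray flamegraph dataset_load.bin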

Before HEAP and Resident size

(screenshot: memray heap and resident size before the change)

After HEAP and Resident size

(screenshot: memray heap and resident size after the change)

onuralpszr avatar Sep 06 '23 19:09 onuralpszr

The script I used to load the dataset; I used the Roboflow script to download the datasets:


import sys
import supervision as sv

dataset_location = "fashion-assistant-segmentation-5"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"{dataset_location}/train/images",
    annotations_directory_path=f"{dataset_location}/train/labels",
    data_yaml_path=f"{dataset_location}/data.yaml",
)

# memory usage of the dataset
print(f"Dataset size: {sys.getsizeof(ds)} bytes")
# Convert to MegaBytes (MB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024 / 1024} GB")

# ds.images memory usage
print(f"Images size: {sys.getsizeof(ds.images)} bytes")
# Convert to MegaBytes (MB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024 / 1024} GB")


onuralpszr avatar Sep 06 '23 19:09 onuralpszr

@onuralpszr I think you cannot use getsizeof to measure the size of the LazyLoadDict, because:

> Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to. See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

from: https://docs.python.org/3/library/sys.html

Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger sets and check whether memory consumption keeps growing even after a garbage collection.
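
For illustration, a recursive variant along the lines of that recipe (a sketch, not the exact recipe from the docs):

import sys

def deep_getsizeof(obj, _seen=None) -> int:
    """Sum sys.getsizeof over an object and everything it contains."""
    _seen = _seen if _seen is not None else set()
    if id(obj) in _seen:  # avoid double-counting shared objects
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            deep_getsizeof(k, _seen) + deep_getsizeof(v, _seen)
            for k, v in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, _seen) for item in obj)
    return size

On a plain dict of ndarrays this also counts the pixel buffers, which a flat getsizeof misses.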

tfriedel avatar Sep 06 '23 20:09 tfriedel

> @onuralpszr I think you cannot use getsizeof to measure the size of the LazyLoadDict, because:
>
> > Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to. See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.
>
> from: https://docs.python.org/3/library/sys.html
>
> Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger sets and check whether memory consumption keeps growing even after a garbage collection.

It does get bigger. I used a 10k image set (the "before" dataset):

Images size: 295000 bytes
Images size: 0.28133392333984375 MB
Images size: 0.00027474015951156616 GB

onuralpszr avatar Sep 06 '23 21:09 onuralpszr

@tfriedel @onuralpszr I tested this PR. I would say it is better than the existing solution. @SkalskiP please take a look as well, as this might need changing the API a bit. But the solution works quite well on large datasets.

@tfriedel the iterator on a loaded dataset is not working at all. Can you take a look? It is quite important and used in many places. It would be great if someone could test all the features of sv.DetectionDataset and sv.ClassificationDataset.

hardikdava avatar Sep 07 '23 08:09 hardikdava

Code to reproduce my issue.


import supervision as sv

data = "train2017"

ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"../../dataset/coco/images/{data}",
    annotations_directory_path=f"../../dataset/coco/labels/{data}",
    data_yaml_path=f"../../supervision/data/coco128.yaml",
)

for image_name, image, labels in ds:
    print(f"{image_name} : {image.shape}, {len(labels)}")


hardikdava avatar Sep 07 '23 08:09 hardikdava

@hardikdava I cannot reproduce your problem; for me this works fine. I downloaded the dataset from https://www.kaggle.com/datasets/ultralytics/coco128 and https://raw.githubusercontent.com/ultralytics/ultralytics/main/ultralytics/cfg/datasets/coco128.yaml and adapted the paths, otherwise using the same code.

tfriedel avatar Sep 07 '23 08:09 tfriedel

@tfriedel then it might be something wrong on my end. Just a question: can we use a lazy dict for the detections as well? That would solve the memory increase there too. What do you think?

hardikdava avatar Sep 07 '23 10:09 hardikdava

@hardikdava for detections, see the PR at https://github.com/autodistill/autodistill/pull/48/files where I just store them in a shelve (basically a dict that lives on disk instead of in memory). We cannot use LazyLoadDict, as the detections don't correspond to files (I think?), so you need a way to write them to disk; you could do that first and then use a LazyLoadDict. I'm not super happy with the shelve solution, because it just uses a temporary file (possibly on a disk without enough space), and it can also be problematic if that object is later accessed by multiple threads, etc. But it's a quick fix and solved the problem for me. It's surely worth thinking about better options.

Another thing that could be improved is the opening of images in from_yolo. With a large dataset, every image is read in full only to get its height and width. So even with a lazy loading solution, it will still be slow for a large dataset. I would usually use Image.open() from PIL, which reads the height and width from the header only and is therefore much faster. I'm not even sure the height and width need to be read at all. Anyway, I didn't want to change too much, because I don't know your reasons for going with OpenCV, and maybe adding the Pillow dependency is unwanted.
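
For illustration, the difference looks roughly like this (a sketch; supervision's actual from_yolo code may handle this differently):

import cv2
from PIL import Image

def image_size_cv2(path: str) -> tuple[int, int]:
    h, w = cv2.imread(path).shape[:2]  # decodes the entire image
    return w, h

def image_size_pil(path: str) -> tuple[int, int]:
    with Image.open(path) as im:  # lazy: reads only the file header
        return im.size  # (width, height), no pixel decoding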

tfriedel avatar Sep 07 '23 10:09 tfriedel

@onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?

hardikdava avatar Oct 08 '23 14:10 hardikdava

> @onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?

Let me take another look as well.

onuralpszr avatar Oct 08 '23 14:10 onuralpszr

In datasets.formats.yolo, when loading YOLO annotations, the method uses cv2.imread to get the image shape. I suggest using some other method to get the shape, so it is faster for large datasets.

netanelbi avatar Oct 16 '23 05:10 netanelbi

@onuralpszr @hardikdava Any update on this? I OOM'd on my end when using sv.DetectionDataset.from_yolo.

AChangXD avatar Dec 26 '23 19:12 AChangXD

Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

SkalskiP avatar Dec 28 '23 08:12 SkalskiP

I see that images are only opened with cv2 when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:

  1. Load an image
  2. Get an annotation
  3. Save the annotation in the dataset object

Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?
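
My understanding of the mechanics, assuming the lazy dict doesn't cache what it loads (annotate below is a hypothetical labeling step):

for name in ds.images:  # iterates keys; no pixels loaded yet
    image = ds.images[name]  # decoded from disk here
    ds.annotations[name] = annotate(image)  # hypothetical labeling step
    # On the next iteration `image` is rebound; the previous array loses
    # its last reference and CPython frees it immediately. Note that
    # ds.annotations itself still grows in memory, though.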

capjamesg avatar Jan 03 '24 15:01 capjamesg

> Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

Yeah getting this working would be amazing! This is a huge blocker for me

AChangXD avatar Jan 08 '24 01:01 AChangXD

> I see that images are only opened with cv2 when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:
>
> 1. Load an image
> 2. Get an annotation
> 3. Save the annotation in the dataset object
>
> Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?

Seeing that it OOMs, I'm guessing it doesn't?

AChangXD avatar Jan 08 '24 01:01 AChangXD

@tfriedel @onuralpszr I have tested the lazy-dataset-loading branches of the supervision and autodistill repos by @tfriedel, and I still can't process a 2k-image dataset: OOM error. It is not how the image paths are handled in the dict structure, but how the images are read with cv2 and saved; that looks like what is eating memory.

So, to recap what happens under the hood:

  1. read the images dir
  2. iterate all paths and save them in the lazy dict
  3. read every image with cv2.imread
  4. annotate
  5. save the annotations -> detection map
  6. generate the supervision dataset
  7. save the image dataset, writing images into the labeled dir

Something is missing, causing all of this OOM.

@tfriedel the two repos are a ton of commits behind, so no shi or nms, which makes it a little bit tricky. @capjamesg a lot of debugging, but if autodistill improved how it deals with large datasets, with all features included, it would be another level.

Alaaeldinn avatar Feb 12 '24 13:02 Alaaeldinn

@Alaaeldinn thank you for debugging and for the insight. To start, a quick update: it looks like I can't update that branch myself, so I merged the latest develop into this PR branch and created a new branch out of it in the roboflow/supervision repo:

https://github.com/roboflow/supervision/tree/lazy-dataset-loading-updated

So you can try with the shi/nms stuff too. I can also check the images and the OOM problem again. If you have an idea, please feel free to share it, or open a PR if needed.

onuralpszr avatar Feb 12 '24 13:02 onuralpszr

@Alaaeldinn @onuralpszr Sorry for the problems you are facing. It has now been over 5 months since I implemented and last used this. I didn't have OOM issues for my use case back then, but since the PRs are now so far behind, and it's also possible that my solution was incomplete, I think your approach of creating a new PR makes a lot of sense.

tfriedel avatar Feb 12 '24 14:02 tfriedel