Lazy dataset loading
Description
Implements #316. The `images` dict in `DetectionDataset` and `ClassificationDataset`, which maps `str` to `ndarray`, was replaced by `LazyLoadDict`, whose setter only stores the filename while the getter loads the image. The dataset therefore keeps track of filenames instead of image contents, which keeps memory usage low and allows datasets that do not fit in memory.
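For readers skimming the diff, here is a minimal sketch of the idea; the names and details below are illustrative, not necessarily the exact implementation in this PR:

```python
from collections.abc import MutableMapping

import cv2


class LazyLoadDict(MutableMapping):
    """Dict-like mapping from image name to image array that stores only
    file paths and decodes the image on each access."""

    def __init__(self):
        self._paths = {}  # image name -> file path on disk

    def __setitem__(self, key, path):
        # Store only the file path, not the decoded pixels.
        self._paths[key] = path

    def __getitem__(self, key):
        # Decode lazily; nothing is cached, so memory stays low.
        return cv2.imread(self._paths[key])

    def __delitem__(self, key):
        del self._paths[key]

    def __iter__(self):
        return iter(self._paths)

    def __len__(self):
        return len(self._paths)
```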
Type of change
- [x] Bug fix (breaking change which fixes an issue)
(It is a breaking change, though, as the interface changes.)
How has this change been tested, please provide a testcase or example of how you tested the change?
Some unit tests were modified, and the change was tested with a modified version of autodistill. I'm not claiming nothing breaks: I'm not aware of all use cases, nor have I tested them all. I only tested autodistill with Grounded SAM for object detection using masks, plus YOLOv8. You are encouraged to do further testing. Also, the high memory consumption issue is still present for the `annotations` attribute of `DetectionDataset`; it could be addressed in a similar way, or with a `shelve` instead of a dict.
Docs
- [ ] Docs updated? What were the changes: Docs have not been updated. Since the interface changed (`images` now needs to be a `LazyLoadDict` instead of a regular dict), any documentation covering it still needs to be updated.
@tfriedel Thanks for the PR. We will take a look; it might take a bit. @capjamesg Can you also review this, as it might affect autodistill? @onuralpszr Do you want to take care of the review?
@hardikdava taking the review.
I am excited about this change! I have run into out-of-memory errors in Colab when working with large datasets with supervision. The only autodistill use I can find is:
```python
dataset = sv.DetectionDataset(
    self.ontology.classes(), images_map, detections_map
)
dataset.as_yolo(
    output_folder + "/images",
    output_folder + "/annotations",
    min_image_area_percentage=0.01,
    data_yaml_path=output_folder + "/data.yaml",
)
```
As long as the `as_yolo` method has been tested and will continue to work as normal, autodistill should not be impacted.
Sounds fine. @capjamesg can you share a Colab to test the PR, if possible?
@onuralpszr have you tested the PR with a large dataset? I am afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.
I tested with medium-size data. I will test a bit more and post my results today; I also had IRL work to finish.
Initial memory usage results for loading of images.

Before:

```
Images size: 18520 bytes
Images size: 0.01766204833984375 MB
Images size: 1.7248094081878662e-05 GB
```

After:

```
Images size: 48 bytes
Images size: 4.57763671875e-05 MB
Images size: 4.470348358154297e-08 GB
```
I also did a memray test with the graphical memory profiler to see heap size and resident size. Basically, we run a Python script that loads the dataset and then exits.
Before: heap and resident size (memray screenshot)
After: heap and resident size (memray screenshot)
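For reference, a measurement like this can also be reproduced directly from Python via memray's `Tracker` API instead of the `memray run` CLI; a minimal sketch (the output file name is a placeholder):

```python
import memray

import supervision as sv

# Record allocations while the dataset is loaded; afterwards, inspect heap
# and resident size with: memray flamegraph load_dataset.bin
with memray.Tracker("load_dataset.bin"):
    ds = sv.DetectionDataset.from_yolo(
        images_directory_path="fashion-assistant-segmentation-5/train/images",
        annotations_directory_path="fashion-assistant-segmentation-5/train/labels",
        data_yaml_path="fashion-assistant-segmentation-5/data.yaml",
    )
```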
The script I used to load the dataset (the dataset itself was downloaded with the Roboflow download script):
```python
import sys

import supervision as sv

dataset_location = "fashion-assistant-segmentation-5"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"{dataset_location}/train/images",
    annotations_directory_path=f"{dataset_location}/train/labels",
    data_yaml_path=f"{dataset_location}/data.yaml",
)

# memory usage of the dataset
print(f"Dataset size: {sys.getsizeof(ds)} bytes")
# Convert to MegaBytes (MB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024 / 1024} GB")

# ds.images memory usage
print(f"Images size: {sys.getsizeof(ds.images)} bytes")
# Convert to MegaBytes (MB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024 / 1024} GB")
```
@onuralpszr I think you cannot use `getsizeof` to measure the size of the `LazyLoadDict`, because (from https://docs.python.org/3/library/sys.html):

> Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to. See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually, or use bigger datasets and see whether memory consumption keeps growing even after a garbage collection.
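For illustration, a rough sketch of such a deep measurement, loosely following the recipe the docs link to (`deep_getsizeof` is a hypothetical helper, not part of supervision):

```python
import gc
import sys
from collections.abc import Mapping


def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and everything it
    references, counting each object only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, Mapping):
        size += sum(
            deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
            for k, v in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


# Collect first so freed-but-uncollected image buffers do not skew the result.
gc.collect()
# print(f"Images deep size: {deep_getsizeof(ds.images)} bytes")  # ds from the script above
```

Note that walking a `LazyLoadDict` this way would trigger its getter and decode every image, so for the lazy case one would measure its internal filename mapping instead.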
It does get bigger. I used a 10k-image set (with the before-PR dataset):

```
Images size: 295000 bytes
Images size: 0.28133392333984375 MB
Images size: 0.00027474015951156616 GB
```
@tfriedel @onuralpszr I tested this PR. I would say it is better than the existing solution. @SkalskiP please take a look as well, as this might need changing the API a bit. But the solution works quite well on a large dataset.
@tfriedel the iterator on a loaded dataset is not working at all. Can you take a look? It is quite important and used in many places. It would be great if someone could test all the features of `sv.DetectionDataset` and `sv.ClassificationDataset`.

Code to reproduce my issue:
```python
import supervision as sv

data = "train2017"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"../../dataset/coco/images/{data}",
    annotations_directory_path=f"../../dataset/coco/labels/{data}",
    data_yaml_path=f"../../supervision/data/coco128.yaml",
)

for image_name, image, labels in ds:
    print(f"{image_name} : {image.shape}, {len(labels)}")
```
@hardikdava I cannot reproduce your problem; for me this works fine. I downloaded the dataset from https://www.kaggle.com/datasets/ultralytics/coco128 and https://raw.githubusercontent.com/ultralytics/ultralytics/main/ultralytics/cfg/datasets/coco128.yaml and adapted the paths; otherwise the code is the same.
@tfriedel then it might be something wrong on my end. Just a question: can we use the lazy dict for detections as well? Then the memory increase from annotations would be solved too. What do you think?
@hardikdava for detections, see the PR in https://github.com/autodistill/autodistill/pull/48/files where I just store them in a `shelve` (basically a dict that lives on disk instead of in memory). We cannot use `LazyLoadDict`, as the detections don't correspond to files (I think?), so you need a way to write them to disk first; after that you could use a `LazyLoadDict`. I'm not entirely happy with the shelve solution, because it just uses a temporary file (possibly on a disk without enough space), and it can also be problematic if that object is later accessed by multiple threads. But it's a quick fix and solved the problem for me. It's surely worth thinking about better options.

Another thing that could be improved is how images are opened in `from_yolo`. With a large dataset, every image is currently read in full only to get its height and width, so even with the lazy loading solution, loading will still be slow. I would usually use `Image.open()` from PIL, which reads the height and width from the file header only and is therefore much faster. I'm not even sure whether height and width need to be read at all. Anyway, I didn't want to change too much, since I don't know your reasons for going with OpenCV, and adding the Pillow dependency may be unwanted.
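To make the shelve idea concrete, here is a rough sketch of on-disk detection storage (not the code from the linked PR; the file path and image key are placeholders):

```python
import os
import shelve
import tempfile

import numpy as np
import supervision as sv

# A shelve behaves like a dict but pickles every value to a file on disk,
# so the detections no longer all have to fit in memory at once.
shelf_path = os.path.join(tempfile.mkdtemp(), "detections")

with shelve.open(shelf_path) as detections_map:
    # Writing goes through pickle straight to disk.
    detections_map["image_0001.jpg"] = sv.Detections(
        xyxy=np.array([[10.0, 10.0, 50.0, 50.0]]),
        class_id=np.array([0]),
    )
    # Reading unpickles from disk on each access.
    restored = detections_map["image_0001.jpg"]
    print(len(restored))  # 1
```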
@onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?
Let me take another look as well.
In `datasets.formats.yolo`, when loading YOLO annotations, the method uses `cv2.imread` to get the image shape. I suggest using some other method to get the shape, so it will be faster for large datasets.
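For example (assuming Pillow is acceptable as a dependency), the dimensions can be read from the file header without decoding the pixels; a minimal comparison, with a placeholder path:

```python
import cv2
from PIL import Image

image_path = "train/images/example.jpg"  # placeholder

# Full decode: reads and decompresses every pixel just to get the shape.
height, width, _ = cv2.imread(image_path).shape

# Header-only: Image.open is lazy and parses just the metadata.
with Image.open(image_path) as im:
    width, height = im.size  # (width, height)
```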
@onuralpszr @hardikdava Any update on this? I OOM'd on my end when using `sv.DetectionDataset.from_yolo`.
Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.
I see that images are only opened with `cv2` when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:
- Load an image
- Get an annotation
- Save the annotation in the dataset object
Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?
> Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

Yeah, getting this working would be amazing! This is a huge blocker for me.
> I see that images are only opened with `cv2` when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:
>
> - Load an image
> - Get an annotation
> - Save the annotation in the dataset object
>
> Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?

Seeing that it OOM's, I'm guessing it doesn't?
@tfriedel @onuralpszr I have tested the lazy-dataset-loading branches of the supervision and autodistill repos by @tfriedel, and I still can't process a 2k-image dataset: OOM error. It's not how the dict structure handles image paths, but how images are read with cv2 and saved; that looks like what is eating memory.
So, to recap what happens under the hood:

1. Read the images dir.
2. Iterate all paths and save them in the lazy dict.
3. Read every image with `cv2.imread`.
4. Annotate.
5. Save annotations into the detections map.
6. Generate the supervision dataset.
7. Save the image dataset, writing images into the labeled dir.

Something in this chain is causing the OOM.
@tfriedel the two repos are a ton of commits behind, so there is no SAHI or NMS, which makes things a little tricky. @capjamesg it took a lot of debugging, but if autodistill improved how it deals with large datasets, with all features included, it would be another level.
@Alaaeldinn thank you for the debugging and the insight. First, a quick update: it looks like I can't update that branch myself, so I merged the latest develop into this PR branch and created a new branch from it in the roboflow/supervision repo:
https://github.com/roboflow/supervision/tree/lazy-dataset-loading-updated
So you can try with the SAHI/NMS stuff too. I can also check the images and the OOM problem again. If you have an idea, please feel free to share it, or open a PR if needed.
@Alaaeldinn @onuralpszr Sorry for the problems you are facing. It has now been over 5 months since I implemented and last used this. I didn't have OOM issues for my use case back then, but since the PRs are now so far behind, and it's also possible that my solution was incomplete, I think your approach of creating a new PR makes a lot of sense.