In-memory statistics calculation

Open MichalOleszak opened this issue 1 year ago • 4 comments

Hello,

Do you support in-memory computation of statistics, or are you planning to add such a feature?

Details

I'd like to obtain statistics like the ones returned by imagelab.get_stats() for an image that is kept in memory rather than stored in a filesystem.

Let's say I have a vision model deployed and it receives an image for inference via a REST API. The image is a numpy array or a PIL Image. I'd like to be able to obtain the statistics for it before passing it to the model for inference. A working solution I came up with is saving the image to a tempdir and calling cleanvision on it, but this is, unsurprisingly, very slow.

In case you are not planning on developing such a feature, could you please advise on a faster workaround than using tempdir? Thanks!

MichalOleszak avatar Jul 10 '23 09:07 MichalOleszak

Hi @MichalOleszak! Thanks for your question. You can use cleanvision on in-memory images by wrapping them in a Hugging Face Dataset object. Here's a code snippet that does that:

from PIL import Image
import os
from datasets import Dataset
from cleanvision import Imagelab

if __name__ == "__main__":
    # loading images in-memory
    files = os.listdir("./tests/data")
    fpaths = [os.path.join("./tests/data", f) for f in files]
    image_list = [Image.open(f) for f in fpaths]
    
    # construct in-memory dataset
    mydict = {"image": image_list}
    dataset = Dataset.from_dict(mydict)
    
    # call cleanvision on this dataset
    imagelab = Imagelab(hf_dataset=dataset, image_key="image")
    imagelab.find_issues()
    imagelab.report()
    print(imagelab.get_stats())

sanjanag avatar Jul 10 '23 12:07 sanjanag

Hey @sanjanag,

Thanks a lot for the quick reply!

The solution you suggest works well, but my quick-and-dirty experiments suggest that for a single image (the use case I'm most interested in) it's actually slower than dumping to a tempdir.

I assume you are not planning to expose APIs in the form of get_brightness(img: Image) -> float?

MichalOleszak avatar Jul 10 '23 13:07 MichalOleszak

Hi @MichalOleszak! That sure looks like a good use case. We already have code for computing these stats in bulk, but not per image, though it should not be difficult to expose them. You can find the related code in image_property.py. If you take a look at the implemented ImageProperty classes, the calculate() method computes the raw value of the statistic and the get_scores() method converts it into a score between 0 and 1. Would you be interested in working on exposing these statistics methods from the package for the use case mentioned above?
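
As a rough illustration of that two-step pattern (a raw value, then a score between 0 and 1), a standalone per-image brightness helper might look like the sketch below. This is not cleanvision's actual code: the function names calculate_brightness and get_brightness, the perceived-brightness weights, the NumPy-only input, and the assumption of 8-bit pixel values are all choices made for this example.

```python
import numpy as np


def calculate_brightness(image: np.ndarray) -> float:
    """Raw statistic: perceived brightness of the mean pixel, on a 0-255 scale.

    Hypothetical helper, not part of the cleanvision API.
    Accepts an H x W (grayscale) or H x W x C (RGB/RGBA) array of 8-bit values.
    """
    arr = np.asarray(image, dtype=np.float64)
    if arr.ndim == 2:
        # Grayscale: reuse the single channel mean for red, green and blue.
        red = green = blue = float(arr.mean())
    elif arr.ndim == 3 and arr.shape[2] >= 3:
        # Average each of the first three channels; any alpha channel is ignored.
        red, green, blue = (float(c) for c in arr[..., :3].reshape(-1, 3).mean(axis=0))
    else:
        raise ValueError(f"unsupported image shape: {arr.shape}")
    # Assumed perceived-brightness weights; they sum to 1, so white maps to 255.
    return float(np.sqrt(0.241 * red**2 + 0.691 * green**2 + 0.068 * blue**2))


def get_brightness(image: np.ndarray) -> float:
    """Score in [0, 1]: the raw value divided by the 8-bit maximum."""
    return calculate_brightness(image) / 255.0
```

A PIL Image in RGB or L mode can be passed in after converting it with np.asarray(img), so nothing has to be written to disk for a single-image check.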

sanjanag avatar Jul 12 '23 13:07 sanjanag

See also: https://github.com/cleanlab/cleanvision/issues/210

jwmueller avatar Jul 13 '23 15:07 jwmueller