Extend CleanVision to run on video data
Some discussion of this here: https://github.com/cleanlab/cleanvision/discussions/214
Making CleanVision runnable on video would be super useful to many folks!
Should be fairly easy:
- just extract every k-th frame from the video,
- run cleanvision on all those images,
- aggregate the results across frames per video
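A minimal end-to-end sketch of that three-step flow (the frame-extraction helper, folder layout, and the aggregation at the end are illustrative assumptions; only `Imagelab`, `find_issues()`, and the `issues` DataFrame come from CleanVision's actual API):

```python
import os

import cv2  # assumption: OpenCV is one convenient way to decode frames
from cleanvision import Imagelab


def extract_every_kth_frame(video_path, out_dir, k=30):
    """Write every k-th decoded frame of one video to out_dir as a JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % k == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()


# 1. extract every k-th frame from each video into a per-video folder
for video in ["vid_a.mp4", "vid_b.mp4"]:  # hypothetical input files
    extract_every_kth_frame(video, os.path.join("frames", video))

# 2. run CleanVision on all of those images
imagelab = Imagelab(data_path="frames")
imagelab.find_issues()

# 3. aggregate the per-frame results into per-video scores
issues = imagelab.issues.copy()
# assumption: the issues DataFrame is indexed by image file path
issues["video"] = [os.path.dirname(p) for p in issues.index]
score_cols = [c for c in issues.columns if c.endswith("_score")]  # e.g. dark_score
per_video_scores = issues.groupby("video")[score_cols].mean()
```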
I'd be super interested in this -- it might come in handy when sampling for annotations, etc. Doing some on-the-fly deduplication with a perceptual hash to avoid sampling visually identical scenes, before passing frames on for further analysis, could also improve the efficiency of the process.
@LemurPwned This would be an awesome contribution! Nobody is working on this actively AFAIK.
@lafemme was interested in trying CleanVision for video and could test out your PR code too!
@jwmueller Great, I started playing around with the idea and for simplicity's sake I created a separate repo for the sampling part: https://github.com/LemurPwned/video-sampler. It seems that, depending on the settings, a significant reduction in frames can take place. It's also pretty damn quick!
I'll try to integrate the sampling with the CleanVision API in a PR soon.
awesome super excited about this!!
@lafemmephile thanks, just to be clear, you want to pick up from here?
@lafemmephile thanks for clarifying :) -- I'm fine with this, let me know if you need any help. As to the outline you mentioned, it makes sense -- the trickiest part is probably avoiding reimplementation of the features from Imagelab.
@lafemmephile I'm not a maintainer for this project, so I'm not sure to what extent my insight is useful here. But I was essentially thinking of something similar to spaCy's pipeline: a simple, minimal object which takes in the config for each step in the pipeline and then executes them in sequence.
So, I'd create a small video sampler object as a separate API, define its config, then write a simple pipeline executor instance that runs the sampler, saves the output, and points Imagelab at it. In principle, prepping the config input correctly (as a kwargs dict or something to that effect) would decouple those objects from most changes in Imagelab's object API. And it allows you to add more steps before/after if needed.
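A minimal sketch of that kind of config-driven executor (the step functions and config keys below are purely illustrative placeholders):

```python
class Pipeline:
    """Run a sequence of steps, each with its own config, in order."""

    def __init__(self, steps):
        # steps: list of (callable, config_dict) pairs -- hypothetical structure
        self.steps = steps

    def run(self, data):
        for step, config in self.steps:
            data = step(data, **config)
        return data


# hypothetical usage: sample frames from videos, then point Imagelab at the output
# pipeline = Pipeline([
#     (sample_frames, {"every_k": 30, "output_folder": "frames/"}),
#     (run_imagelab, {"issue_types": {"dark": {}, "blurry": {}}}),
# ])
# pipeline.run("videos/")
```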
Hi @LemurPwned ! The video sampler works great. Could you explain in more detail the approach you're suggesting for integrating the video sampler in cleanvision? More specifically, some pseudocode for what the classes would look like?
@lafemmephile We could start with a VideoLab class that'd take care of sampling frames from videos, saving them, and running cleanvision on those frames. I'd suggest also thinking about a few finer points while working on this extension:
- What issues should VideoLab check for? For example, odd_size would be rendered useless in the case of a video since all images will be the same size.
- The VideoLab code should be decoupled from the Imagelab code as much as possible to avoid any breaking changes in Imagelab.
- What are the preprocessing steps needed for VideoLab? For instance, sampling frames is a preprocessing step.
- What's the structure of the data object that will be carried through VideoLab? For example, in Imagelab we use the Dataset object to store all information related to the dataset. Can we use the same one, or do we need to make a new one?
- What would the visualization of results look like for Videolab?
- How would one detect duplicate videos? We can start with image property issues to make it simple.
@lafemmephile Feel free to start more discussions on these points. I think we could start with a layout of the code first and then start filling in details for specific methods.
@lafemmephile your understanding is correct.
@sanjanag I was just thinking about a minimal implementation. Here's a more detailed outline.
- Modify the `FSDataset` class slightly by allowing a switch from `IMAGE_FILE_EXTENSIONS` to `VIDEO_FILE_EXTENSIONS`. I think the other dataset classes don't really apply since we will be intentionally subsampling.
- Create an object, which you would call `VideoLab`, that accepts said `Dataset` object.
- I'm not sure to what extent the concept of an issue applies to a video. For sure, in big, mixed datasets we may require removal of overly dark or blurry frames. Here lies the problem: if you decide to include more complex issue filters at the video level, then running `ImageLab`-like analysis afterwards may become redundant. Instead, I can see how one may be more inclined to keep `ImageLab` as the primary object where all the heavy-duty analytics happens, and have `VideoLab` do only minimal video deduplication so that the downstream tasks are not too time consuming. On the other hand, you may opt for putting some more filtering into the `VideoLab` object and fine-tune/optimise it for the video use case. I don't know which one is the right answer at this point -- but there are definitely some gains in optimising for video sampling performance.
As to the API outline, I was thinking of something relatively simple like this (the pipeline is very crude and not abstracted here):
```python
from tempfile import TemporaryDirectory

VIDEO_FILE_EXTENSIONS = ["*.mp4", "*.avi", "*.mkv", "*.mov", "*.webm"]


# FSDataset, VideoLab, and ImageLab below refer to the (proposed) classes discussed above
class VideoPipeline:
    def __init__(self, input_folder, output_folder) -> None:
        # overwrite the default allowed extensions so only video files are picked up
        dataset = FSDataset(input_folder, allowed_extensions=VIDEO_FILE_EXTENSIONS)
        self.output_folder = output_folder
        # temporary folder to store the sampled frames
        self.tmp_out = TemporaryDirectory()
        self.video_lab = VideoLab(dataset=dataset, output_dataset=self.tmp_out.name)
        self.image_lab = ImageLab(dataset=self.tmp_out.name)

    def run(self, sampling_config, issue_config, n_jobs: int):
        self.video_lab.sample(**sampling_config, n_jobs=n_jobs)
        self.image_lab.find_issues(**issue_config, n_jobs=n_jobs)
        # save results to the output folder, maybe move images?
        self.image_lab.save(self.output_folder)
        self.tmp_out.cleanup()
```
The `VideoLab` would look very similar to what you've seen in the video-sampler example repo.
Hey everyone,
I have some experience with videos, so I may have some suggestions that might be helpful:
- you guys can use I-frames instead of just randomly sampling some frames. Typically there is about 1 I-frame per second, and they are a bit less noisy compared to P- or B-frames. One way to do this is with ffmpeg-python (`pip install ffmpeg-python`), something like this should work:

```python
import os

import ffmpeg


def extract_i_frames(input_video_path, output_directory):
    # create the output directory if it doesn't exist
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    # FFmpeg filter that keeps only I-frames (keyframes)
    ffmpeg.input(input_video_path).output(
        os.path.join(output_directory, "frame%d.jpg"),
        vf="select=eq(pict_type\\,I)",
        vsync=0,
    ).run()
    print("I-frames extracted successfully.")


# example usage
input_video_path = "input_video.mp4"  # replace with the path to your input video file
output_directory = "output_frames"    # replace with the desired output directory
extract_i_frames(input_video_path, output_directory)
```
- I would write the algorithm in a way that is easily extensible to other 3D volumes, such as CT scans and MRI. A CT scan is basically a couple of hundred layers of images. So, once `VideoPipeline` is completed, the algorithm should pretty much work for a CT scan with a small wrapper function.
- Not sure how useful this would be for you guys, but I recently implemented an algorithm that takes a video and, using YOLO and DeepSORT, finds temporal inconsistencies in it: https://github.com/smttsp/temporal_consistency_odt. I am not an expert in this field, so I cannot think of a way to integrate something like that into this repo, but it's just an idea.

Let me know if I can help in any other way.
@smttsp In the video-sampler example repo I'm using keyframe decoding with PyAV, which binds to the same C libraries that ffmpeg uses under the hood (not the Python bindings, the C libraries themselves). This gives you programmatic access to frames as they are being decoded (and to all of their metadata, such as motion vectors), which I think is more flexible than going through the ffmpeg bindings.
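For reference, keyframe-only decoding with PyAV looks roughly like this (setting `skip_frame` to `"NONKEY"` tells the decoder to decode only keyframes; the file name is illustrative):

```python
import av  # PyAV

container = av.open("input_video.mp4")  # hypothetical input file
stream = container.streams.video[0]
# decode only keyframes (I-frames); everything else is skipped by the decoder
stream.codec_context.skip_frame = "NONKEY"

for frame in container.decode(stream):
    image = frame.to_image()  # PIL.Image, ready for hashing or saving
    image.save(f"keyframe_{frame.pts}.jpg")

container.close()
```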
Regarding point 3: I took a look at your repo and it's super interesting! I'm contemplating extending the example repo with an arbitrary filter that could operate on K accumulated frames (so it's easily extensible), which would make it possible to run a combination of detection + tracking, like you did, on the fly.
@LemurPwned, cool, both sound good.
Hi @lafemmephile ! The idea of this GitHub issue is to extend what cleanvision detects to videos. Hence, we strictly want to focus on extending cleanvision issues for videos.
> What issues (if any) must be developed exclusively for Videolab (i.e. new issues only used for video data)?

This needs more brainstorming, and we should take it up in a separate issue.
The class for sampling frames from videos is a good starting point, and it seems like you also have a decent idea of what would go in VideoLab. I'd suggest that at this point you create a PR which would first detect image property issues in videos, like dark, light, etc.
Hi @smttsp ! Great to see your inputs on this issue. https://github.com/smttsp/temporal_consistency_odt looks great. Our team recently added support for object detection in cleanlab package. Seems like we could use your inputs there.
Hey @lafemmephile,
@smttsp Very interesting work with the temporal consistency ... I want to look over your work more in-depth.
Thank you! LMK if I can help in any way or if you can think of any feature that can be directly or indirectly used here. We can also think of incorporating the visualizations/exports.
Maybe we can begin to discuss what video issues make sense to start with
One of the major considerations is whether we will evaluate videos or individual frames.
I think frames would make more sense. Suppose you have two 1-minute videos with a 10-second overlap, so the videos are near duplicates, but each video also has 50 seconds of unique content. If we categorize the videos as near duplicates, then you need to discard one of them, which doesn't make sense because you would be discarding 50 seconds of unique content.
If frames are in question, you can discard one copy of the overlapping 10 seconds and you are fine.
But the problem then is: what if we have a static video containing several near-duplicate frames?
I like the following solution a lot:
If we export the frames in a smarter way, i.e., only export the unique frames, we can have intra-video (frame-wise) uniqueness.
Then near duplicate frames can only occur between two different videos.
Simple pseudocode for video frame extraction with deduplication:

```python
frame_hash_set = set()
for frame in selected_frames:
    frame_hash = hash(frame)  # e.g. a hash of the frame contents
    if frame_hash not in frame_hash_set:
        frame_hash_set.add(frame_hash)
        export_frame(frame)   # pseudocode: write the frame to disk
```
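A slightly more concrete variant of that idea using perceptual hashes (the `imagehash` library and the Hamming-distance threshold are my suggestion for illustration, not something the thread has settled on), so that visually similar rather than byte-identical frames get skipped:

```python
import imagehash
from PIL import Image


def export_unique_frames(frame_paths, max_hamming_distance=5):
    """Keep a frame only if no previously kept frame has a similar perceptual hash."""
    kept_hashes, unique_frames = [], []
    for path in frame_paths:
        phash = imagehash.phash(Image.open(path))
        # subtracting two ImageHash objects gives their Hamming distance
        if all(phash - prev > max_hamming_distance for prev in kept_hashes):
            kept_hashes.append(phash)
            unique_frames.append(path)
    return unique_frames
```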
> - How many issues currently used in Imagelab can be used directly without modification in Videolab?
> - How many issues currently used in Imagelab can be used with some modification in Videolab?
> - What issues (if any) must be developed exclusively for Videolab (i.e. new issues only used for video data)?
The above suggestion basically converts a video into a set of unique frames, which would enable us to use all the features of Imagelab. The only issue I see here is that the visualization of videos should be a bit different from images. Not sure how to do that in an elegant way, but we can discuss that later on.
> Hi @smttsp ! Great to see your inputs on this issue. https://github.com/smttsp/temporal_consistency_odt looks great. Our team recently added support for object detection in cleanlab package. Seems like we could use your inputs there.
@sanjanag, of course, I would be happy to help!
@smttsp I am curious how your frame sampling approach differs from @LemurPwned's VideoSampler. Again, lots of really good ideas being put forth; I cannot wait to mature the current extended solution to the point where we can look at some of these more advanced video-file diffs or video-file introspection techniques. Very exciting stuff.
Actually, I just read the VideoSampler code more thoroughly and it seems that it is already using p-hash. Great job @LemurPwned 👏 👏 👏
> How many issues currently used in Imagelab can be used directly without modification in Videolab?

I think all issues can be used if the goal is to build the tool on a frame-by-frame basis. But then, what is the point of calling it `VideoLab`?
> How many issues currently used in Imagelab can be used with some modification in Videolab?

If the goal is to compare videos, then all of the issues require some extra work. For example, exact duplication of videos is when all the frames of two videos are exactly the same, and near duplication is when at least `k` of `n` frames (above some threshold) are near duplicates.
Light, blurry, dark, and low information are all going to be based on some threshold. We may have a few frames that are one of those, but the rest of the video might be perfect.
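A rough sketch of that k-of-n near-duplicate check at the video level (the function name, the Hamming-distance threshold, and the k-fraction are illustrative assumptions; the per-frame hashes could come from `imagehash.phash` as above):

```python
def videos_are_near_duplicates(hashes_a, hashes_b, max_hamming_distance=8, k_fraction=0.8):
    """Flag two videos as near duplicates if at least a k-fraction of the shorter
    video's frames has a near-duplicate frame in the other video."""
    shorter, longer = sorted([hashes_a, hashes_b], key=len)
    matches = sum(
        any(h - other <= max_hamming_distance for other in longer)  # Hamming distance
        for h in shorter
    )
    return matches / len(shorter) >= k_fraction
```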
> What issues (if any) must be developed exclusively for Videolab (i.e. new issues only used for video data)?

I think VideoLab should find both frame-related and video-related issues. Everything ImageLab does should be done on a frame-by-frame basis. Along with those, we can come up with a couple of other issues specifically for videos. A few things I can think of are:

- `video-type`: static video vs non-static video, or other types we can think of. Based on this, we can tell whether a video has low or high information (a rough sketch of a static-video check follows below).
- `light`/`dark`/`low information`: if the entire video consists of such frames. The naming could be slightly different, e.g. `low-information-video`, `saturated/overexposed-video`, `dark-video`, etc.
- poor quality or highly compressed videos
- very long videos (e.g. a 90-min football match: how much information will those frames (90x60x30 frames) add to model training? Even if we only take 1 frame per second, that is 5400 frames).
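One possible heuristic for the `video-type` (static vs non-static) check, sketched with OpenCV; the sampling stride and the difference threshold are made-up values to illustrate the idea:

```python
import cv2
import numpy as np


def is_static_video(video_path, stride=30, diff_threshold=5.0):
    """Classify a video as static if sampled frames barely change over time."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, diffs, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                # mean absolute pixel difference between consecutive sampled frames
                diffs.append(float(np.mean(cv2.absdiff(gray, prev_gray))))
            prev_gray = gray
        idx += 1
    cap.release()
    return bool(diffs) and float(np.mean(diffs)) < diff_threshold
```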
@smttsp I think it's a very valid distinction -- what I don't have a clear idea of is how to do that effectively over a large dataset. For instance, taking the low-information case as an example, I can imagine we arrive at a 2D histogram, wanting to compute the video entropy, and, say, solve some constrained tree optimisation problem where you sample X frames with some minimal temporal distance apart (to avoid sampling from clusters of temporally close, high-entropy pictures). But this probably decomposes into sampling with that constraint first and then running Imagelab with the low-information issue? At the same time, it's probably more work to define that as a video issue.
On a side note, I'd be inclined to test it out in the demo repo, but with a variant where the entropy is minimised over a finite buffer.
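The constrained sampling described above could be approximated with a simple greedy heuristic, sketched below; this is only a toy illustration of the idea (not what video-sampler actually does), and the parameter values are made up:

```python
import numpy as np


def sample_high_entropy_frames(entropies, num_frames=10, min_gap=30):
    """Greedily pick the highest-entropy frames while keeping every pair of
    picked frame indices at least `min_gap` apart."""
    order = np.argsort(entropies)[::-1]  # highest-entropy frames first
    picked = []
    for idx in order:
        if all(abs(int(idx) - p) >= min_gap for p in picked):
            picked.append(int(idx))
        if len(picked) == num_frames:
            break
    return sorted(picked)
```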
@lafemmephile There's a lot of nice ideas discussed here, but I still think your first PR should keep things simple. Future PRs can extend the set of video issues that can be detected further, but IMO it's best to focus on getting that first PR in now.
I think the first PR can stick with the original strategy I outlined:
- just extract every k-th frame from the video,
- run cleanvision on all those images,
- aggregate the results across the frames extracted per video, into scores per video.
Note the complexity is in Step 3. We need to define aggregation functions to get issue scores per video out of the individual frames' issue scores. For example: the dark-score for a video might be the 0.95-percentile of the frames' dark scores (so that we only say a video is dark if most of its frames are dark).
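For instance, if the per-frame results end up in a pandas DataFrame, the aggregation could be a one-liner; the `video_id` and `dark_score` columns below are assumptions for illustration, not CleanVision's actual output schema (in CleanVision, lower scores indicate a more severe issue):

```python
import pandas as pd

# hypothetical per-frame output: one row per extracted frame
frame_issues = pd.DataFrame({
    "video_id":   ["vid_a", "vid_a", "vid_a", "vid_b", "vid_b"],
    "dark_score": [0.10, 0.15, 0.20, 0.90, 0.95],
})

# per-video dark score = 0.95-quantile of that video's frame scores, so the
# video-level score only stays low (i.e. dark) if nearly all frames score low
video_dark_score = frame_issues.groupby("video_id")["dark_score"].quantile(0.95)
print(video_dark_score)
```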
As to which issue types make sense here, I'd first stick with just a couple in the first PR (e.g. light/dark, blurry). We don't need to support every CleanVision issue type in the first PR for `Videolab`. I'd rather see just a couple of issue types supported well, and then launch `Videolab`, than spend a long time refining all of the issue types for `Videolab` simultaneously. It is better to add support for the rest of the CleanVision issue types in later PRs, and then add support for video-specific issue types in subsequent PRs after that.
@smttsp Regarding your point:
> I think all issues can be used if the goal is to build the tool on a frame-by-frame basis. But then, what is the point of calling it VideoLab?
The point is that the tool produces one score per video to quantify its overall quality in terms of each issue type. From the user's perspective, the tool is analyzing the video; the fact that it is analyzing frames within the video is an internal detail abstracted away from the user. The tool is still providing nontrivial value here: figuring out how to best aggregate the cleanvision issue scores for each frame in the video into an overall score for the video requires effort on our part.
In future versions, we can also analyze entire video sequences as new issue types, but I would prefer to avoid this for now for simplicity of shipping v0 of `Videolab`.