
Can frame extraction before labeling be done via command line or even sped up with DALI?

Wulin-Tan opened this issue 1 year ago • 12 comments

Hi, LP team: can Lightning Pose extract frames via the command line, and can the extraction be sped up with DALI? Extraction seems quite time consuming, especially when it involves k-means clustering as in DLC.

Wulin-Tan avatar Jun 19 '24 01:06 Wulin-Tan

@Wulin-Tan are you referring to a DLC function or a lightning-pose/Pose-app function?

I should mention that in the Pose-app repo we have functions for extracting frames from raw videos and saving them. I initially implemented this using DALI but believe it or not I found OpenCV to be slightly faster (which is what is implemented now). Much of the time required for this operation isn't the video reading but rather running PCA+kmeans on the downsampled frames.

themattinthehatt avatar Jun 19 '24 13:06 themattinthehatt

Hi, @themattinthehatt

  1. Exactly! I recently tried the extraction and found no advantage for DALI. The most time-consuming step is clustering the frames, so what I do now is resizing and grayscale conversion on the CPU, then clustering on the GPU with Python packages like skorch. This makes the preprocessing before extraction much faster (on my toy data, 10 min down to 20 s). I am wondering whether LP has standalone functions for this preprocessing (resizing and grayscale), ideally on the GPU?
  2. Do you mean that I need to install Pose-app to get the extraction function?

Wulin-Tan avatar Jun 19 '24 15:06 Wulin-Tan

oh wow 10 min -> 20 s is an impressive speedup! I'll have to look into skorch, thanks for the pointer.

The resizing and grayscale conversion happen on the GPU with DALI, which is why it's surprising that it's not faster than OpenCV 🤷‍♂️

You wouldn't necessarily need to install Pose-app if you just wanted to use that function - you could copy-paste it into your own pipeline. But it sounds like your skorch implementation is much faster than our opencv+sklearn function, so you should just use your own!

One thing you'll find in that function that's worth thinking about as well - we don't actually cluster every single frame, only frames with high motion energy. The idea is that the animal might spend a lot of time sitting still, and therefore there will be many redundant frames. So we filter out frames where there is not a lot of movement and only cluster frames where the animal is ostensibly moving, which gives us more diverse frames and is faster.
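
In pseudocode the idea is roughly the following (a simplified sketch of the approach, not the actual Pose-app implementation - the function and parameter names here are made up for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def select_diverse_frames(frames, n_select=100, pca_dims=32):
    # frames: array of shape (n_frames, h, w), grayscale and already downsampled
    flat = frames.reshape(len(frames), -1).astype(np.float32)

    # motion energy: mean absolute difference between consecutive frames
    motion = np.concatenate([[0.0], np.abs(np.diff(flat, axis=0)).mean(axis=1)])

    # keep only the more "active" half of the frames before clustering
    active = np.where(motion > np.percentile(motion, 50))[0]

    # reduce dimensionality with PCA, then cluster with k-means
    embedded = PCA(n_components=pca_dims).fit_transform(flat[active])
    kmeans = KMeans(n_clusters=n_select).fit(embedded)

    # pick the frame closest to each cluster center
    selected = []
    for center in kmeans.cluster_centers_:
        selected.append(active[np.argmin(np.linalg.norm(embedded - center, axis=1))])
    return sorted(set(selected))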

themattinthehatt avatar Jun 19 '24 15:06 themattinthehatt

Hi, @themattinthehatt here is the code I tried.

%%time
#GPU
import cv2
import numpy as np
import torch
import logging
from torch import nn
from sklearn.base import BaseEstimator, ClusterMixin
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)

# CPU-based frame resizing
def resize_frame_cpu(frame, size=(256, 256)):
    resized_frame = cv2.resize(frame, size)
    return resized_frame

# CPU-based frame conversion to grayscale
def convert_to_grayscale_cpu(frame):
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return gray_frame

# Read video frames
def read_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
    cap.release()
    return frames

# Process video frames using CPU
def process_frames_cpu(frames, batch_size=5):
    processed_frames = []
    for i in tqdm(range(0, len(frames), batch_size), desc="Processing frames"):
        batch_frames = frames[i:i+batch_size]
        for frame in batch_frames:
            try:
                resized_frame = resize_frame_cpu(frame)
                gray_frame = convert_to_grayscale_cpu(resized_frame)
                processed_frames.append(gray_frame.flatten())
            except Exception as e:
                logging.error(f"Error processing frame: {e}")
                continue
    return np.array(processed_frames)

# Custom K-means implementation using PyTorch
class KMeansModel(nn.Module):
    def __init__(self, n_clusters, max_iter=100):
        super(KMeansModel, self).__init__()
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def forward(self, X):
        device = X.device
        indices = torch.randperm(X.size(0), device=device)[:self.n_clusters]
        centroids = X[indices]

        for _ in range(self.max_iter):
            distances = torch.cdist(X, centroids)
            labels = torch.argmin(distances, dim=1)
            # keep the previous centroid if a cluster ends up empty (avoids NaN centroids)
            new_centroids = torch.stack([
                X[labels == j].mean(dim=0) if (labels == j).any() else centroids[j]
                for j in range(self.n_clusters)
            ])
            if torch.allclose(centroids, new_centroids):
                break
            centroids = new_centroids

        return labels, centroids

# Skorch wrapper for the custom K-means model
class SkorchKMeans(BaseEstimator, ClusterMixin):
    def __init__(self, n_clusters, max_iter=100, device='cpu'):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.device = device
        self.model = KMeansModel(n_clusters, max_iter).to(device)

    def fit(self, X, y=None):
        self.model.train()
        X = X.to(self.device)
        self.labels_, self.centroids_ = self.model(X)
        return self

    def predict(self, X):
        self.model.eval()
        X = X.to(self.device)
        with torch.no_grad():
            labels, _ = self.model(X)
        return labels.cpu().numpy()

# Main function to integrate everything
def main(video_path, n_clusters=5):
    logging.info("Reading video frames...")
    frames = read_video(video_path)
    logging.info(f"Read {len(frames)} frames")

    logging.info("Processing frames on CPU...")
    processed_frames = process_frames_cpu(frames, batch_size=5)
    logging.info(f"Processed {processed_frames.shape[0]} frames")

    # Convert processed frames to a format suitable for PyTorch
    processed_frames_torch = torch.tensor(processed_frames, dtype=torch.float32)

    # Check if CUDA is available and set the device
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logging.info(f"Using device: {device}")

    logging.info("Performing K-means clustering...")
    kmeans = SkorchKMeans(n_clusters=n_clusters, max_iter=100, device=device)
    kmeans.fit(processed_frames_torch)
    labels = kmeans.predict(processed_frames_torch)
    logging.info(f"Clustering completed. Cluster labels: {labels}")

if __name__ == "__main__":
    video_path = "videos/xxxxx.mp4"
    main(video_path)

I just asked ChatGPT to generate the code and found that it can run on the GPU. I haven't checked it in detail yet, so please be careful with it. By the way, another recommended option is cuML in RAPIDS. I tried cuML a few years ago on single-cell sequencing data (high-dimensional data) and it worked perfectly, but I haven't tried it on behavior data yet.

Wulin-Tan avatar Jun 19 '24 16:06 Wulin-Tan

thanks for sharing, I'll definitely look into this!

themattinthehatt avatar Jun 19 '24 16:06 themattinthehatt

Hi, @themattinthehatt just to update: I highly recommend cuML for now.

  1. I checked the code in detail and found that it is easy to exhaust GPU memory with many GPU-accelerated methods (that is, it runs fine on toy data but might crash on real video data).
  2. RAPIDS libraries like cuML have a particular advantage in GPU memory management through RMM (RAPIDS Memory Manager) - see the note after the code below.
  3. On one of my real video datasets, with the same hardware and the same settings (100 clusters, max_iter=100, ...), scikit-learn takes 8 min 48 s and cuML takes 28.6 s.
  4. My cuML code:
%%time
import cupy as cp
import cuml
from cuml.preprocessing import StandardScaler
from cuml.cluster import KMeans

# Assuming features is already defined and available as a NumPy array

# Scale features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Transfer scaled features to GPU
features_scaled_gpu = cp.array(features_scaled)

# KMeans with cuML
kmeans = KMeans(n_clusters=100, init='k-means++', max_iter=100)
kmeans.fit(features_scaled_gpu)

# Get the cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

print("Cluster centers:", centers)
print("Assignments:", labels)

Wulin-Tan avatar Jun 21 '24 16:06 Wulin-Tan

cool, that's quite the speedup! do you seem to get a diverse set of frames from this procedure?

themattinthehatt avatar Jun 21 '24 18:06 themattinthehatt

> cool, that's quite the speedup! do you seem to get a diverse set of frames from this procedure?

Hi, @themattinthehatt what do you mean by 'a diverse set of frames'? Do you mean the frame label result? I ran labels[0:500] and it shows the output below (my video is about 18,000 frames; I converted it to grayscale and downsampled it to 256x256 to fit my GPU):

array([38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,
       38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,
       38, 38, 38, 38, 38, 38, 38,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3, 20, 20, 20, 20, 20, 20, 20,
       20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
       20, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 32, 32, 74, 74, 74, 74, 74, 74, 74, 74,
       74, 74, 74, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
       96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
       96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
       96, 96, 96, 96, 96, 96, 96, 96, 31, 31, 31, 31, 31, 31, 31, 31, 31,
       31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 74, 74,
       74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
       74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
       74, 74, 74, 74, 74, 74, 74, 74, 74,  8,  8,  8,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  8,  8], dtype=int32)

Wulin-Tan avatar Jun 21 '24 23:06 Wulin-Tan

You're doing this frame extraction for labeling new frames, right? I'm curious whether, if you extract 100 frames in this way, they all look visibly different from each other - the mouse is in different parts of the arena, in different poses, etc.

themattinthehatt avatar Jun 22 '24 00:06 themattinthehatt

> You're doing this frame extraction for labeling new frames, right? I'm curious whether, if you extract 100 frames in this way, they all look visibly different from each other - the mouse is in different parts of the arena, in different poses, etc.

Hi, @themattinthehatt

  1. Yes, I hope to accelerate the extraction step before labeling. The problem with certain methods like DLC is that they do not store the clusters, so I have to wait for the clustering to run again if I need more representative images.
  2. I checked the video and found that poses, position, and location all seem to contribute to the clustering, and I could not tell by eye which one contributes more.
  3. I checked your code and think it is more reasonable, since you include PCA and focus on high-motion frames.
  4. Can you give more details about how to make use of the functions in extract_frames.py?

Wulin-Tan avatar Jun 22 '24 15:06 Wulin-Tan

Hi, @themattinthehatt I checked extract_frames.py in detail and found it very good. The idea of taking high-energy frames is super cool - it can save a lot of time. Even running PCA/k-means on a CPU it is already efficient enough (I tried 500 clusters). I would suggest including all these functions not only in the LP app but also in LP itself. Still, if you could offer more details or examples in the LP tutorial, that would be great - I just want to make sure I do not misunderstand your code.

By the way, in the find_contextual_frames function, if the center frame is 1, then it gets [0, 1, 2, 3] (-1 is excluded), and this [0, 1, 2, 3] is not a five-frame chunk, so will it still be included in the LP context model? The find_contextual_frames function also checks for groups of consecutive frames: if a group has 5 or more consecutive frames, it removes the first two and last two frames from that group. What purpose does this serve?

Wulin-Tan avatar Jun 23 '24 04:06 Wulin-Tan

@Wulin-Tan glad you found the functions in extract_frames.py helpful. We've considered putting them in the LP repo itself, but since we don't also offer labeling tools in LP, we decided to group all of the non-training/inference code together in the app.

Re find_contextual_frames - our documentation for the app code is not so good yet. This function is actually for when people upload zipped folders of frames into the app for labeling - we want to check whether they have included context frames or not. If we decide context frames have been included, this function returns only the frames that need to be labeled, not their context, so that we can upload the proper frames into Label Studio. For the specific case you raised, when the center frame is 1, the LP code will see that the first context frame should be -1 and will just load a second copy of frame 0 instead, so there will be no errors.
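
Conceptually the context indices just get clamped to the valid frame range - a toy illustration of the idea (not the actual LP code):

import numpy as np

def context_indices(center, n_frames, width=2):
    # 5-frame context window around `center`, with out-of-range indices
    # replaced by the nearest edge frame (e.g. center=1 -> [0, 0, 1, 2, 3])
    return np.clip(np.arange(center - width, center + width + 1), 0, n_frames - 1)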

One question for you - are you attempting to label 500 frames from a single video? If so, one piece of advice: label fewer frames from more videos rather than many frames from a single video - this leads to better results on new animals. So if your labeling budget is 1000 frames, it is better to label 50 frames from 20 animals than 500 frames from 2 animals.

themattinthehatt avatar Jun 24 '24 13:06 themattinthehatt