
How to make output consistent over video?

Open JonathonLuiten opened this issue 1 year ago • 12 comments

On the website there are video results where the scale / shift are consistent across the video.

If I just run the method per frame, it is very obvious that these are not consistent.

Is there code to make it consistent?

JonathonLuiten avatar Oct 30 '24 14:10 JonathonLuiten

Thank you for your interest in our project. We are currently in the process of cleaning up the code and will release it once it's ready. Below is a brief description of our implementation (a rough sketch of the per-frame loop follows the list):

  • We maintain a list of point maps registered in world space.
  • For each frame in the video sequence:
    • Estimate the camera-space point map of the current frame using MoGe.
    • Select a few previous frames as references and compute dense image matching between the current frame and the reference frames using PDCNet. (Other robust image-matching methods can also be used.)
    • [KEY] Solve for the point-set rigid transformation (1-DoF scale, 3-DoF rotation, and 3-DoF translation) based on the matching results. RANSAC is employed to remove outliers. A code snippet for this part can be found at https://github.com/microsoft/MoGe/issues/6#issuecomment-2451412282
    • Apply the calculated rigid transformation to the current frame's point map and append it to the registered point maps list.
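
A rough Python sketch of that per-frame loop, just to make the data flow concrete. The helpers `estimate_pointmap_moge` and `dense_match` are placeholders (not actual APIs of this repository) standing in for MoGe inference and a dense matcher such as PDCNet; `rigid_registration_ransac` refers to the RANSAC snippet shared later in this thread.

import numpy as np

registered_pointmaps = []   # world-space point maps, one (H, W, 3) array per registered frame
registered_frames = []      # corresponding RGB frames, kept for image matching

for frame in video_frames:
    # 1. Camera-space point map of the current frame (MoGe inference, placeholder helper).
    points_cam = estimate_pointmap_moge(frame)                              # (H, W, 3)

    if not registered_pointmaps:
        # The first frame defines the world coordinate system.
        registered_pointmaps.append(points_cam)
        registered_frames.append(frame)
        continue

    # 2. Dense matching against a few recent reference frames (placeholder helper).
    p, q = [], []
    for ref_idx in range(max(0, len(registered_frames) - 3), len(registered_frames)):
        cur_xy, ref_xy = dense_match(frame, registered_frames[ref_idx])     # (M, 2) pixel coords each
        p.append(points_cam[cur_xy[:, 1], cur_xy[:, 0]])                    # current camera-space points
        q.append(registered_pointmaps[ref_idx][ref_xy[:, 1], ref_xy[:, 0]]) # matched world-space points
    p, q = np.concatenate(p), np.concatenate(q)

    # 3. RANSAC similarity registration (scale + rotation + translation); see the snippet below.
    (s, R, t), inliers = rigid_registration_ransac(p, q)

    # 4. Bring the whole point map into world space and register it.
    registered_pointmaps.append(s * points_cam @ R.T + t)
    registered_frames.append(frame)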

Please note that achieving video consistency is NOT one of the primary objectives of MoGe. This application only demonstrates the potential for video reconstruction through a simplified implementation that considers rigid registration alone. To enhance consistency, additional optimization techniques would be necessary.

See also a related issue (multi-view reconstruction application).

EasternJournalist avatar Oct 31 '24 07:10 EasternJournalist

Thanks for your answer!

This makes lots of sense!

One follow-up question: when you say 'rigid transformation', I assume you mean rotation + translation. Does that mean we assume that the 'scale' coming out of the MoGe network is automatically globally consistent? Is there a reason it should be consistent?

JonathonLuiten avatar Oct 31 '24 18:10 JonathonLuiten


The rigid transformation here includes scale, rotation and translation. The raw output scale of MoGe is unconstrained and not consistent across video frames, since it has been trained to be scale-invariant for single images.

Our implementation of RANSAC-based rigid (similarity) registration is quite simple. Notation:

  • $p_i$: a camera-space point in the current frame;
  • $q_i$: the matched world-space point in a reference frame;
  • $w_i$: a weight, inversely proportional to the depth of $p_i$.

The following code snippet solves for the transformation $(s, \mathbf{R}, \mathbf{t})$ given the two sets of 3D points $\{p_i\}_{i=1}^N, \{q_i\}_{i=1}^N$ and weights $\{w_i\}_{i=1}^N$ by minimizing

$$ \min_{s,\mathbf{R},\mathbf{t}}\sum_{i=1}^N w_i\left\Vert s\mathbf{R}\mathbf{p}_i+\mathbf{t}-\mathbf{q}_i\right\Vert_2^2 $$
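
For reference, the closed form computed by the snippet below is the weighted Kabsch/Umeyama-style solution:

$$ \bar{\mathbf p}=\frac{\sum_i w_i\mathbf p_i}{\sum_i w_i},\quad \bar{\mathbf q}=\frac{\sum_i w_i\mathbf q_i}{\sum_i w_i},\quad \mathbf H=\sum_i w_i(\mathbf p_i-\bar{\mathbf p})(\mathbf q_i-\bar{\mathbf q})^\top=\mathbf U\mathbf\Sigma\mathbf V^\top $$

$$ \mathbf R=\mathbf V\mathbf U^\top,\quad s=\frac{\operatorname{tr}(\mathbf\Sigma)}{\sum_i w_i\Vert\mathbf p_i-\bar{\mathbf p}\Vert_2^2},\quad \mathbf t=\bar{\mathbf q}-s\mathbf R\bar{\mathbf p}, $$

with the last column of $\mathbf V$ negated when $\det(\mathbf V\mathbf U^\top)<0$, so that $\mathbf R$ is a proper rotation.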

import numpy as np
from typing import Optional, Tuple

def rigid_registration(
    p: np.ndarray, 
    q: np.ndarray, 
    w: Optional[np.ndarray] = None, 
    eps: float = 1e-12
) -> Tuple[float, np.ndarray, np.ndarray]:
    # Weighted closed-form similarity registration:
    # find (s, R, t) minimizing sum_i w_i * ||s R p_i + t - q_i||^2.
    if w is None:
        w = np.ones(p.shape[0])
    # Weighted centroids of both point sets.
    centroid_p = np.average(p, axis=0, weights=w)
    centroid_q = np.average(q, axis=0, weights=w)

    p_centered = p - centroid_p
    q_centered = q - centroid_q
    w = w / (np.sum(w) + eps)

    # SVD of the weighted cross-covariance gives the optimal rotation.
    cov = (w[:, None] * p_centered).T @ q_centered
    U, S, Vh = np.linalg.svd(cov)
    R = Vh.T @ U.T
    if np.linalg.det(R) < 0:
        # Fix an improper rotation (reflection).
        Vh[2, :] *= -1
        R = Vh.T @ U.T
    # Optimal scale and translation follow in closed form.
    scale = np.sum(S) / np.trace((w[:, None] * p_centered).T @ p_centered)
    t = centroid_q - scale * (centroid_p @ R.T)
    return scale, R, t


def rigid_registration_ransac(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    max_iters: int = 20,
    hypothetical_size: int = 10,
    inlier_thresh: float = 0.02
) -> Tuple[Tuple[float, np.ndarray, np.ndarray], np.ndarray]:
    # RANSAC wrapper around rigid_registration. Returns ((s, R, t), inlier_mask).
    n = p.shape[0]
    if w is None:
        w = np.ones(n)
    
    best_score, best_inliers = 0., np.zeros(n, dtype=bool)
    best_solution = (1.0, np.eye(3), np.zeros(3))

    for _ in range(max_iters):
        # Fit a hypothesis on a small random subset of the correspondences.
        maybe_inliers = np.random.choice(n, size=hypothetical_size, replace=False)
        try:
            s, R, t = rigid_registration(p[maybe_inliers], q[maybe_inliers], w[maybe_inliers])
        except np.linalg.LinAlgError:
            continue
        transformed_p = s * p @ R.T + t
        # Weighted residuals of all correspondences under this hypothesis.
        errors = w * np.linalg.norm(transformed_p - q, axis=1)
        inliers = errors < inlier_thresh
        
        # Truncated-error score: higher is better.
        score = inlier_thresh * n - np.clip(errors, None, inlier_thresh).sum()
        if score > best_score:
            best_score, best_inliers = score, inliers
            # Refit on all inliers of the current best hypothesis.
            best_solution = rigid_registration(p[inliers], q[inliers], w[inliers])
    
    return best_solution, best_inliers
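
A minimal usage sketch of the snippet above, under illustrative assumptions: `points_cam` is the current frame's (H, W, 3) camera-space point map from MoGe, `matches_cur` / `matches_ref` are (M, 2) pixel coordinates (x, y) of dense matches against a registered reference frame (e.g. from PDCNet), and `ref_pointmap` is that reference frame's world-space point map. These variable names are not from the repository.

# Gather matched 3D correspondences from the dense matches.
p = points_cam[matches_cur[:, 1], matches_cur[:, 0]]      # current frame, camera space
q = ref_pointmap[matches_ref[:, 1], matches_ref[:, 0]]    # reference frame, world space

# Weight each correspondence inversely to its depth (the z of the camera-space point).
w = 1.0 / np.clip(p[:, 2], 1e-6, None)

# Robustly estimate the similarity transform and bring the whole point map into world space.
(s, R, t), inliers = rigid_registration_ransac(p, q, w)
points_world = s * points_cam @ R.T + t                   # registered (H, W, 3) point map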

EasternJournalist avatar Nov 01 '24 07:11 EasternJournalist

@EasternJournalist Hi, thanks for your great contribution! You mentioned that the scale, rotation, and translation can be solved for based on the matching results. Is there any recommended library or GitHub repo that implements the RANSAC algorithm? By the way, as far as I know traditional methods seldom optimize the 'depth scale' of monocular depth; how can I make it an optimizable variable in an existing code base? Thanks so much for your prompt reply!

guangkaixu avatar Nov 06 '24 09:11 guangkaixu

@guangkaixu Hi. The previous comment has been updated. The code for RANSAC is now shared in the code snippet.

EasternJournalist avatar Nov 07 '24 07:11 EasternJournalist

Thanks! It will be helpful and I'll give it a try.

By the way, after I have computed depth, camera intrinsics, and pose, is there an appropriate method to perform RGB-D fusion? I tried TSDF fusion, but its huge GPU memory requirement and the holes in the fused mesh are less than satisfactory.

guangkaixu avatar Nov 11 '24 06:11 guangkaixu


That is quite tricky. TSDF fusion is apparently not applicable to large-scale SLAM, especially with dynamic scenes. I am working on finding a convenient alternative too.

EasternJournalist avatar Nov 19 '24 06:11 EasternJournalist

Hi @EasternJournalist , Thanks for your great work! Will the code for processing video be released anytime soon?

kexul avatar Dec 11 '24 08:12 kexul

Hi, has anyone been able to use it?

sctrueew avatar Feb 20 '25 10:02 sctrueew

Hi @EasternJournalist,

I have implemented some code for a SLAM system and would like to ensure that I’m on the right track. Could you please review it and let me know if the implementation is correct or suggest improvements?

Your guidance would be greatly appreciated. Thank you!


import sys
from pathlib import Path
if (_package_root := str(Path(__file__).absolute().parents[2])) not in sys.path:
    sys.path.insert(0, _package_root)

import cv2
import torch
import numpy as np
from tqdm import tqdm
from moge.model import import_model_class_by_version
import open3d as o3d
from typing import Tuple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ---------- Rigid Registration with Scale (RANSAC) ----------
def rigid_registration(p: np.ndarray, q: np.ndarray, w: np.ndarray = None, eps: float = 1e-12) -> Tuple[float, np.ndarray, np.ndarray]:
    if w is None:
        w = np.ones(p.shape[0])
    centroid_p = np.average(p, axis=0, weights=w)
    centroid_q = np.average(q, axis=0, weights=w)

    p_centered = p - centroid_p
    q_centered = q - centroid_q
    w = w / (np.sum(w) + eps)

    cov = (w[:, None] * p_centered).T @ q_centered
    U, S, Vh = np.linalg.svd(cov)
    R = Vh.T @ U.T
    if np.linalg.det(R) < 0:
        Vh[2, :] *= -1
        R = Vh.T @ U.T
    scale = np.sum(S) / np.trace((w[:, None] * p_centered).T @ p_centered)
    t = centroid_q - scale * (centroid_p @ R.T)
    return scale, R, t

def rigid_registration_ransac(
    p: np.ndarray,
    q: np.ndarray,
    w: np.ndarray = None,
    max_iters: int = 10,
    hypothetical_size: int = 6,
    inlier_thresh: float = 0.02
) -> Tuple[Tuple[float, np.ndarray, np.ndarray], np.ndarray]:
    n_p = p.shape[0]
    n_q = q.shape[0]
    n = min(n_p, n_q)  # Use smaller point cloud size for sampling
    if w is None:
        w = np.ones(n_p)

    best_score, best_inliers = 0., np.zeros(n_p, dtype=bool)
    best_solution = (1.0, np.eye(3), np.zeros(3))

    # Build Open3D KDTree for q
    q_pcd = o3d.geometry.PointCloud()
    q_pcd.points = o3d.utility.Vector3dVector(q)
    tree = o3d.geometry.KDTreeFlann(q_pcd)

    for _ in range(max_iters):
        maybe_inliers = np.random.choice(n, size=min(hypothetical_size, n), replace=False)
        try:
            s, R, t = rigid_registration(p[maybe_inliers], q[maybe_inliers], w[maybe_inliers])
        except np.linalg.LinAlgError:
            continue
        transformed_p = s * p @ R.T + t

        # Compute nearest-neighbor distances using Open3D KDTree
        distances = np.zeros(n_p)
        indices = np.zeros(n_p, dtype=int)
        for i in range(n_p):
            _, idx, dist = tree.search_knn_vector_3d(transformed_p[i], 1)
            indices[i] = idx[0]
            distances[i] = np.sqrt(dist[0])  # Open3D returns squared distances
        errors = w * distances
        inliers = errors < inlier_thresh

        score = inlier_thresh * n_p - np.clip(errors, None, inlier_thresh).sum()
        if score > best_score:
            best_score = score
            best_inliers = inliers
            best_solution = rigid_registration(p[inliers], q[indices[inliers]], w[inliers])
    
    return best_solution, best_inliers

# ---------- Load image ----------
def load_image_tensor(path, resize_to=None):
    image = cv2.cvtColor(cv2.imread(str(path)), cv2.COLOR_BGR2RGB)
    if resize_to:
        h, w = image.shape[:2]
        image = cv2.resize(image, (resize_to, int(resize_to * h / w)), interpolation=cv2.INTER_AREA)
    image_tensor = torch.tensor(image / 255.0, dtype=torch.float32).permute(2, 0, 1)
    return image_tensor, image

# ---------- Run MoGe ----------
def run_moge_on_image(model, image_tensor, fov_x=None, use_fp16=False):
    if use_fp16:
        image_tensor = image_tensor.half()
    with torch.no_grad():
        output = model.infer(image_tensor.to(device), fov_x=fov_x, use_fp16=use_fp16)
    return {k: v.cpu().numpy() for k, v in output.items()}

# ---------- Create SLAM Scene ----------
def create_slam_scene(image_folder, model, fov_x=None, resize_to=1024):
    scene_points = []
    scene_colors = []
    previous_points_world = None

    image_paths = sorted(Path(image_folder).glob("*.png"))

    for idx, image_path in enumerate(tqdm(image_paths, desc="SLAM")):
        image_tensor, image = load_image_tensor(image_path, resize_to)
        output = run_moge_on_image(model, image_tensor, fov_x)

        depth = output["depth"]
        mask = output["mask"]
        intrinsics = output["intrinsics"]
        points_camera = output["points"]  # shape: (H, W, 3)

        # Masking and reshaping
        mask = mask > 0
        points = points_camera[mask]
        colors = image[mask]

        # Downsample point cloud for performance
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(points)
        pcd.colors = o3d.utility.Vector3dVector(colors / 255.0)
        pcd = pcd.voxel_down_sample(voxel_size=0.1)  # Adjust voxel_size as needed
        points = np.asarray(pcd.points)
        colors = np.asarray(pcd.colors) * 255.0

        if idx == 0:
            scene_points.append(points)
            scene_colors.append(colors)
            previous_points_world = points
        else:
            # Downsample previous points
            pcd_prev = o3d.geometry.PointCloud()
            pcd_prev.points = o3d.utility.Vector3dVector(previous_points_world)
            pcd_prev = pcd_prev.voxel_down_sample(voxel_size=0.1)
            q = np.asarray(pcd_prev.points)

            p = points
            sRt, inliers = rigid_registration_ransac(p, q)
            s, R, t = sRt

            aligned_points = s * p @ R.T + t
            scene_points.append(aligned_points)
            scene_colors.append(colors)
            previous_points_world = aligned_points

    return np.vstack(scene_points), np.vstack(scene_colors)

# ---------- Save to PLY ----------
def save_to_ply(points, colors, save_path):
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points.astype(np.float64))
    pcd.colors = o3d.utility.Vector3dVector((colors / 255).astype(np.float64))
    o3d.io.write_point_cloud(str(save_path), pcd)

# ---------- Main ----------
if __name__ == "__main__":
    model = import_model_class_by_version("v2").from_pretrained(r"moge-2-vitl.pt").to(device).eval()
    model.half()

    image_folder = r".\i"
    points, colors = create_slam_scene(image_folder, model, resize_to=1024)
    save_to_ply(points, colors, "scene_output.ply")

sctrueew avatar Jul 20 '25 05:07 sctrueew


In an autonomous driving scenario, I align the point cloud of each frame from MoGe-2 with the LiDAR and render a point-cloud video, but the rendered video shakes a lot and there are serious inconsistencies. Do you have any suggestions?

sunshinejuanjuan123 avatar Aug 25 '25 01:08 sunshinejuanjuan123

In an autonomous driving scenario, I align the point cloud of each frame from MoGe-2 with the LiDAR and render a point-cloud video, but the rendered video shakes a lot and there are serious inconsistencies. Has anyone in the community run this alignment code and can share the corresponding input video and output?

sunshinejuanjuan123 avatar Aug 25 '25 01:08 sunshinejuanjuan123