Match the output of scipy.ndimage.zoom in `resize_image()` of the multimodality vision module

Open · Blaizzy opened this issue 10 months ago · 1 comment

SciPy was deprecated in https://github.com/Blaizzy/mlx-vlm/pull/301 and replaced with PIL and OpenCV (cv2). However, their outputs are neither numerically identical to scipy.ndimage.zoom nor within the comparison threshold.

The replacements work well and preserve detail, but I would like to match scipy.ndimage.zoom as closely as possible, both because the source implementation uses it and to avoid edge cases.
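A likely source of the mismatch is where each library samples the input grid (an editor's sketch, not code from the PR). With its default `grid_mode=False`, `scipy.ndimage.zoom` aligns the centres of the first and last pixels, so output index `o` maps to input coordinate `o * (n_in - 1) / (n_out - 1)`. PIL's `Image.resize` instead uses half-pixel centres, `(o + 0.5) * n_in / n_out - 0.5`, which roughly corresponds to scipy's `grid_mode=True`. The helper `source_coords` below is hypothetical and uses NumPy only:

```python
import numpy as np


def source_coords(n_in, n_out, align_corners):
    """Map output indices to fractional input indices along one axis (hypothetical helper).

    align_corners=True: scipy.ndimage.zoom default (grid_mode=False); the first
    and last pixel centres of input and output coincide.
    align_corners=False: half-pixel-centre convention used by PIL's Image.resize.
    """
    o = np.arange(n_out, dtype=np.float64)
    if align_corners:
        if n_out == 1:
            # scipy falls back to a zoom of 1 here, so the single sample is index 0
            return np.zeros(1)
        return o * (n_in - 1) / (n_out - 1)
    # Half-pixel centres: scale the pixel-centre positions, then shift back by 0.5
    return (o + 0.5) * (n_in / n_out) - 0.5
```

For a 4-pixel axis upsampled to 8 pixels, the corner-aligned mapping runs from 0 to 3 in steps of 3/7, while the half-pixel mapping runs from -0.25 to 3.25 in steps of 0.5, so the same interpolation order is evaluated at different positions. Two further differences are worth keeping in mind: PIL widens its filter support when downsampling (an antialiasing effect scipy does not apply), and PIL's `BICUBIC` is a cubic convolution kernel, whereas scipy's `order=3` is a spline interpolation with prefiltering.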

Apr 17 '25 13:04 Blaizzy

The closest implementation so far:

import mlx.core as mx
import numpy as np
from PIL import Image
from PIL.Image import Resampling


def resize_image(image_np, new_size=(96, 96), order=1):
    """
    Resize a multi-channel image with PIL, one channel at a time.

    Args:
        image_np (numpy.ndarray): Batched input array of shape
            (batch, height, width, channels); only the first item is resized.
        new_size (tuple): The target size as (height, width).
        order (int): scipy-style interpolation order, mapped to a PIL filter:
            0 -> NEAREST, 2 or 3 -> BICUBIC, anything else -> BILINEAR.

    Returns:
        mx.array: The resized image of shape (new_height, new_width, channels).
    """
    # Drop the batch dimension, then get dimensions
    image_np = image_np[0]
    height, width, channels = image_np.shape

    # Choose interpolation method based on order parameter
    resample_method = Resampling.BILINEAR  # Default to bilinear
    if order == 0:
        resample_method = Resampling.NEAREST
    elif order in (2, 3):
        resample_method = Resampling.BICUBIC

    # Handle different channel configurations
    if channels == 1:
        # Single-channel (grayscale): reshape to a 2D (height, width) array
        image_2d = image_np.reshape(height, width)

        # A float32 array becomes a mode "F" PIL image, avoiding uint8 quantization
        pil_image = Image.fromarray(image_2d.astype(np.float32))

        # Resize using PIL (note: PIL takes (width, height) order)
        resized_pil = pil_image.resize(
            (new_size[1], new_size[0]), resample=resample_method
        )

        # Convert back to numpy and restore the channel dimension
        resized_np = np.array(resized_pil).reshape((new_size[0], new_size[1], 1))
    else:
        # Multi-channel: PIL's float mode "F" is single-channel,
        # so resize each channel separately
        resized_channels = []

        for c in range(channels):
            channel_data = image_np[:, :, c]
            pil_channel = Image.fromarray(channel_data.astype(np.float32))
            resized_channel = pil_channel.resize(
                (new_size[1], new_size[0]), resample=resample_method
            )
            resized_channels.append(np.array(resized_channel))

        # Stack channels back together along the last axis
        resized_np = np.stack(resized_channels, axis=2)

    # Convert to mx.array
    return mx.array(resized_np)
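If exact agreement matters most for the default `order=1` path, one alternative (an editor's sketch, not the project's code) is to skip PIL and do corner-aligned linear interpolation directly in NumPy, which should correspond to what `scipy.ndimage.zoom` computes for `order=1` with its default `grid_mode=False`. The name `zoom_linear_numpy` is hypothetical; it takes a single `(height, width, channels)` array (no batch dimension) and returns float64 NumPy, whereas scipy keeps the input dtype. It has not been checked against scipy's boundary handling for every shape, so treat it as a starting point for a tolerance test rather than a verified drop-in.

```python
import numpy as np


def zoom_linear_numpy(image, new_size):
    """Bilinear resize of an (H, W, C) array intended to mimic scipy.ndimage.zoom(order=1).

    Assumes scipy's default grid_mode=False: corner pixel centres are aligned,
    so output index o maps to input coordinate o * (n_in - 1) / (n_out - 1).
    """
    image = np.asarray(image, dtype=np.float64)
    in_h, in_w = image.shape[:2]
    out_h, out_w = new_size

    def _corner_aligned_coords(n_in, n_out):
        # A length-1 output samples index 0 (scipy uses a zoom of 1 in that case)
        if n_out == 1:
            return np.zeros(1)
        coords = np.arange(n_out) * ((n_in - 1) / (n_out - 1))
        # Guard against float round-off pushing the last coordinate past n_in - 1
        return np.clip(coords, 0, n_in - 1)

    ys = _corner_aligned_coords(in_h, out_h)
    xs = _corner_aligned_coords(in_w, out_w)

    # Integer neighbours and fractional weights along each axis
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]

    # Interpolate along width for the upper and lower rows, then along height;
    # the weights broadcast over the channel axis
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

In `resize_image`, the `order == 1` case could then return `mx.array(zoom_linear_numpy(image_np, new_size))` after the batch dimension is dropped; comparing that against `scipy.ndimage.zoom` in a local test (with SciPy as a test-only dependency) would show whether it stays within the threshold.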

Apr 17 '25 15:04 Blaizzy