
After Wan-Video added support for multiple attention implementations, the results became worse.

Open tiga-dudu opened this issue 9 months ago • 8 comments

The current version notes that Wan-Video now supports multiple attention implementations, but its results are far worse than those of the wan-train branch. I tried many times with the same image, the same prompt, and the same parameters, and main is consistently far inferior to wan-train. What could be the reason? Can the attention implementation be turned off manually?

tiga-dudu avatar Mar 04 '25 07:03 tiga-dudu

So you mean wan-train produces much higher quality? Can you post the prompt, seed, and a comparison, please?

FurkanGozukara avatar Mar 04 '25 09:03 FurkanGozukara


Hey, the main branch results were not good, so I didn't keep them, but from the file names I gave the generated videos I can recover the key settings. Video size: 512*512, seed: 0, frames: 97, fps: 24. Prompt: 这是一辆华丽的古代炮车,炮身装饰精美,细节繁复。炮车有四个轮子,每个轮子上都有复杂的花纹和装饰。炮车正在旋转,显示出其精致的设计和历史感。背景是鲜艳的绿色,突显了炮车的金色和铜色。 (Translation: This is an ornate ancient cannon carriage, exquisitely decorated with intricate details. The carriage has four wheels, each covered in complex patterns and ornaments. It is rotating, showing off its refined design and sense of history. The background is a vivid green, which highlights the carriage's gold and copper tones.) main branch:

https://github.com/user-attachments/assets/0b27aa2c-1b91-4736-a9e0-4ea5bdd8b96d

wan-train branch:

https://github.com/user-attachments/assets/d5e4369e-ebdd-4bcd-90c4-9aee045e3776

tiga-dudu avatar Mar 04 '25 09:03 tiga-dudu

@tiga-dudu that is an extremely huge difference @Artiprocher

@tiga-dudu can you provide the base image? I would like to try it.

FurkanGozukara avatar Mar 04 '25 09:03 FurkanGozukara


No problem. I only ran inference for 20 steps with the 480P weights, and I did some preprocessing on the input image. The code is as follows; I hope it helps.

import math
import os

import numpy as np
from PIL import Image


def get_edge_background_color(image, border_size=10):
    # Convert the image to a numpy array
    img_array = np.array(image)

    # Extract the four edge strips
    top_edge = img_array[:border_size, :, :]      # top edge
    bottom_edge = img_array[-border_size:, :, :]  # bottom edge
    left_edge = img_array[:, :border_size, :]     # left edge
    right_edge = img_array[:, -border_size:, :]   # right edge

    # Average color of each edge
    top_avg = np.mean(top_edge, axis=(0, 1))
    bottom_avg = np.mean(bottom_edge, axis=(0, 1))
    left_avg = np.mean(left_edge, axis=(0, 1))
    right_avg = np.mean(right_edge, axis=(0, 1))

    # Average color across all four edges
    avg_color = np.mean([top_avg, bottom_avg, left_avg, right_avg], axis=0)

    # Return the average color as a tuple of plain ints (PIL expects an int tuple)
    return tuple(int(c) for c in avg_color)

def pad_to_aspect_ratio(image, target_ratio=(832, 480)):
    img_width, img_height = image.size
    target_w, target_h = target_ratio
    gcd = math.gcd(target_w, target_h)
    aspect_w, aspect_h = target_w // gcd, target_h // gcd  # reduce to the simplest ratio

    # Compute the padded size that matches the target ratio without scaling the original
    scale = max(math.ceil(img_width / aspect_w), math.ceil(img_height / aspect_h))
    new_w, new_h = scale * aspect_w, scale * aspect_h

    background_color = get_edge_background_color(image)
    print(f'background color: {background_color}')
    print(f'resize img size: ({img_width}, {img_height}) -> ({new_w}, {new_h})')

    # Create the padded canvas and paste the original image in the center
    padded_image = Image.new("RGB", (new_w, new_h), background_color)
    paste_x = (new_w - img_width) // 2
    paste_y = (new_h - img_height) // 2
    padded_image.paste(image, (paste_x, paste_y))

    return padded_image

def normal_pad_img(img):
    # If a path is given, load the image
    if isinstance(img, str) and os.path.isfile(img):
        img = Image.open(img).convert("RGB")

    if isinstance(img, Image.Image):
        img_width, img_height = img.size  # width, height
        # clearly landscape (width noticeably larger than height)
        if img_width / img_height >= 1.3:
            width, height = 832, 480
            padded_image = pad_to_aspect_ratio(img, (width, height))
        # clearly portrait (height noticeably larger than width)
        elif img_height / img_width >= 1.3:
            width, height = 480, 832
            padded_image = pad_to_aspect_ratio(img, (width, height))
        else:
            width, height = 512, 512
            padded_image = pad_to_aspect_ratio(img, (width, height))
    else:
        padded_image = None
        width, height = 832, 480

    print(f'video size: ({width}, {height})')
    return padded_image, width, height

image, width, height = normal_pad_img(img_path)  # img_path: path to the input image
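As a quick sanity check of the padding math above, here is a small worked example with an illustrative 600x400 input (the gray test image is mine, not from the original post):

from PIL import Image

# For a 600x400 input and target ratio 832:480 (gcd 32, reduced to 26:15):
#   scale = max(ceil(600/26), ceil(400/15)) = max(24, 27) = 27
#   padded canvas = 27*26 x 27*15 = 702 x 405, original image centered on it.
demo = Image.new("RGB", (600, 400), (128, 128, 128))
padded = pad_to_aspect_ratio(demo, (832, 480))
print(padded.size)  # expected: (702, 405)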

[attached image]

tiga-dudu avatar Mar 04 '25 09:03 tiga-dudu

@tiga-dudu We noticed the differences. We will debug and fix it.

Artiprocher avatar Mar 04 '25 11:03 Artiprocher


Awesome, looking forward to it. Thank you so much!

FurkanGozukara avatar Mar 04 '25 12:03 FurkanGozukara

@tiga-dudu We carefully examined the code modifications during this period, and here are the possible reasons:

  • Custom Attention: You may have installed sage attention, which has been automatically enabled. Please uninstall it using pip uninstall sageattention.
  • VAE Tile: We modified the VAE tile size, but I don't believe this is the main reason for the differences.
  • Tokenizer: In the official version of the code, Chinese commas are automatically replaced with English commas. Our earlier version did not have this replacement, but we later aligned it with the official version.
  • FP8: We found that FP8 quantization occasionally leads to precision overflow issues, so we will modify the default precision in the sample code.

We have checked the code, and if you uninstall sageattention, set tiled=False, ensure that the tokenizer does not include any commas, and use BF16 precision, there will be no differences in the videos generated by both versions.
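For reference, a rough sketch of the checks described above. The sageattention check and the comma normalization are plain Python; the commented pipeline lines and the tiled/torch_dtype keyword names are assumptions about the sample script and may differ between versions:

import importlib.util

# 1. Sage attention: make sure it is not importable (after `pip uninstall sageattention`).
assert importlib.util.find_spec("sageattention") is None, "sageattention is still installed"

# 2. Tokenizer/commas: replace Chinese commas with English ones (what the official code
#    does), or strip commas entirely to rule the tokenizer out as a factor.
def normalize_prompt(prompt: str, strip_commas: bool = False) -> str:
    prompt = prompt.replace("，", ",")
    return prompt.replace(",", "") if strip_commas else prompt

# 3. VAE tiling and precision: per the comment above, disable tiling and use BF16.
#    The keyword names below are assumptions, not confirmed API:
# pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16)
# video = pipe(prompt=normalize_prompt(prompt), tiled=False, ...)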

Artiprocher avatar Mar 04 '25 13:03 Artiprocher


Hey, sorry for the late reply. I checked my environment and I do not have sageattention installed. I remember that when I installed DiffSynth, two other packages were installed along with it, but I forget what they were called. It definitely wasn't sageattention, though, because sageattention requires CUDA 12.4+ and my server only has CUDA 12.1.
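In case it helps others hitting the same issue, a quick environment check for the points above (standard importlib and PyTorch calls, nothing DiffSynth-specific assumed):

import importlib.util
import torch

print("sageattention installed:", importlib.util.find_spec("sageattention") is not None)
print("CUDA version torch was built with:", torch.version.cuda)  # e.g. "12.1"
print("GPU available:", torch.cuda.is_available())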

tiga-dudu avatar Mar 07 '25 09:03 tiga-dudu