VapourSynth-BM3D RGB2OPP is SLOW!

Open mysteryx93 opened this issue 2 years ago • 8 comments

I've been running benchmark tests on my script (on a 5K video clip):

BM3D                 parallel       167.75     136.73
Bicubic              parallel        41.23      33.61
RGB2OPP              parallel        31.54      25.70
VAggregate           parallel        31.12      25.36
Degrain3             parallel        30.58      24.93
KNLMeansCL           parreq          27.31      22.26

I wouldn't expect RGB2OPP to be way up there in the list! Above KNLMeansCL and above SMDegrain. Why is it so damn slow?

OPP gives a quality gain, but when the entire script runs at 0.32 fps instead of 0.44 fps just because of the YUV to RGB/OPP conversion, I would rather spend that time on higher analysis settings.

Oct 17 '21 18:10 mysteryx93

RGB2OPP and OPP2RGB were mainly developed so that the plugin can work on its own, and they are not very well optimized. In mvsfunc, mvf.BM3D instead uses mvf.ToYUV(matrix='OPP') and mvf.ToRGB(matrix='OPP') for the conversion, and these call fmtc.matrix to do the job.

Oct 18 '21 07:10 mawen1250

BTW, I have a question about the OPP colorspace: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-BM3D/blob/master/include/Specification.h#L176-L192

It seems it is almost identical to YCgCo (diff: signs and chroma components swapped, different weights for Y), opposing red vs. blue and green vs. magenta, whereas usual opponent colors are more like red vs. green and blue vs. yellow. Other OPP definitions found on the web are consistent with R-G, R+G-2*B. It looks like the G and B components have been swapped in BM3D. I don’t know if it is on purpose or unintentional.

Oct 18 '21 09:10 EleonoreMizo
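
The comparison with YCgCo can be checked numerically. The sketch below is only an illustration: it assumes the OPP rows are O = (R+G+B)/3, P1 = (R-B)/2, P2 = (R-2G+B)/4 (which matches the expressions posted later in this thread) and uses the standard YCgCo rows. The helper names `negate` and `swap_gb` are made up for this example.

```python
from fractions import Fraction as F

# Assumed OPP rows, as coefficients of (R, G, B).
OPP = {
    "O":  (F(1, 3), F(1, 3), F(1, 3)),
    "P1": (F(1, 2), F(0), F(-1, 2)),
    "P2": (F(1, 4), F(-1, 2), F(1, 4)),
}

# Standard YCgCo rows, as coefficients of (R, G, B).
YCGCO = {
    "Y":  (F(1, 4), F(1, 2), F(1, 4)),
    "Cg": (F(-1, 4), F(1, 2), F(-1, 4)),
    "Co": (F(1, 2), F(0), F(-1, 2)),
}

def negate(row):
    """Flip the sign of every coefficient in a row."""
    return tuple(-v for v in row)

def swap_gb(row):
    """Exchange the G and B coefficients of a row."""
    r, g, b = row
    return (r, b, g)

# P1 is Co, and P2 is Cg with the sign flipped; only the luma weights differ.
p1_is_co = OPP["P1"] == YCGCO["Co"]
p2_is_minus_cg = OPP["P2"] == negate(YCGCO["Cg"])
luma_differs = OPP["O"] != YCGCO["Y"]

# Swapping G and B turns the chroma rows into (R-G)/2 and (R+G-2B)/4.
p1_swapped = swap_gb(OPP["P1"])
p2_swapped = swap_gb(OPP["P2"])
print(p1_is_co, p2_is_minus_cg, luma_differs, p1_swapped, p2_swapped)
```

Under these assumptions, swapping G and B in the chroma rows gives axes proportional to R-G and R+G-2*B, which is the more common opponent definition mentioned above.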

OT: There is no binary for r9 ^_^ EDIT: Sorry i should read more carefully!!!

Oct 18 '21 12:10 theChaosCoder

Interestingly enough, using FMTC is even slightly slower than using RGB2OPP. Script doing simple YUV-RGB-OPP conversion back-and-forth with RGB2OPP on 5K clip gives 9fps, and with FMTC, 8fps.

Oct 19 '21 18:10 mysteryx93

Here. These functions are 7-12x faster than the other methods. Thanks to Godway.

import vapoursynth as vs
core = vs.core

# Requires the akarin plugin (core.akarin.Expr).
def RGB_to_OPP (c: vs.VideoNode, fulls: bool = False) -> vs.VideoNode:
    if c.format.color_family != vs.RGB:
        raise TypeError("RGB_to_OPP: Clip is not in RGB format!")

    bd = c.format.bits_per_sample
    R = core.std.ShufflePlanes(c, [0], vs.GRAY)
    G = core.std.ShufflePlanes(c, [1], vs.GRAY)
    B = core.std.ShufflePlanes(c, [2], vs.GRAY)

    b32 = "" if bd == 32 else "range_half +"

    O  = core.akarin.Expr([R, G, B], ex_dlut("x y z + + 0.333333333 *",     bd, fulls))
    P1 = core.akarin.Expr([R,    B], ex_dlut("x y - 0.5 * "+b32,            bd, fulls))
    P2 = core.akarin.Expr([R, G, B], ex_dlut("x z + 0.25 * y 0.5 * - "+b32, bd, fulls))

    return core.std.ShufflePlanes([O, P1, P2], [0, 0, 0], vs.YUV)


def OPP_to_RGB (c: vs.VideoNode, fulls: bool = False) -> vs.VideoNode:
    if c.format.color_family != vs.YUV:
        raise TypeError("OPP_to_RGB: Clip is not in YUV format!")

    bd = c.format.bits_per_sample
    O = core.std.ShufflePlanes(c, [0], vs.GRAY)
    P1 = core.std.ShufflePlanes(c, [1], vs.GRAY)
    P2 = core.std.ShufflePlanes(c, [2], vs.GRAY)

    b32 = "" if bd == 32 else "range_half -"

    R = core.akarin.Expr([O, P1, P2], ex_dlut("x y "+b32+" + z "+b32+" 0.666666666 * +", bd, fulls))
    G = core.akarin.Expr([O,     P2], ex_dlut("x y "+b32+" 1.333333333 * -",             bd, fulls))
    B = core.akarin.Expr([O, P1, P2], ex_dlut("x z "+b32+" 0.666666666 * + y "+b32+" -", bd, fulls))

    return core.std.ShufflePlanes([R, G, B], [0, 0, 0], vs.RGB)

# HBD constants 3D look up table
#
# * YUV and RGB mid-grey is 127.5 (rounded to 128) for PC range levels,
#   this translates to a value of 125.5 in TV range levels. Chroma is always centered, so 128 regardless.
def ex_dlut(expr: str = "", bits: int = 8, fulls: bool = False) -> str:
    bitd = \
        0 if bits == 8 else \
        1 if bits == 10 else \
        2 if bits == 12 else \
        3 if bits == 14 else \
        4 if bits == 16 else \
        5 if bits == 24 else \
        6 if bits == 32 else -1
    if bitd < 0:
        raise ValueError(f"ex_dlut: Unsupported bit depth ({bits})")
    
    #                 8-bit UINT      10-bit UINT          12-bit UINT          14-bit UINT            16-bit UINT         24-bit UINT               32-bit Ufloat
    range_min   = [  (  0.,  0.),    (   0.,   0.   ),    (   0.,   0.   ),    (    0.,    0.   ),    (    0.,    0.),    (       0.,       0.),    (       0.,       0.)   ]   [bitd]
    ymin        = [  ( 16., 16.),    (  64.,  64.   ),    ( 256., 257.   ),    ( 1024., 1028.   ),    ( 4096., 4112.),    ( 1048576., 1052672.),    (  16/255.,  16/255.)   ]   [bitd]
    cmin        = [  ( 16., 16.),    (  64.,  64.   ),    ( 256., 257.   ),    ( 1024., 1028.   ),    ( 4096., 4112.),    ( 1048576., 1052672.),    (  16/255.,  16/255.)   ]   [bitd]
    ygrey       = [  (126.,126.),    ( 502., 504.   ),    (2008.,2016.   ),    ( 8032., 8063.   ),    (32128.,32254.),    ( 8224768., 8256896.),    ( 125.5/255.,125.5/255.)]   [bitd]
    range_half  = [  (128.,128.),    ( 512., 514.   ),    (2048.,2056.   ),    ( 8192., 8224.   ),    (32768.,32896.),    ( 8388608., 8421376.),    ( 128/255., 128/255.)   ]   [bitd]
    yrange      = [  (219.,219.),    ( 876., 879.   ),    (3504.,3517.688),    (14016.,14070.750),    (56064.,56283.),    (14352384.,14408448.),    ( 219/255., 219/255.)   ]   [bitd]
    crange      = [  (224.,224.),    ( 896., 899.500),    (3584.,3598.   ),    (14336.,14392.   ),    (57344.,57568.),    (14680064.,14737408.),    ( 224/255., 224/255.)   ]   [bitd]
    ymax        = [  (235.,235.),    ( 940., 943.672),    (3760.,3774.688),    (15040.,15098.750),    (60160.,60395.),    (15400960.,15461120.),    ( 235/255., 235/255.)   ]   [bitd]
    cmax        = [  (240.,240.),    ( 960., 963.750),    (3840.,3855.   ),    (15360.,15420.   ),    (61440.,61680.),    (15728640.,15790080.),    ( 240/255., 240/255.)   ]   [bitd]
    range_max   = [  (255.,255.),    (1020.,1023.984),    (4080.,4095.938),    (16320.,16383.750),    (65280.,65535.),    (16711680.,16776960.),    (       1.,       1.)   ]   [bitd]
    range_size  = [  (256.,256.),    (1024.,1024.   ),    (4096.,4096.   ),    (16384.,16384.   ),    (65536.,65536.),    (16777216.,16777216.),    (       1.,       1.)   ]   [bitd]

    fs  = 1 if fulls else 0
    expr = expr.replace("ymax ymin - range_max /", str(yrange[fs]/range_max[fs]))
    expr = expr.replace("cmax cmin - range_max /", str(crange[fs]/range_max[fs]))
    expr = expr.replace("cmax ymin - range_max /", str(crange[fs]/range_max[fs]))
    expr = expr.replace("range_max ymax ymin - /", str(range_max[fs]/yrange[fs]))
    expr = expr.replace("range_max cmax cmin - /", str(range_max[fs]/crange[fs]))
    expr = expr.replace("range_max cmax ymin - /", str(range_max[fs]/crange[fs]))
    expr = expr.replace("ymax ymin -",             str(yrange[fs]))
    expr = expr.replace("cmax ymin -",             str(crange[fs]))
    expr = expr.replace("cmax cmin -",             str(crange[fs]))

    expr = expr.replace("ygrey",                   str(ygrey[fs]))
    expr = expr.replace("ymax",                    str(ymax[fs]))
    expr = expr.replace("cmax",                    str(cmax[fs]))
    expr = expr.replace("ymin",                    str(ymin[fs]))
    expr = expr.replace("cmin",                    str(cmin[fs]))
    expr = expr.replace("range_min",               str(range_min[fs]))
    expr = expr.replace("range_half",              str(range_half[fs]))
    expr = expr.replace("range_max",               str(range_max[fs]))
    expr = expr.replace("range_size",              str(range_size[fs]))
    return expr

Oct 22 '21 01:10 mysteryx93
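
For anyone who wants to sanity-check the arithmetic in those expressions without VapourSynth installed, here is a minimal per-pixel sketch in plain Python. It is not the code above: the function names `rgb_to_opp` and `opp_to_rgb` and the `half` parameter are made up for this illustration, exact fractions replace the truncated constants (0.333333333, 0.666666666), and the rounding and clamping that an integer clip would apply are not modelled.

```python
def rgb_to_opp(r, g, b, half=0.0):
    """Forward transform; `half` is the chroma offset (0 for float clips)."""
    o = (r + g + b) / 3
    p1 = (r - b) * 0.5 + half
    p2 = (r + b) * 0.25 - g * 0.5 + half
    return o, p1, p2

def opp_to_rgb(o, p1, p2, half=0.0):
    """Inverse transform; removes the chroma offset before recombining."""
    p1 -= half
    p2 -= half
    r = o + p1 + p2 * 2 / 3
    g = o - p2 * 4 / 3
    b = o + p2 * 2 / 3 - p1
    return r, g, b

# Round-trip a few 16-bit style pixels with the offset from the range_half table.
HALF_16 = 32768.0
pixels = [(0.0, 0.0, 0.0), (65535.0, 0.0, 0.0), (1234.0, 40000.0, 65535.0)]
max_err = 0.0
for px in pixels:
    back = opp_to_rgb(*rgb_to_opp(*px, half=HALF_16), half=HALF_16)
    max_err = max(max_err, max(abs(a - b) for a, b in zip(px, back)))
print(max_err)
```

The round trip is exact up to floating-point error, which shows that the three inverse expressions do undo the three forward ones.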

> Interestingly enough, using FMTC is even slightly slower than using RGB2OPP. Script doing simple YUV-RGB-OPP conversion back-and-forth with RGB2OPP on 5K clip gives 9fps, and with FMTC, 8fps.

That's weird. Did you do the conversion in FP32 precision? I suppose FMTC is better optimized for INT16.

Oct 22 '21 08:10 mawen1250

> ex_dlut

Nice work! I'll try it if I get the time. BTW, if the source is YUV, computing a single matrix that transforms directly between YUV and OPP will further speed it up.

Oct 22 '21 08:10 mawen1250

Yes... but I don't know anyone who knows the math to do it

Oct 22 '21 16:10 mysteryx93
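
The math for the direct YUV to OPP matrix suggested above is a product of two 3x3 matrices: first YUV to RGB, then RGB to OPP. The following is a rough sketch under loud assumptions: BT.709 coefficients (Kr = 0.2126, Kb = 0.0722), full-range normalized values (Y in [0, 1], Cb and Cr centered on 0), and the OPP rows used by the expressions earlier in this thread. Limited-range scaling and integer chroma offsets are not handled, and the helper names are made up for this example.

```python
# Assumed BT.709 luma coefficients; swap in other values for other matrices.
KR, KB = 0.2126, 0.0722
KG = 1.0 - KR - KB

# Rows give R, G, B as combinations of (Y, Cb, Cr).
YUV_TO_RGB = [
    [1.0, 0.0, 2 * (1 - KR)],
    [1.0, -2 * KB * (1 - KB) / KG, -2 * KR * (1 - KR) / KG],
    [1.0, 2 * (1 - KB), 0.0],
]

# Rows give O, P1, P2 as combinations of (R, G, B).
RGB_TO_OPP = [
    [1 / 3, 1 / 3, 1 / 3],
    [1 / 2, 0.0, -1 / 2],
    [1 / 4, -1 / 2, 1 / 4],
]

def matmul(a, b):
    """3x3 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def rgb_to_yuv(r, g, b):
    """Forward BT.709 conversion, used only to build a test pixel."""
    y = KR * r + KG * g + KB * b
    return [y, (b - y) / (2 * (1 - KB)), (r - y) / (2 * (1 - KR))]

# One matrix that goes straight from YUV to OPP.
YUV_TO_OPP = matmul(RGB_TO_OPP, YUV_TO_RGB)

# Check: converting via the combined matrix matches RGB -> OPP directly.
rgb = [0.8, 0.3, 0.1]
direct = apply(RGB_TO_OPP, rgb)
via_yuv = apply(YUV_TO_OPP, rgb_to_yuv(*rgb))
max_diff = max(abs(a - b) for a, b in zip(direct, via_yuv))
print(YUV_TO_OPP, max_diff)
```

The reverse direction (OPP to YUV) would be the inverse of this matrix. A filter that accepts custom coefficients (the thread mentions fmtc.matrix) could in principle take the result, but check that filter's documentation for the expected coefficient layout and range handling before relying on it.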
