nanosvg
Rasterizer SSE2 optimization
I tried an SSE2 version of nsvg__scanlineSolid() with the NSVG_PAINT_COLOR code path converted. Benchmarked on my i5 661 @ 3.5GHz, Windows 7 x64, Visual Studio 2015 RC, x86 release target, rendering Ghostscript_Tiger.svg and measuring nsvgRasterize() time.
Upstream NanoSVG: 900x900px: 68ms, 9000x9000px: 4256ms
SSE2 NanoSVG: 900x900px: 60ms, 9000x9000px: 3125ms
Broken nsvg__scanlineSolid NanoSVG: 900x900px: 44ms, 9000x9000px: 1895ms
Note: the "broken" version does nothing in nsvg__scanlineSolid(), it just returns. The output is an empty rectangle, so its timings show how much time is spent outside that function.
Some improvement, but nothing stellar. I haven't used SSE before, so maybe someone more experienced could do better. Is anyone interested in my quick & dirty patch? The output PNG is binary identical for the upstream and SSE2 versions.
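For readers who have not seen this kind of conversion, here is a minimal sketch in C of blending a solid colour over RGBA pixels with SSE2, two pixels per iteration. It is not the patch discussed in this thread: the function names (blend_solid_scalar, blend_solid_sse2, blend_outputs_match), the RGBA byte layout, and the div255 rounding are assumptions made for the example, loosely modeled on how a solid-colour scanline blend might be written. The sketch falls back to the scalar loop when the compiler does not target SSE2.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> /* SSE2 intrinsics */
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Approximate x / 255 for x in [0, 255*255] (assumed rounding rule). */
static unsigned int div255(unsigned int x) { return ((x + 1) * 257) >> 16; }

/* Scalar reference: blend one colour (r | g<<8 | b<<16 | a<<24) over
   premultiplied RGBA pixels, weighted by an 8-bit coverage value per pixel. */
static void blend_solid_scalar(unsigned char *dst, const unsigned char *cover,
                               size_t count, unsigned int color)
{
    unsigned int c[4], ca = (color >> 24) & 0xff;
    size_t i;
    int k;
    c[0] = color & 0xff;
    c[1] = (color >> 8) & 0xff;
    c[2] = (color >> 16) & 0xff;
    c[3] = 255; /* div255(255 * a) == a, so channel 3 receives the alpha itself */
    for (i = 0; i < count; i++, dst += 4) {
        unsigned int a = div255(cover[i] * ca), ia = 255 - a;
        for (k = 0; k < 4; k++)
            dst[k] = (unsigned char)(div255(c[k] * a) + div255(ia * dst[k]));
    }
}

#if SKETCH_HAVE_SSE2
/* Same div255 on eight unsigned 16-bit lanes: high half of (x + 1) * 257. */
static __m128i div255_epu16(__m128i x)
{
    return _mm_mulhi_epu16(_mm_add_epi16(x, _mm_set1_epi16(1)), _mm_set1_epi16(257));
}
#endif

/* SSE2 variant: two pixels (eight channels) per iteration, scalar tail. */
static void blend_solid_sse2(unsigned char *dst, const unsigned char *cover,
                             size_t count, unsigned int color)
{
    size_t i = 0;
#if SKETCH_HAVE_SSE2
    const short r = (short)(color & 0xff);
    const short g = (short)((color >> 8) & 0xff);
    const short b = (short)((color >> 16) & 0xff);
    const __m128i col = _mm_set_epi16(255, b, g, r, 255, b, g, r);
    const __m128i ca = _mm_set1_epi16((short)((color >> 24) & 0xff));
    const __m128i zero = _mm_setzero_si128();
    const __m128i c255 = _mm_set1_epi16(255);
    for (; i + 2 <= count; i += 2) {
        const short c0 = cover[i], c1 = cover[i + 1];
        /* Coverage of pixel 0 in the low four lanes, pixel 1 in the high four. */
        __m128i cov = _mm_set_epi16(c1, c1, c1, c1, c0, c0, c0, c0);
        __m128i a = div255_epu16(_mm_mullo_epi16(cov, ca));
        __m128i ia = _mm_sub_epi16(c255, a);
        /* Load 8 bytes (two pixels) and widen each channel to 16 bits. */
        __m128i d = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(dst + 4 * i)), zero);
        __m128i out = _mm_add_epi16(div255_epu16(_mm_mullo_epi16(col, a)),
                                    div255_epu16(_mm_mullo_epi16(ia, d)));
        /* Narrow back to bytes and store the low 8 bytes. */
        _mm_storel_epi64((__m128i *)(dst + 4 * i), _mm_packus_epi16(out, out));
    }
#endif
    blend_solid_scalar(dst + 4 * i, cover + i, count - i, color);
}

/* Returns 1 if both versions produce identical bytes for a fixed pattern. */
static int blend_outputs_match(void)
{
    unsigned char a[4 * 37], b[4 * 37], cover[37];
    size_t i;
    for (i = 0; i < sizeof a; i++) a[i] = b[i] = (unsigned char)(i * 73 + 11);
    for (i = 0; i < sizeof cover; i++) cover[i] = (unsigned char)(i * 7);
    blend_solid_scalar(a, cover, 37, 0xC03264FFu);
    blend_solid_sse2(b, cover, 37, 0xC03264FFu);
    return memcmp(a, b, sizeof a) == 0;
}
```

All products stay below 65536, so 16-bit lanes are enough and the 32-bit multiply that SSE2 lacks is not needed. Because the vector div255 uses the same formula as the scalar one, the two versions should agree byte for byte, which is the property blend_outputs_match() checks.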
The Streaming SIMD Extensions (/arch:SSE) option was enabled for the whole application. There is a further boost with Streaming SIMD Extensions 2 (/arch:SSE2) enabled, but there are still old computers with (AMD) CPUs that do not support SSE2.
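One way to keep old CPUs working while still using SSE2 where available is to check for it at run time and pick an implementation through a function pointer. The sketch below is only an illustration under assumptions: cpu_has_sse2, fill_scalar, fill_sse2_placeholder and select_fill are hypothetical names, and the "SSE2" routine is a stand-in that simply calls the scalar one. It uses the MSVC __cpuid intrinsic (CPUID leaf 1, EDX bit 26) or the GCC/Clang builtin __builtin_cpu_supports.

```c
#include <stddef.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
static int cpu_has_sse2(void)
{
    int info[4];
    __cpuid(info, 1);           /* leaf 1: processor feature bits */
    return (info[3] >> 26) & 1; /* EDX bit 26 signals SSE2 */
}
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
static int cpu_has_sse2(void)
{
    return __builtin_cpu_supports("sse2") ? 1 : 0;
}
#else
static int cpu_has_sse2(void) { return 0; } /* not x86: no SSE2 */
#endif

typedef void (*fill_fn)(unsigned char *dst, size_t n, unsigned char v);

/* Portable fallback. */
static void fill_scalar(unsigned char *dst, size_t n, unsigned char v)
{
    size_t i;
    for (i = 0; i < n; i++) dst[i] = v;
}

/* Placeholder for a routine written with SSE2 intrinsics (hypothetical). */
static void fill_sse2_placeholder(unsigned char *dst, size_t n, unsigned char v)
{
    fill_scalar(dst, n, v);
}

/* Choose the implementation once, based on what the CPU reports. */
static fill_fn select_fill(void)
{
    return cpu_has_sse2() ? fill_sse2_placeholder : fill_scalar;
}
```

For this to actually help, the rest of the application would have to be built without /arch:SSE2, otherwise the compiler may emit SSE2 instructions outside the dispatched routine.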
Nice! Have you checked at a higher level how much time is spent in flattenPath, qsort, and rasterizing the sorted edges? I expect the rasterization to dominate, but I'm just curious. Also, what proportion of nsvg__rasterizeSortedEdges is spent in nsvg__scanlineSolid?
Yes, see attached screenshots from the release (upstream) build. Rendering 9000x9000px.
nsvg__unpremultiplyAlpha() is another SSE2 candidate.
Same options but rendering to 900x900 target.
You should do it in NEON! What does your patch look like?
I'm working on an x86/x64 project for Windows, so ARM NEON would not help. I will publish my patch.
Commit: https://github.com/jry2/nanosvg/commit/20db7eb52c728d3898dc1fa20089a8f28c2d4e60
Another benchmark (Ghostscript_Tiger.svg rendered at 9000x9000px), comparing x86 and x64 performance.
Upstream version x86: 4120ms, x64: 2960ms
SSE2 version x86: 3100ms, x64: 2270ms
Edit: there is something fishy with the x86 / x64 builds. The difference is in nsvg__fillActiveEdges: 861ms for the x86 build vs 70ms for the x64 build. The binary output is different too.
(Screenshots: x86, x64)
Edit2: OK, nothing fishy, just another example of SSE optimization. It turned out that in the x64 build nsvg__fillScanline is optimized with SSE instructions, while in the x86 build it is not. I have SSE optimization enabled at the app level in the compiler. The difference is the ~800ms mentioned above.
The different output from the x86 / x64 builds could be related to http://stackoverflow.com/questions/22710272/difference-in-floating-point-arithmetics-between-x86-and-x64. The differences are small, in most cases a value is off by about one. I didn't investigate this further.
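A general reason two builds of the same float code can disagree is that one of them keeps intermediate results at a wider precision than float and rounds later. Whether that is what happens in this case was not investigated, so the sketch below only illustrates the effect in isolation. The function names are made up for the example; FLT_EVAL_METHOD is the standard C99 macro from float.h that reports how the compiler evaluates floating-point expressions, and -1 is returned here if the macro is not defined.

```c
#include <float.h>

/* Sum n copies of x, rounding to float after every addition. */
static float sum_rounded_each_step(float x, int n)
{
    volatile float acc = 0.0f; /* volatile forces a float store per step */
    int i;
    for (i = 0; i < n; i++) acc = acc + x;
    return acc;
}

/* Same sum with a wider accumulator, rounded to float only at the end. */
static float sum_rounded_at_end(float x, int n)
{
    double acc = 0.0;
    int i;
    for (i = 0; i < n; i++) acc += x;
    return (float)acc;
}

/* 0: evaluate in the operand type, 1: at least double, 2: long double,
   negative: indeterminable or macro not available. */
static int float_eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD;
#else
    return -1;
#endif
}
```

Calling both sum functions with 0.1f and a large n gives two different floats, even though the source-level arithmetic is the same. Off-by-one differences in 8-bit pixel values are a plausible symptom of this kind of rounding difference.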
This seems like a nice optimization. Have you considered creating a pull request to get this merged?
You should do it with Intel ISPC.