nanosvg Rasterizer SSE2 optimization

I tried an SSE2 version of nsvg__scanlineSolid() with the NSVG_PAINT_COLOR code path converted. Benchmarked on my i5 661 @ 3.5GHz, Windows 7 x64, Visual Studio 2015 RC, x86 release target. Rendering Ghostscript_Tiger.svg and measuring nsvgRasterize() time.

Upstream NanoSVG 900x900px: 68ms, 9000x9000px: 4256ms

SSE2 NanoSVG 900x900px: 60ms, 9000x9000px: 3125ms

Broken nsvg__scanlineSolid NanoSVG 900x900px: 44ms, 9000x9000px: 1895ms

Note: in the broken version nsvg__scanlineSolid() does nothing and returns immediately, so the output is just an empty rectangle.

Some improvement, but nothing stellar. I haven't used SSE before, so maybe someone more experienced could do better. Is anyone interested in my quick & dirty patch? The output PNG is binary identical for the upstream and SSE2 versions.

The Streaming SIMD Extensions (/arch:SSE) option was enabled for the whole application. There is a further boost with Streaming SIMD Extensions 2 (/arch:SSE2) enabled, but there are still old computers with (AMD) CPUs that do not support SSE2.
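For readers who have not opened the patch, here is a minimal sketch of how a solid-color blend of this kind can be written with SSE2 intrinsics. It is not the code from the patch. The names div255, blend_scalar, blend_sse2 and blend_solid are made up for illustration, the rounding formula is assumed to match upstream nsvg__div255, and a little-endian RGBA byte layout is assumed. It handles one pixel per iteration and uses only four of the eight 16-bit lanes, so a tuned version would likely process two pixels per register.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Approximates x / 255; assumed to match nanosvg's nsvg__div255. */
static int div255(int x) { return ((x + 1) * 257) >> 16; }

/* Scalar reference: blend a premultiplied solid color over RGBA pixels. */
static void blend_scalar(unsigned char* dst, int count,
                         const unsigned char* cover, unsigned int color)
{
    int cr = color & 0xff, cg = (color >> 8) & 0xff;
    int cb = (color >> 16) & 0xff, ca = (color >> 24) & 0xff;
    for (int i = 0; i < count; i++, dst += 4) {
        int a = div255(cover[i] * ca);
        int ia = 255 - a;
        dst[0] = (unsigned char)(div255(cr * a) + div255(ia * dst[0]));
        dst[1] = (unsigned char)(div255(cg * a) + div255(ia * dst[1]));
        dst[2] = (unsigned char)(div255(cb * a) + div255(ia * dst[2]));
        dst[3] = (unsigned char)(a + div255(ia * dst[3]));
    }
}

#ifdef HAVE_SSE2
/* div255 on eight unsigned 16-bit lanes: the high half of (x + 1) * 257. */
static __m128i div255_epu16(__m128i x)
{
    return _mm_mulhi_epu16(_mm_add_epi16(x, _mm_set1_epi16(1)),
                           _mm_set1_epi16(257));
}

static void blend_sse2(unsigned char* dst, int count,
                       const unsigned char* cover, unsigned int color)
{
    const __m128i zero = _mm_setzero_si128();
    int ca = (color >> 24) & 0xff;
    /* Lanes 0..2 hold r, g, b. Lane 3 holds 255, because
       div255(255 * a) == a, which yields the alpha term for free. */
    __m128i src = _mm_unpacklo_epi8(
        _mm_cvtsi32_si128((int)(color | 0xff000000u)), zero);
    for (int i = 0; i < count; i++, dst += 4) {
        int a = div255(cover[i] * ca);
        int32_t px;
        memcpy(&px, dst, 4);                       /* unaligned-safe load */
        __m128i d = _mm_unpacklo_epi8(_mm_cvtsi32_si128(px), zero);
        __m128i s = div255_epu16(
            _mm_mullo_epi16(src, _mm_set1_epi16((short)a)));
        __m128i t = div255_epu16(
            _mm_mullo_epi16(d, _mm_set1_epi16((short)(255 - a))));
        px = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_add_epi16(s, t), zero));
        memcpy(dst, &px, 4);
    }
}
#endif

/* Uses the SSE2 path when it was compiled in, otherwise the scalar one. */
static void blend_solid(unsigned char* dst, int count,
                        const unsigned char* cover, unsigned int color)
{
#ifdef HAVE_SSE2
    blend_sse2(dst, count, cover, color);
#else
    blend_scalar(dst, count, cover, color);
#endif
}
```

Since every product fits in an unsigned 16-bit lane (255 * 255 = 65025), the vector path uses the same integer rounding as the scalar one and should give byte-identical output.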

Jun 14 '15 10:06 jry2

Nice! Have you checked at a higher level how much time is spent in flattenPath, qsort, and rasterizing the sorted edges? I expect the rasterization to dominate, but I'm just curious. Also, what proportion of nsvg__rasterizeSortedEdges is spent in nsvg__scanlineSolid?

Jun 14 '15 12:06 memononen

Yes, see the attached screenshots from a release (upstream) build, rendering 9000x9000px. nsvg__unpremultiplyAlpha() is another SSE2 candidate.

[Screenshots: tiger_release_profiler1, tiger_release_profiler2, tiger_release_profiler3]

Jun 14 '15 12:06 jry2

Same options, but rendering to a 900x900 target.

[Screenshot: tiger_900_1]

Jun 14 '15 12:06 jry2

You should do it in NEON! What does your patch look like?

Jun 15 '15 05:06 bengarney

I'm working on an x86/x64 project for Windows, so ARM NEON would not help. I will publish my patch.

Jun 15 '15 06:06 jry2

Commit: https://github.com/jry2/nanosvg/commit/20db7eb52c728d3898dc1fa20089a8f28c2d4e60

Jun 15 '15 12:06 jry2

Another benchmark (Ghostscript_Tiger.svg rendered at 9000x9000px), comparing x86 vs x64 performance.

Upstream version x86: 4120ms, x64: 2960ms

SSE2 version x86: 3100ms, x64: 2270ms

Edit: there is something fishy with the x86 / x64 builds. The difference is in nsvg__fillActiveEdges: 861ms for the x86 build vs 70ms for the x64 build. The binary output is different too.

[Screenshot: x86]

[Screenshot: x64]

Edit2: OK, nothing fishy, just another example of SSE optimization. It turned out that nsvg__fillScanline is optimized with SSE instructions in the x64 build but not in the x86 build. I have SSE optimization enabled at the application level in the compiler. This accounts for the ~800ms difference mentioned above.

The different output from the x86 / x64 builds could be related to http://stackoverflow.com/questions/22710272/difference-in-floating-point-arithmetics-between-x86-and-x64. The differences are small, in most cases just about one. I didn't investigate this further.
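To put a number on "small differences", one could load the two rendered images and compare them byte by byte. A minimal sketch, assuming both buffers hold the same number of bytes; max_channel_diff is a hypothetical helper, not part of nanosvg:

```c
#include <stddef.h>

/* Largest absolute per-byte (per-channel) difference between two
   equally sized pixel buffers. Returns 0 when they are identical. */
static int max_channel_diff(const unsigned char* a,
                            const unsigned char* b, size_t n)
{
    int max = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        if (d < 0) d = -d;
        if (d > max) max = d;
    }
    return max;
}
```

For two 9000x9000 RGBA renders, n would be 9000 * 9000 * 4. A result of 1 would be consistent with rounding differences rather than a logic error.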

Jun 15 '15 18:06 jry2

This seems like a nice optimization. Have you considered creating a pull request to get it merged?

Oct 30 '20 12:10 james2432

You should do it with Intel ISPC.

Nov 05 '22 03:11 DsoTsin
