GPipe-Core

Possible performance issue (Kleisli laziness?) with ToBuffer Arrows

Open bavis-m opened this issue 7 years ago • 2 comments

I have a pretty simple 2D sprite-engine-type test program using GPipe that just allocates a vertex Buffer and writes about 30,000 vertices to it (10,000 triangles, arranged as 5,000 quads), with each vertex having a 2D position, a color, UV coordinates, and a 32-bit integer depth value (normalized). Every frame I write my sprites to this buffer as a set of triangles. The triangle vertices use a custom buffer format so I have control over the ToBuffer instance, but on the host they are represented just as a 4-tuple of the elements listed above.

-- 2D position, color, UV, depth
newtype VertBufferFormat = VertBufferFormat { unVertBufferFormat :: (B2 Float, B4 Float, B2 Float, Normalized (B Word32)) }

instance BufferFormat VertBufferFormat where
  type HostFormat VertBufferFormat = (V2 Float, V4 Float, V2 Float, Word32)
  {-# INLINE toBuffer #-}
  toBuffer = arr (\ ~(a,b,c,d) -> (a, (b,c,d))) >>>
             first toBuffer >>>
             arr (\(a', (b,c,d)) -> (b, (a',c,d))) >>>
             first toBuffer >>>
             arr (\(b', (a',c,d)) -> (c, (a',b',d))) >>>
             first toBuffer >>>
             arr (\(c', (a',b',d)) -> (d, (a',b',c'))) >>>
             first toBuffer >>>
             arr (\(d', (a',b',c')) -> VertBufferFormat (a',b',c',d'))
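
As a sanity check on the tuple plumbing above, here is the same arr/first shuffle specialized to the plain function Arrow (with `first id` standing in for the real `first toBuffer` steps, which is my own simplification, not GPipe code). The permutation should come out as the identity:

```haskell
import Control.Arrow

-- Same peel-one-component-at-a-time pattern as the ToBuffer instance,
-- but on plain functions, so the shuffling itself can be tested.
shuffle :: (Int, Int, Int, Int) -> (Int, Int, Int, Int)
shuffle = arr (\ ~(a,b,c,d) -> (a,(b,c,d)))
      >>> first id
      >>> arr (\(a',(b,c,d)) -> (b,(a',c,d)))
      >>> first id
      >>> arr (\(b',(a',c,d)) -> (c,(a',b',d)))
      >>> first id
      >>> arr (\(c',(a',b',d)) -> (d,(a',b',c')))
      >>> first id
      >>> arr (\(d',(a',b',c')) -> (a',b',c',d'))

main :: IO ()
main = print (shuffle (1, 2, 3, 4))  -- (1,2,3,4)
```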

The main loop in my testing is super simple: check that the buffer is large enough for the triangles I'm rendering and resize if necessary (for these tests I use a static triangle list and a large-enough pre-allocated buffer, so that code only runs on the first frame), write the vertices out with writeBuffer, and render them to the screen with a simple 2D, unlit, textured shader.

While profiling this program, I find that 91% of my program time is in my toBuffer instance (called from makeBuffer.writer), with 87% of that time spent in (.) for the ToBuffer Category instance. As I understand it, that (.) just wraps three instances of the Kleisli (.) for the three Arrows in the ToBuffer datatype, which seem to be (as best as I can follow) simple computations for 1) computing size and padding for buffer elements, 2) getting a b-typed element out of a Buffer os b list from a given index and stride, and 3) poking buffer elements to memory. The (.) function for ToBuffer accounts for almost 50% "individual" time in my profile trace, with the rest of the "inherited" time taken up mostly by the toBufferBUnaligned.writer function (which makes sense, as that seems to be the base function that actually pokes the buffer Ptr) and the ToBuffer first instance (which also kind of makes sense, as there is some actual computation done in the Kleisli first function).

I've already improved performance by writing my own ToBuffer instance for my custom format type (I got improvements both by avoiding proc notation and by avoiding the recursively defined tuple ToBuffer instances), but it still seems like there is too much overhead for simply writing out these buffers. The confusing bit is that I'm not sure whether the problem is laziness in building up the lambdas in the Kleisli (.). I imagine each arr application in the custom ToBuffer instance above adds at least one additional layer, and the base toBufferBUnaligned returns Kleisli arrows composing several computations. I can't imagine that (.) is actually taking much time on its own, so I can only assume it's forcing a huge lazy thunk generated somewhere else (which is why I thought of the Kleisli compositions here). The list of triangles I am rendering is a simple static list that is the same from frame to frame, so I doubt any computation is being forced there.
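
To make the layering concrete, here is a toy sketch (my own, not GPipe's actual types) of what a long chain of composed Kleisli arrows looks like: every (>>>) adds another closure around the underlying monadic action, and nothing in the chain runs until the whole composition is finally executed:

```haskell
import Control.Arrow (Kleisli (..), (>>>))
import Control.Monad.Trans.State.Strict (State, execState, modify')

-- One tiny arrow step over a State counter.
step :: Kleisli (State Int) () ()
step = Kleisli (\_ -> modify' (+1))

-- 10,000 composed steps: each (>>>) wraps the previous composition in
-- another closure, analogous to the lambda layers built up by arr/(.).
chain :: Kleisli (State Int) () ()
chain = foldr (>>>) (Kleisli return) (replicate 10000 step)

main :: IO ()
main = print (execState (runKleisli chain ()) 0)  -- 10000
```

In GHC's profiler this kind of closure tower tends to charge its time to (.) itself, because that is where the composed action is finally forced, which matches the trace described above.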

Another slightly confusing thing is that the (.) function has an "entries" count of 0 in my profiler trace; maybe this is to do with inlining? I couldn't find a satisfying explanation for that after a bunch of googling. The individual writer instances have the correct number of entries if I multiply through the number of buffer items, buffer size, triangles rendered and frames. (I'll paste the relevant profiler trace sections here.)
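
One way to test the inlining hypothesis is to pin an explicit cost centre on the composition, so it keeps its own entry count in the .prof output even when GHC inlines the surrounding code (`composedWriter` is a made-up name here, just for illustration):

```haskell
-- An SCC pragma attaches a named cost centre to an expression; with
-- profiling enabled, entries charged to it show up under this name.
composedWriter :: Int -> Int
composedWriter = {-# SCC "composedWriter" #-} ((+1) . (*2))

main :: IO ()
main = print (composedWriter 20)  -- 41
```

Automatically inserted cost centres can indeed report 0 entries when the function is inlined away and its work is charged to callers, so a manual SCC is a cheap way to check.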

I was hoping you could let me know if I'm on the right track here, especially with regard to the GPipe internals. It seems like way too much overhead just for copying these elements out to the buffer (when all the ToBuffer Arrows internally seem to be tracking is the current offset/stride). I would hope that doing this in a raw loop in Haskell would have much better performance, but I'm not sure where to start looking for performance improvements, or which places to start inlining (a lot of the Category/Arrow instances in the Haskell libraries don't seem to be marked INLINE or INLINABLE). Let me know if you want me to push the whole project to GitHub. Hopefully this will just turn out to be something stupid I've done elsewhere!
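
For reference, the raw-loop baseline I have in mind looks roughly like this (Vertex, vertexSize and pokeVertex are my own hypothetical names, not GPipe API; it assumes a tightly packed layout with no padding):

```haskell
import Data.Word (Word32)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

data Vertex = Vertex
    !Float !Float                 -- 2D position
    !Float !Float !Float !Float   -- RGBA color
    !Float !Float                 -- UV
    !Word32                       -- normalized depth

-- Eight Floats plus one Word32, assuming a tightly packed layout.
vertexSize :: Int
vertexSize = 8 * 4 + 4

-- Poke one vertex at index i, tracking the byte offset by hand.
pokeVertex :: Ptr () -> Int -> Vertex -> IO ()
pokeVertex p i (Vertex x y r g b a u v d) = do
    let o = i * vertexSize
    pokeByteOff p (o + 0)  x
    pokeByteOff p (o + 4)  y
    pokeByteOff p (o + 8)  r
    pokeByteOff p (o + 12) g
    pokeByteOff p (o + 16) b
    pokeByteOff p (o + 20) a
    pokeByteOff p (o + 24) u
    pokeByteOff p (o + 28) v
    pokeByteOff p (o + 32) d

main :: IO ()
main = allocaBytes vertexSize $ \p -> do
    pokeVertex p 0 (Vertex 1 2  0.5 0.5 0.5 1  0 1  42)
    d <- peekByteOff p 32 :: IO Word32
    print d  -- 42
```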

bavis-m avatar Nov 23 '17 19:11 bavis-m

Hi,

Great investigation, it seems you have grasped quite a lot of it! To be honest, I haven't had a test that pushes much data into buffers each frame, and thus haven't optimized this. I think it has to do with too much strictness in the ToBuffer type (defined in Graphics.GPipe.Internal.Buffer). The idea was to have only one part of that type (the writer) called for each element of a buffer, while the others are only used once per buffer. But I think my overuse of strictness annotations has made all Kleisli arrows in that type be evaluated.

Could you try removing the ! from all members of that type except the 3rd and try again? Or from the 3rd as well?
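
A hypothetical sketch of the shape of the experiment (NOT the real GPipe definition; SizeM, IndexM and WriterM are placeholders): strict fields force all three Kleisli arrows whenever a ToBuffer value is forced, so dropping the bangs on the per-buffer arrows would leave only the hot per-element writer strict:

```haskell
import Control.Arrow (Kleisli)
import Control.Monad.Trans.State.Strict (StateT)

type SizeM   = StateT Int IO   -- placeholder: size/padding pass
type IndexM  = StateT Int IO   -- placeholder: element-lookup pass
type WriterM = StateT Int IO   -- placeholder: per-element poke pass

data ToBuffer a b = ToBuffer
    (Kleisli SizeM a b)       -- was !(...): run once per buffer
    (Kleisli IndexM a b)      -- was !(...): run once per buffer
    !(Kleisli WriterM a b)    -- run per element; try with and without !
```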

tobbebex avatar Dec 06 '17 22:12 tobbebex

Removing the strictness annotations does nothing. I haven't looked into exactly what is slow there, but just looking at the code, it seems like way too much abstraction for writing bytes into a buffer. I'll take a closer look at some point.

pippijn avatar Apr 04 '21 00:04 pippijn