
Speed improvements for discussion

Open DedeHai opened this issue 1 year ago • 35 comments

Added a lot of improvements to core functions, making them slightly faster and saving some flash. I tested all functions to confirm they work the same as before. I tested speed by looking at FPS; it's hard to say how much faster it actually is, but I see some improvement, maybe 1%. The inline of getMappedPixelIndex() is just an idea; I'm not sure it is correct, nor do I know if it improves anything. Did not test on ESP8266.

DedeHai avatar Sep 12 '24 19:09 DedeHai

One more thing I was thinking about: the way I do color handling in the particle system is very efficient: it creates a buffer of the segment size, renders to that, and then passes it to WLED. It would be even more efficient if I dropped rendering 'mirrored' pixels (I may implement that at some point).

So my idea was this: create a buffer (RGB; I don't think white is needed for FX, right?) that is the size of the largest segment and stays in RAM, so there is no fragmentation unless segments are changed a lot (not sure how buffers work currently). Render one segment to that buffer (clearing to zero whatever the segment needs). The buffer is accessed 'in order', so no mapping (except mirror and transpose), and once it is rendered (including blurring and such), transfer the buffer to the 'strip' using mapping. Then clear the buffer, do the next segment, and so on.

This would also allow using 'add pixel color' when a buffer is transferred, making overlays play more nicely, independent of FX supporting it. If a segment FX calls 'get pixel color', it would still have to be fetched from the strip buffer, though. But blurring would be a lot faster and overlapping segments could be 'transparent'. This is only a half-baked idea, as I do not yet fully grasp how the whole 'color chain' works from segment to strip (and back), but the 'mapping' checks could be cut down significantly if they are not done for each setPixelColor() (at least I think so).
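
The render-then-transfer flow described here could be sketched roughly like this. This is a hypothetical illustration only; `segBuffer`, `mapIndex()` and `transferSegment()` are made-up names, not actual WLED API:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch (made-up names, not WLED API): render a segment into a
// flat RGB buffer with no per-pixel mapping checks, then apply the mapping
// exactly once when transferring to the strip.
struct RGB24 { uint8_t r, g, b; };

static RGB24 segBuffer[64];  // flat render buffer, sized for the largest segment
static RGB24 strip[64];      // stands in for the strip/NeoPixelBus buffer

// Render phase: the FX writes straight into the flat buffer, in order.
void setBufferPixel(int i, RGB24 c) { segBuffer[i] = c; }

// Stand-in for the mapping chain (ledmap, mirror, transpose); here: reverse.
int mapIndex(int i, int len) { return len - 1 - i; }

// Transfer phase: mapping (and later brightness) is applied once per pixel.
void transferSegment(int len) {
  for (int i = 0; i < len; i++)
    strip[mapIndex(i, len)] = segBuffer[i];   // 'add pixel color' could go here
  memset(segBuffer, 0, sizeof(RGB24) * len);  // clear for the next segment
}
```

With this split, an FX that writes and blurs N pixels pays for the mapping logic only N times in total on transfer, instead of on every individual setPixelColor() call.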

DedeHai avatar Sep 13 '24 16:09 DedeHai

the buffer is accessed as 'in order' so no mapping (except mirror and transpose) and once it is rendered (including blurring and stuff)

I had the same idea! I expect that using a flat temporary render buffer of the virtual() sizes, then applying all mapping operations as a post-processing step, would yield a substantial average-case performance improvement. That said, if you try to do mirror and transpose during the render, I think you'll give up most of the benefits, as every write still ends up being a pile of variable lookups and conditionals instead of straight local pointer arithmetic; I think it'd be better to perform all mapping operations after the render. We'd also get some advantages with code layout and IRAM caching, as each FX function wouldn't need to make so many external function calls.

The big tradeoff is RAM usage. The catch is that the temporary buffer needs contiguous RAM. If it's held allocated, we lose the ability to share that memory (in highly constrained systems) with other tasks like the web server... but if it's not, we risk not being able to get a big enough buffer when it's needed for a render. There may also be systems where there just isn't enough RAM to hold two copies of every pixel state at once.

That said, I'd still love to see a PR that implements a temporary render buffer and pulls the mapping operation out to a separate pass after the FX function call. I think the performance and code size improvements are likely to be compelling, even if it reduces the maximum LED count that can be supported on memory constrained hardware.

willmmiles avatar Sep 13 '24 17:09 willmmiles

How are buffers currently handled? What is the difference between using the global buffer and no global buffer?

DedeHai avatar Sep 13 '24 22:09 DedeHai

Let's start at the bottom and make our way up.

NeoPixelBus (in 2.7.9) uses 3 buffers: 2 render buffers and 1 transmit buffer. Render buffers are used in "double buffer" fashion. The transmit buffer may be small, but the render buffers are of size LED count * pixel size, where pixel size can be 24 bits up to 64 bits. (2.8.0 or later may change that, as @makuna is working on optimisations.) Different hardware (RMT, I2S, ...) may have transmit buffers of different sizes.

WLED creates its own global buffer of size LED count * 32 bits (if so selected in settings) for its strip. This is due to the fact that WLED uses NeoPixelBusLg, which does luminance adjustments after each SetPixelColor(). This means that GetPixelColor() will not return the original color that SetPixelColor() wrote if brightness is less than 255. If you disable the global buffer, WLED tries to reconstruct the original color, but that is not always possible without loss.
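
As an illustration of why that reconstruction is lossy (simplified stand-ins, not the actual NeoPixelBusLg shader math):

```cpp
#include <cstdint>

// Why reconstructing the pre-brightness color is lossy: integer scaling maps
// several input values onto one output value, so it cannot be inverted
// exactly. Simplified stand-ins, not the actual NeoPixelBusLg math.
uint8_t scaleChannel(uint8_t c, uint8_t bri) {    // what ends up in the bus buffer
  return (uint16_t(c) * bri) >> 8;
}
uint8_t unscaleChannel(uint8_t s, uint8_t bri) {  // attempted reconstruction
  return bri ? (uint16_t(s) << 8) / bri : 0;
}
```

At brightness 128, channel values 100 and 101 both scale to 50, and the reconstruction returns 100 for both; a global buffer sidesteps this by keeping the original value.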

Segment does not create any buffers of its own (though in the past we had setUpLeds() that did that). It will utilize strip's buffering when getPixelColor() is called.

This (lossy reconstruction) is why mirroring, reversing and transposing have to be done in setPixelColor() instead of separately when the FX is done writing (BTW, each pixel is written only once per FX if the FX function is written correctly).

Newer versions of NPB are going to do brightness scaling only during transmission of data to the LEDs, so GetPixelColor() will return the originally set value. Until that happens we need to deal with it ourselves. On top of that, NPB currently uses the same bit depth for transmit and render buffers, but a future version is scheduled to have a transmit buffer of the bit depth required by the LEDs, while render buffers may be of the bit depth required by the user (in our case 32 bits).

And then, there is CCT support.

NOTE: Note the difference between `GetPixelColor()` and `getPixelColor()`. One is from NPB, the other from WLED.

blazoncek avatar Sep 14 '24 06:09 blazoncek

Thank you for detailing this, much clearer now.

I was just looking at the 'soap' FX as an example that uses setPixelColorXY() as well as getPixelColorXY(), but the following points apply to most FX. The reasons I see why this is quite "slow":

  • colors are handled in CRGB; each getPixelColorXY() is converted from 32-bit color, manipulated, and converted back in setPixelColorXY() -> native 32-bit handling would improve that, i.e. ditch FastLED and implement the used functions in CRGBW32; that requires a lot of code refactoring though. The speed improvement is probably not going to be huge, as conversion is fast (but in principle mostly unnecessary).
  • the global buffer is in 'strip space mapping', so each call to set or get is mapped to and from there, including mirror, reverse, grouping etc., plus brightness adjustment in setPixelColorXY() plus the LED map. This is what mostly makes it slow.
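
The round trip in the first point could be illustrated like this (simplified stand-ins for the conversions, not the actual WLED/FastLED definitions):

```cpp
#include <cstdint>

// Simplified stand-ins for the CRGB <-> 32-bit conversions described above,
// not the actual FastLED/WLED definitions.
struct CRGB { uint8_t r, g, b; };

CRGB fromU32(uint32_t c) {                 // getPixelColorXY(): 32-bit -> CRGB
  return CRGB{uint8_t(c >> 16), uint8_t(c >> 8), uint8_t(c)};
}
uint32_t toU32(CRGB c) {                   // setPixelColorXY(): CRGB -> 32-bit
  return (uint32_t(c.r) << 16) | (uint32_t(c.g) << 8) | c.b;
}

// A typical FX inner loop pays both conversions for every pixel it touches,
// and the white byte of the 32-bit color is silently dropped on the way:
uint32_t dimPixel(uint32_t c32) {
  CRGB c = fromU32(c32);                   // unpack
  c.r /= 2; c.g /= 2; c.b /= 2;            // manipulate in CRGB
  return toU32(c);                         // repack (W is gone)
}
```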

A solution could be to replace the global buffer with individual segment buffers that are neither mapped nor brightness adjusted, and do both in one go when a buffer is transferred to the NeoPixelBus buffer (saving a lot of if statements and back-and-forth mapping). The main drawback is that a lot more memory is used if segments overlap. This could be mitigated by setting a memory limit (like it is done with SEGMENT.data) and falling back to not using the buffers if that limit is exceeded (this is how I do it in the ParticleSystem, btw: if I disable local buffers, PS Fire for example drops from ~85 FPS to ~55 FPS).

Would that be a good way to try and solve it or is there something fundamentally flawed with this approach?

DedeHai avatar Sep 14 '24 11:09 DedeHai

* colors are handled in CRGB, each `getPixelColorXY()` is converted from 32bit color, manipulated, converted back in `setPixelColorXY()`   -> native 32bit handling would improve that i.e. ditch fastled and implement the used functions in CRGBW32, requires a lot of code refactoring though. Speed improvement is probably not going to be huge as conversion is fast (but in principle mostly unnecessary).

I am not going to rewrite all of the effects. 😄

* the global buffer is in 'strip space mapping' so each call to set or get is mapped to and from there, including mirror, reverse, grouping etc. plus brightness adjustment in `setPixelColorXY()` plus the LED map. This is what mostly makes it slow.

strip only does mapping via ledmaps, swapping the index if necessary. No real slowness there; at least there was none until someone wanted to exclude ledmaps while realtime data was received. BusDigital::getPixelColor() uses the global buffer instead of querying NeoPixelBus.

A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping). The main draw-back would be a lot more memory is used if segments are overlapping. This could be mitigated by setting a memory limit (like it is done with SEGMENT.data) and if that limit is exceeded, fall back to not using the buffers (this is how I do it in the ParticleSystem, btw: if I disable local buffers, PS Fire for example drops from ~85FPS to ~55FPS).

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer (which presented a new set of issues, though).

blazoncek avatar Sep 14 '24 11:09 blazoncek

I am not going to rewrite all of the effects. 😄

No one is asking you to. I am imagining cloning most of the CRGB struct/methods so CRGBW32 could be used in the exact same way. If done right, it could be a 1:1 replacement, needing an update in all FX, but it should not be too complex (mostly a direct find and replace, maybe even a #define to override it for starters).

strip only does mapping via ledmaps. Swapping index if necessary. No real slowness there, at least there was none until someone wanted to exclude ledmaps while realtime data was received.

true, lookup is fast.

BusDigital::getPixelColor() uses global buffer instead of querying NeoPixelBus.

Yes, the main difference being the color correction, right? If I disable the global buffer, colors get worse but speed stays (almost) the same.

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer

Do you mind to quickly elaborate what problems these were?

DedeHai avatar Sep 14 '24 11:09 DedeHai

Do you mind to quickly elaborate what problems these were?

Overlapping segments.

blazoncek avatar Sep 14 '24 12:09 blazoncek

Do you mind to quickly elaborate what problems these were?

Overlapping segments.

Thought so ;) but glad it is 'only that'. Overlapping segments are a problem, but mostly due to additional RAM usage; this approach solves other problems overlapping segments have in a global buffer, namely that overlapping behavior is FX dependent. When transferring the segments, color adding could be used. It is somewhat slower, though, but if the target buffer is 32 bit it would only affect overlapping segments (as adding to black is just one additional if (c != 0)).
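
The "add on transfer" idea could look roughly like this (a sketch with made-up helpers, not WLED's actual color_add()):

```cpp
#include <cstdint>

// Sketch of additive transfer: when copying a segment buffer to the target
// buffer, only truly overlapping (non-black) pixels pay for a saturating add.
// Made-up helpers, not WLED's actual color_add().
uint8_t qadd8(uint8_t a, uint8_t b) {         // saturating 8-bit add
  uint16_t s = uint16_t(a) + b;
  return s > 255 ? 255 : uint8_t(s);
}

uint32_t addColor(uint32_t target, uint32_t c) {
  if (c == 0) return target;                  // adding black: nothing to do
  if (target == 0) return c;                  // adding to black: plain copy
  return (uint32_t(qadd8(target >> 16, c >> 16)) << 16) |
         (uint32_t(qadd8(target >> 8,  c >> 8))  << 8)  |
          qadd8(target, c);
}
```

The two early-outs mean the per-channel math only runs where segments actually overlap.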

DedeHai avatar Sep 14 '24 13:09 DedeHai

but glad it is 'only that'

It is not. Wait for "blending styles".

blazoncek avatar Sep 14 '24 16:09 blazoncek

Future improvements to buffers aside: should I squash this draft and make a PR? Any changes required before I do that?

DedeHai avatar Sep 15 '24 08:09 DedeHai

Please, and do address points discussed.

blazoncek avatar Sep 15 '24 12:09 blazoncek

@blazoncek one more general question: currently, all color calculations are done with 32-bit colors, even though FX only use 24 bit (CRGB). For color manipulation, CRGB would be faster (scaling, adding, blurring etc.). In what scenarios is the white channel used, and how (and at what point) is its value calculated? Is it all done in 32 bit to make it future-proof, or why was this approach chosen over 'calculate white channel as a last step'? I am working on a proof of concept with a CRGB equivalent in 32 bit (CRGBW), which would ease 32-bit color handling. At least I think it makes handling easier: fewer overloaded functions needed, and the compiler will take care of conversions.

DedeHai avatar Sep 16 '24 07:09 DedeHai

One thing to keep in mind regarding existing effects is that most were ported from FastLED type effects. Some taken directly from FastLED, some from places like soulmate.com.

FastLED operated on its CRGB (or CHSV) and so most effects disregarded the W channel. I guess this was the reason to introduce "Automatic White Calculation" for strips like SK6812 where white channel could be used to reduce power draw.

Any new effect should, if possible, take the white channel into account as well IMO, though most will really not benefit at all. It all comes down to how the "artist" envisions an effect on RGB and RGBW strips.

I do not know if stripping W from calculations is wise or not without prior notice to users.

blazoncek avatar Sep 16 '24 08:09 blazoncek

I do not know if stripping W from calculations is wise or not without prior notice to users.

Agreed. These are just ideas. I don't want to get too deep into this topic, but RGB already contains white, hence my question how and where it is extracted into a separate channel. To me it makes little sense to treat it separately in effects: yes, there may be some artistic use for that on RGBW strips, but it could also be done in RGB. Or how do RGB strips treat the separated white if I set it non-zero in an effect, expecting it to turn out white? Also: with the CRGBW struct (or is it a class?) it may be possible to do both with minimal overhead. If a strip is RGBW, the buffer gets allocated as CRGBW; if there is no separate white channel: RGB. Not sure this is doable in an easy way, but that would get the best of both worlds.
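
One possible shape for such a CRGBW type, as a hypothetical sketch (not WLED's actual definition; the channel order assumes a little-endian target like the ESP32, and anonymous struct members in a union are a widely supported GCC/Clang extension, not standard C++):

```cpp
#include <cstdint>

// Hypothetical CRGBW sketch: one 32-bit word (0xWWRRGGBB) with named channel
// access. Layout assumes little-endian; anonymous structs in unions are a
// common compiler extension (GCC/Clang).
union CRGBW {
  uint32_t color32;                // whole pixel as one word
  struct { uint8_t b, g, r, w; };  // per-channel access
  CRGBW(uint32_t c = 0) : color32(c) {}
  operator uint32_t() const { return color32; }  // drop-in for uint32_t colors
};
```

An RGBW setup would use all four bytes; on an RGB-only setup, w simply stays 0, so the same FX code could serve both.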

DedeHai avatar Sep 16 '24 11:09 DedeHai

I would avoid CRGB. Why? Because WLED uses NeoPixelBus and not FastLED (well, palettes are an exception). NeoPixelBus has its own classes comparable to CRGB, but also extended to include CCT (or WW & CW) information, which does not exist in FastLED. So, to avoid confusion, work with uint32_t if possible for now.

Unfortunately the world is not as simple as that. CRGB is very common and a lot of people know what it represents. On top of that, some people want to control the white channel independently of effects.

blazoncek avatar Sep 16 '24 12:09 blazoncek

Hmm, a complex topic with no 'one size fits all' solution. All I can say: if we want speedy effects, dragging an (unused) white channel along takes a big toll. I will take a look at NeoPixelBus and what it can do, sounds interesting.

DedeHai avatar Sep 16 '24 14:09 DedeHai

I think the biggest problem may be the unnecessary conversions (a lot of them), as there is no internal standard for how to work with colors. On its own, the W channel is not that much of a problem.

blazoncek avatar Sep 16 '24 14:09 blazoncek

Some thoughts on RGB+W (including WW and even WWW) from the NPB perspective: the white output of the chips varies by wavelength and power. Some are cold, some neutral, some warm, and some are even way brighter than the RGB at full brightness. There are requests to support having a white color component (temperature) that will automatically blend through the white colors, but this is a LED-by-LED specific process, and I'm not sure how an effect would do it consistently.

My current thinking on this detail is that the conversion is done right as gamma and luminance are applied. In my library this is part of the "shader" for the NeoPixelBusLg and should be done only once. I have not ratified this into a new color object RGBWT (RGB + White + Temperature) yet, as this still doesn't define what the brightness comparison is between the RGB and the W channel(s).

The nice thing about using the uint32_t (32-bit object) is that it is the word size for the ESP32, so it's optimized. While having white in an effect is interesting, having an alpha channel would allow some improved blending. Just thoughts.

Makuna avatar Sep 16 '24 16:09 Makuna

Thank you @Makuna for your thoughts.

Not sure how an effect would do it consistently.

Most likely it can't. Not in the way currently things are implemented.

as this still doesn't define what the brightness comparison is between the RGB and the W channel(s)

I would recommend you leave initial W channel scaling to the user. Just do L and G as you do for RGB.

having an alpha channel will allow some improved blending.

Now this is indeed a good idea. Some other folks will like it if we move SEGMENT.opacity into the W channel and then create Segment::setPixelColorWithAlpha() or use something like SEGMENT.setPixelColor(i, color_blend(c, SEGMENT.getPixelColor(i), W(c))). Currently SEGMENT.opacity just reduces the brightness of all channels. Unfortunately that means we'll need to come up with something if the user selects manual (or dual) white management, as that separates the RGB and W channels logically (the effect function needs to set both RGB and W information).
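
A minimal sketch of that blend, assuming W(c) is the top byte of the 32-bit color (a simplified stand-in for WLED's color_blend(), not the real implementation):

```cpp
#include <cstdint>

// Sketch of the alpha-in-W idea: blend a new color over what is already in
// the buffer, weighted by the new color's W byte. Simplified stand-in for
// WLED's color_blend(), using W(c) = (c >> 24) as the alpha value.
uint8_t blend8(uint8_t over, uint8_t under, uint8_t alpha) {
  return (uint16_t(over) * alpha + uint16_t(under) * (255 - alpha)) / 255;
}

uint32_t blendOver(uint32_t over, uint32_t under) {
  uint8_t a = over >> 24;                       // W byte reused as alpha
  return (uint32_t(blend8(over >> 16, under >> 16, a)) << 16) |
         (uint32_t(blend8(over >> 8,  under >> 8,  a)) << 8)  |
          blend8(over, under, a);
}
```

With alpha 255 the new color fully replaces the old one, with alpha 0 the existing pixel is kept, and anything in between gives per-pixel translucency between segments.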

blazoncek avatar Sep 17 '24 05:09 blazoncek

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer (which presented a new set of issues, though).

@blazoncek I fully agree. We still have segment level buffers in the MM fork, but it really adds complications. Actually, the new upstream solution with buffering on strip or busses level makes more sense. No matter where the buffer is placed, it can guarantee lossless getPixelColorXY().

From my own experiments, buffering on segment level is fastest, but that's because it avoids the complicated logic when going down the chain from segment -> strip -> busses -> NeoPixelBus etc. Once the licensing issues (MIT vs. GPLv3) are solved between our forks, I can start bringing some optimizations to upstream that were developed by @troyhacks and myself to speed up pixel drawing. In MM, we can achieve 96 FPS (constant, not peak) on a 64x64 pixel fixture (HUB75).

softhack007 avatar Sep 17 '24 16:09 softhack007

overlapping segments is a problem but mostly due to additional RAM usage, it solves other problems overlapping segments have in a global buffer, namely overlapping being FX dependent.

Well, overlapping segments have been a challenge for years. If you have segment-level (instead of global) buffering, effects do not influence each other any more, but you can never be sure which segment will "win" the race for a physical pixel --> flickering and black blocks. The "overlay" option added in some effects solves flickering problems, but it's still a half-hearted thing.

Sometimes - for creativity reasons - you even want a global buffer. It allows for some neat creations like "akemi of hell" (HB effects), where the "waverly" effect is drawn on top of akemi as a background.

I have thought about solutions, but could not yet imagine how to have "real" overlays in WLED. You would need a few things like:

  • segment layers, or at least a notion of priority for effects. So that the overlay effect is always painted as the last one.
  • a clear definition of "background color". Because when drawing overlays, the background color must be considered as transparent. Background is not always "black" but could be any color the user likes to have.
  • ....

softhack007 avatar Sep 17 '24 16:09 softhack007

A thought.

If we remove the W channel from FX functions (i.e. always render all effects in RGB only) and apply the W channel as a last step of painting pixels (if the user has "manual" or "dual" auto white calculation), that could work but may need some tweaking in setPixelColor() or show(). In that case the W part of the color can be used as an alpha channel and will enable effective blending of segments (or layers if you wish). That would of course mean the FX function needs to be aware of this (which it is not ATM), so it may need a rewrite.

That is something I consider worth exploring.

P.S. I know @makuna is working on "unified" buffers where only the transmit buffers are device dependent, but the render buffers are not. That change (IIRC) will also bring lossless GetPixelColor(), which will simplify (and speed up) our getPixelColor(). So working with opacity/alpha may become easy & fast.

blazoncek avatar Sep 17 '24 18:09 blazoncek

@DedeHai I've made a few modifications but cannot push to this PR. Can you check if updates are enabled for collaborators?

blazoncek avatar Sep 23 '24 16:09 blazoncek

enabled now.

DedeHai avatar Sep 24 '24 04:09 DedeHai

Latest changes from @blazoncek have a huge positive impact on FPS; however, the addition of bool unScaled has no measurable benefit and adds to code size. Here are my test results:

Vanilla 0_15: 52 FPS
RAM: [== ] 15.3% (used 50020 bytes from 327680 bytes)
Flash: [========= ] 89.7% (used 1410392 bytes from 1572864 bytes)

My changes including https://github.com/Aircoookie/WLED/pull/4138/commits/202901b09f2a8c6da3a453b0d3ed8c8140ea5557: 54 FPS
RAM: [== ] 15.3% (used 50020 bytes from 327680 bytes)
Flash: [========= ] 89.5% (used 1407852 bytes from 1572864 bytes) -> saves 2540 bytes

Latest addition from blazoncek https://github.com/Aircoookie/WLED/pull/4138/commits/9114867578477e4cf5fa652769ea0690503c5f10: 65 FPS (!)
RAM: [== ] 15.3% (used 50036 bytes from 327680 bytes)
Flash: [========= ] 89.6% (used 1409002 bytes from 1572864 bytes) -> adds 1150 bytes

I have a commit ready removing the bool unScaled again if there are no objections (also including some fixes and improvements)

DedeHai avatar Sep 29 '24 09:09 DedeHai

I have a commit ready removing the bool unScaled again if there are no objections (also including some fixes and improvements)

@DedeHai the "unscaled" optimization only affects Segment::drawCircle and Segment::drawLine with anti-aliased enabled. Did you test with an effect that uses these functions?

softhack007 avatar Sep 29 '24 11:09 softhack007

Yes, I tested it using "Ripple" and "Blobs" FX (4 overlapping segments to intensify the calculation) and saw no FPS change in either case.

DedeHai avatar Sep 29 '24 11:09 DedeHai

Yes, I tested it using "Ripple" and "Blobs" FX (4 overlapping segments to intensify the calculation) and saw no FPS change in either case.

Strange ... actually "unscaled=false" removes color_fade(c, currentBri()) from the inner loop; this should have a noticeable effect. Can you test again with a bigger matrix setup? I would expect some speedup to become visible.

Also keep in mind that our "fps" calc is quite inaccurate, like +- 2 fps of "uncertainty" and almost meaningless above 80fps.

The "Loops/sec: " debug output (WLED_DEBUG) is actually more reliable, because it averages over 30 seconds.

softhack007 avatar Sep 29 '24 11:09 softhack007

I tested various versions, with scaling and without scaling (completely commented out), and saw no real impact. Maybe my optimized scaling function is now so fast that it takes a lot of pixels to actually make a difference (I am sure it will on 1000+). I just discussed with blazoncek committing my changes. A different approach would be to use a static variable instead of passing a parameter (the parameter passing is what actually eats up flash on every call throughout the code).

DedeHai avatar Sep 29 '24 11:09 DedeHai