Improvements to pinwheel algorithm, should work on any size and should also be faster
- should now work on larger setups, tested at vaious sizes from 8x8 - 32x32 and on 128x16
- Increased fixed point shift to 15bit
- fixed 'pixel holes' on large sizes
- speed should be improved too
- consolidated code to save flash (about -300 bytes)
- reduced 'pixel stepping' to 50% steps to fix pixel holes (increases number of loops but it has no measurable impact on speed)
@Brandon502 please check if this is better or worse
@DedeHai With the latest commit I am see way more holes compared to your first version. 24x24 has holes and large matrices have quite a few. I struggled finding holes on your first version.
Wavesin pattern looks pretty nice with your method of reducing center writes instead of jumping.
Wavesin pattern looks pretty nice with your method of reducing center writes instead of jumping.
how do you get the full preview? peek is always reduced for me.
how do you get the full preview? peek is always reduced for me.
You need to change the "max preview pixels" in the code: https://github.com/Aircoookie/WLED/blob/7deea9eb75e651b19c04217c5fabde3f544c542b/wled00/ws.cpp#L172
well that was easy. I have a few more ideas I want to test, so this will come in handy. I have it working at small sizes but need to test larger ones. Changes should fix the missing pixel issues hopefully on all resolutions while still reducing pixels drawn.
I spent way too much time checking different options ;) Bottom line is:
- radius based algorithms give a better look in the center but add calculation overhead for each step, making it slow
- I was able to get close to the performance of the current version (skipping odd center rays) but tuning it such that it works on all sizes failed so far. there is always a compromise to increase the number of angular steps, slowing it down so in the end it was about 20% slower
- Currently looking at a version that keeps the odd ray skipping, adding some more center skips and calculating steps on the fly (instead of using the "magic table")
So there always will be a trade-off between looks and performance. btw: I wrote a code snippet (thx ChatGPT) to evaluate different algorithms over all sizes, printing out number of pixels drawn, number of zero pixels and number of 'dead loops' (no pixels drawn) for any square size. Current work in progress: optimize for 8x8, 16x16, 32x32 as these are most common and use more generic parameters that work on any other size. If all goes well, there should be up to 20% speed boost. Let me know if you want me to put that test-code here.
If you'd be using Bresenham's algorithm your gaps would've been known in advance. Just a thought.
I don't think Bresenhams algorithm can solve the issue. I briefly looked at it in the beginning and while it is really good at efficiently drawing a line, it does not solve the over-drawing issue, the prevention of which is the actual cause for the gaps: it is easy to just fill the whole matrix, with lines that doe not leave gaps, but that leads to something like a 3.5x overdraw. Current algorithm brings it down to about 2.2x. If my latest Ideas work, it will bring it down to 1.9x.
As we already know, painting a pixel is expensive. Iterating through an array of "painted" pixels is not. Deducting coordinates/angles from unpainted pixels (usually not that many of them) is straightforward. Why paint 1.9x when you can do it only once? The expense would be RAM for an array of "painted" pixels and a few "reverse" calculations of "unpainted" pixels.
if having an array as a "buffer", it would even be easier to pre-calculate a map. This was suggested earlier by @Brandon502 but it eats quite a lot of ram, especially if doing a large matrix, like 32x32 and larger. If ram can be expended, this would by far be the best and fastest solution. If matrix size is restricted to 256x256, a byte array would be sufficient. Edit: thinking about it, a map may also not be so easy to define. maybe need to explain, why it is a bit complicated to not overpaint: the function gets handed a single color value, not knowing anything about the future and the past it needs to convert a single point in a 1D array into a polar coordinate line. so where to start painting is the question that is unsolved. With just a 'stupid' line drawing algorithm for example, the center pixel gets painted many many times (I counted over 80 times at 32x32). Now if you skip painting it, which 1D position actually is priviledged to paint it? and what if that pixel is not set by the FX, will it stay black or does another one 'inherit' ? and how to know all that just by getting passed a color and a 1D position... Edit2: to efficiently translate, it would actually need to be an array containing variable length arrays as for each 1D coordinate, there are a certain number of X/Y coordinates to be painted and that number is not constant. But I am sure, someone already thought of a good solution as mapping to cartesian coordinates is an every-day kind of problem.
Why not just use "mean" value of the surrounding pixels? The visual "error" would be the least. Which is more important than "actual" color rendered at the missing pixel. Getting "mean" value may be expensive (9x call to getPixelColorXY) if no other measures are taken. I am just tossing ideas. I have not tried anything mentioned above.
EDIT: Memory requirements are 4kB for 64x64 matrix if using unpacked bool values. If using packed bits (at the expense of extracting each bit of information) that's only 512 bytes.
using packed bits to keep track of painted pixels had passed my mind at one point. thanks for the ideas, that may get the best of both worlds: live painting with no over-painting with minimal use of ram. I am just asking myself, how to handle such a buffer. It should be dynamic so heap or stack? and how to prevent fragmentation with many segments using this?
4k might be a lot for stack but 512 bytes is viable as it adds no overhead. If you are talking about 128x128 matrix (largest possible ATM without code change) that's still 2k which is somewhat safe to put on stack. But I would not go beyond 2k.
It should be dynamic so heap or stack?
I think it might become tricky to get this buffer into the stack - the stack is cleared when you leave the function - so it might be needed to already allocate the buffer in strip.service() and find a way to pass down the pointer until we arrive in segment::setPixelColor.
The other problem is you might need "alloca". Because a definition like bytes drawn[seg.width * seg.height / 8] will go to the heap (dynamically sized array) while bytes drawn[2048] will go into the stack.
I reworked it with a different idea. Uses Bersenham's algorithm and doesn't use memory. Haven't redid getPixelColor or found the perfect amount of rays per size yet. Can likely optimize quite a bit more since my C++ isn't great.
64x64 using 36 rays. Can see a bit of banding.
64x64 using 72 rays. Seems decent, needs to be testing on physical setup.
My fork is from MM so these numbers are based off of that. But original pinwheel on MM build esp32_4MB_S 64x64 8 pins 512 ws2812b ran at 20 loops/s. 72 rays with new method is at 25 loops/s. Edit: Pixels for comparison is 26 loops/s. Bar is 30 loops/s
Haven't found any holes. Can likely improve this a bit further.
Haven't found any holes. Can likely improve this a bit further.
@Brandon502 The MM version of drawLine has an additional option to skip parts of the line (added for the new GEQ3D effect). Maybe this could be useful to skip inner parts of a line (like the previous odd-rays approach).
Should be easy to port to upstream https://github.com/MoonModules/WLED/blob/98bdbd1eb2e3c6a07a17a7f966cf1ec0cb83804b/wled00/FX_2Dfcn.cpp#L783-L795
Haven't found any holes. Can likely improve this a bit further.
@Brandon502 The MM version of drawLine has an additional option to skip parts of the line (added for the new GEQ3D effect). Maybe this could be useful to skip inner parts of a line (like the previous odd-rays approach).
Should be easy to port to upstream https://github.com/MoonModules/WLED/blob/98bdbd1eb2e3c6a07a17a7f966cf1ec0cb83804b/wled00/FX_2Dfcn.cpp#L783-L795
I took Bresenham's algorithm from drawLine. Right now a single ray does not overwrite any pixels. But each ray slightly overlaps the next. Since rays are much thicker a lot of the effects look much better in my opinion. And getPixelColor is likely much more accurate.
Prev Noise4:
New Noise4:
Wavesins no longer has the moirè pattern though.
BTW & FYI if you'd fork AC and add MM as a remote, you'd have no problem of pulling and pushing into both (provided you have write access) or submitting PR into both without merge issues.
As for the development (and thinking) I am happy that it has turned this way. BTW the banding above is from a "still shot" it may be invisible when moving.
As an optimisation for drawing: if you can determine the length of a ray you can draw shorter rays for those that overlap.
@softhack007 et. al. I will steal this post to urge you to test 0_15 ABL so we can release b6 and soon after RC or 0.15 Then we can start merging these improvements into beta of 0.15.1
I took Bresenham's algorithm from drawLine. Right now a single ray does not overwrite any pixels. But each ray slightly overlaps the next. Since rays are much thicker a lot of the effects look much better in my opinion. And getPixelColor is likely much more accurate.
Looks really good and may likelly to be the better approach - got any code to share? I would like to compare it my current state
I took Bresenham's algorithm from drawLine. Right now a single ray does not overwrite any pixels. But each ray slightly overlaps the next. Since rays are much thicker a lot of the effects look much better in my opinion. And getPixelColor is likely much more accurate.
Looks really good and may likelly to be the better approach - got any code to share? I would like to compare it my current state
I pushed it on this branch. https://github.com/Brandon502/WLED/commit/761e2d7fc98681a54d0d0829b9e055d20c7e9295 still really rough, getPixelColor isn't working correctly on all sizes.
@Brandon502 I tested your 'rough' code and after a few fixes it runs (I had to increase listLen).
There is some room for optimizations but all in all it works pretty well!
What I like about it is the "filling between lines" which really lets you dial in the number of angular steps without adding any holes. Also it avoids the "odd ray skipping" that is required in my updated version to gain speed. While your raw version is a bit slower than my optimized one, it looks much better on some effects (as your screenshots show). I think the Bresenham algorithmi with gap filling is the way to go. It makes tweaking much easier: my verison has 4 or 5 parameters to dial in.
Also it allows very easy optimization for ESP8266 by just lowering the angle steps for example.
I will analyze your code in more detail and integrate it into my updated code and then add some optimizations and push to this PR for testing.
Edit:
Theser are the numbers I get for "overdraw":
Pixels drawn with 32x32 matrix (1024 pixels):
Vanilla 0.15: 2344
My version (Optimized for 32x32 so best case): 1804
Bresenham, 60 rays: 2054
I am sorry I never bothered with Pinwheel expansion but I see floating point math now that I took a peek (at @Brandon502 implementation).
When drawing lines, you always know the starting and ending pixels (as you are drawing from center pixel to the edge pixel). So the maximum number of lines is known in advance (2xW + 2xH). It is true that this may produce uneven angle between them but IMO that is a negligible error that may not greatly affect the output. Another optimisation is, that, since all lines start at the same point in center, every other line can be half the length (as does the current pinwheel expansion in 0_15). You could further reduce the length on very large matrices IMO.
Just a thought.
you are right about all these points. the floating points will not influence the speed (and also: I will replace most of them with fixed point). The "reducing every second line" is something that leads to ugly artefacts if effects draw every second pixel, as many do. Not reducing the rays is one of the advantages. I will look into tracking drawn pixels as previously discussed but I am still looking for a highly efficient way to map the x/y coordinates to a bit field: if those calculateions take more than a few CPU cycles it may not speed up the process at all. i.e. it must be way faster than doing a setPixelColor() call. So doing something like:
#define BIT_INDEX(x, y, width) ((y) * (width) + (x)) / 32 #define BIT_MASK(x, y, width) 1U << (((y) * (width) + (x)) % 32) (not really good code, just as a quick example of what I mean)
may already be too many computations... I would like something that is quicker for line drawing, where you can start at a base value (x0/y0) calculate the index and the mask and then just do additions and mask-shifts as x and y progress (not sure it is feasible).
@DedeHai Do you have your newest version uploaded anywhere? Would like to see the code and test the look/performance. Weird that lineLen needed changing, what size broke it? I originally had extra space, but it never was used so got rid of it and it seemed fine.
Last night I added center jumping and slightly change the angle of consecutive rays which gives a decent boost, but made gPC much worse. https://github.com/Brandon502/WLED/commit/2c7fe518f88eabd60adc34f696a93cc248d0d97f. @blazoncek the jump wasn't able to be half length but somewhere between 1/3 - 1/4, maybe that can be improved. Here's draw counts for 64x64. 4096 pixels. 72 rays. I can get around 28 loops/s now.
| Description | count |
|---|---|
| Current MM pinwheel | 10948 |
| New pinwheel | 6417 |
| With jump | 5949 |
| Reduce overlap with jump | 5252 |
I tested a very small amount of rays just to see speed. Turns out it is much worse which I don't understand. The draw count shot up for some reason when I expected way fewer writes.
So gPC I think is impossible to get perfect on rectangle setups. The short side is so squished that odd and even rays constantly overwrite which breaks the effects that rely on it like DJ Light/Freqmatrix. The current pinwheel is also not working with these on certain rectangles. @softhack007 MM block and circle also seem broken. Would it be possible to just store all 72 color values somewhere and have gPC retrieve there? <300 bytes to store, but each segment would require its' own. Ideally it would only need space if gPC is being called on an effect.
Quickly tested the writes for 32x32. 60 rays - 1751 writes 45 rays - 1561 writes - 99 loops/s using 256 leds/pin
Do you have your newest version uploaded anywhere?
Doh, I commited it to this PR but forgot to push. Will do it within the hour.
Weird that lineLen needed changing, what size broke it?
I am testing on S3, 32x32 and it took me quite a while to figure out, what the problem was. increasing the size seemed to fix it, I did not yet check if it was something else I might have changed at the same time.
the jump wasn't able to be half length but somewhere between 1/3 - 1/4
I tested adaptive jumping, depending on size and position, something like jump = i % maxXY/8 which removes some overdraw but not as much as a 1/3 every other ray.
I tested a very small amount of rays just to see speed. Turns out it is much worse which I don't understand. The draw count shot up for some reason when I expected way fewer writes.
I think that is still a bug in your code somewhere, initially saw that too, it dropped to 7FPS
The current pinwheel is also not working with these on certain rectangles. @softhack007 MM block and circle also seem broken.
@Brandon502 yes MM block and MM circle are somewhat special, because they support "virtual strip" mode so it can get complicated to find a good reference pixel for gPC. I'll need to discuss with @ewoudwijma who created the features initially. Thanks for the hint.
Would it be possible to just store all 72 color values somewhere and have gPC retrieve there?
Yes I think that would be the right approach - an extra buffer that's aligned with the pinwheel "length" will remove all these problems with gPC. We could also reduce program size a bit because we won't need to reproduce the drawing algo in gPC any more.
3 * length bytes does not seem too much on esp32, not even with the HUB75 driver that's extremely memory hungry. I can try something this weekend - should be enough to add an extra CRGB* to the segment class, and allocate the buffer at the first sPC with pinwheel mapping. Segment copy/move constructors might be a challenge.
Not sure about 8266 - but 8266 is not a good idea for large leds count any way (24x24 = 576 pixels), so buffering should be possible there, too.
I have working code using a 1bit map to check if a pixel has been drawn. it is fast!. I have it working on a single line draw only and it does not look too good unfortunately. But I almost get full FPS (compared to other mappings) even when drawing hundreds of lines. Need to adapt it back to the two-line-fill-between approach, I hope that is possible without too much modification. If I get it working, I think this will be the way to go.
I have working code using a 1bit map to check if a pixel has been drawn. it is fast!. I have it working on a single line draw only and it does not look too good unfortunately. But I almost get full FPS (compared to other mappings) even when drawing hundreds of lines. Need to adapt it back to the two-line-fill-between approach, I hope that is possible without too much modification. If I get it working, I think this will be the way to go.
I quickly added a bit field into mine and just used the get and set bit functions I used in Game of Life in MM, so it probably isn't the fastest method. But now it draws exactly 4096 on 64x64 with no loops/s improvement Edit: maybe +1 loop/s.