WLED
WLED copied to clipboard
DDP receiving strip effect spontaneous effect problems (animation lagged and choppy)
What happened?
I am using two ESP32s with 0.13.0-b6. My wifi connection is good, I am using external antennas and have good RSSI.
Strip 1 has ~550 LEDs Strip 2 is connected via DDP and has ~200 LEDs
Effects run fine when I first select them, but then after some time (this is not consistent) they become choppy and lagged on the second DDP strip. Stopping the animation and restarting it has no effect, the 2nd controller must be rebooted -- then it immediately begins working fine again. Eventually the effect will degrade again and exhibit the same choppiness.
To Reproduce Bug
Set up a similar setup to the one described above, and wait some time while an effect is active.
I noticed that when I scroll through the list of effects and change them somewhat quickly (to see what they look like) it seems to speed up the onset of this behaviour.
Expected Behavior
Effects should work properly on both strips.
Install Method
Binary from WLED.me
What version of WLED?
WLED 0.13.0-b6
Which microcontroller/board are you seeing the problem on?
ESP32
Relevant log/trace output
No response
Anything else?
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
What is the time-frame for choppiness? I am running this on Ethernet enabled ESP32 with 1000 client (network) pixels on 3 ESPs (ESP01, Wemos D1 mini, ESP32) and not seeing any choppiness or stutter or other issues.
@blark, I had the same issue when using WiFi, and ultimately resolved it by switching to Ethernet connected instead.
Like you, I know my WiFi is good (enterprise equipment, 10g lan, isolated RF, etc..), but couldn't figure out why there were occasional glitches or freezing. Eventually I setup Wireshark, and confirmed occasional delays between packets which aligned to the display issues seen. Tracing this back to the access point, I confirmed the delay was actually coming from the esp32. No connection or packet errors - it just wasn't sending data every once and a while.
After enabling debug in WLED, I noticed errors thrown by WifiUDP code in Arduino-ESP32. Although it recovers, every time the error is printed this correlated to packet loss (omission?) from esp32. After hours of searching through GitHub, setting up debug output in IDF layers, and testing multiple esp32 boards, I eventually gave up and switched to Ethernet. It has been butter smooth since then, which leads me to believe there's something in the underlying ESP-IDF code.
Perhaps you are seeing the same. Try enabling debug while connected to serial console. Would be interesting to see if you are getting similar errors.
--edit--
Here is the line number to Arduino-ESP32 in WiFiUdp.cpp.
https://github.com/espressif/arduino-esp32/blob/c2c8d189928386c872aa6cd7ba7a87c8019c5663/libraries/WiFi/src/WiFiUdp.cpp#L183
Other tickets for reference ..
- https://github.com/espressif/arduino-esp32/issues/5455
- https://github.com/espressif/arduino-esp32/issues/2768
- https://github.com/espressif/arduino-esp32/issues/1237
- https://github.com/espressif/arduino-esp32/issues/845
@guardmedia thank you for the amazingly thorough response. I will dig into your links a bit more and do some research!
I already have 2 ethernet boards on order, unfortunately it's going to be a somewhat serious pain in the butt to get one of the ESP's wired up.
Since this seems to be an ESP library issue, should I close the bug?
Digging into it deeper, this looks like a buffer overflow in the WiFi drivers thrown by LWIP. Rate limiting may avoid the issue when using WiFi. Final analysis below for anyone else who stumbles into this issue.
Hooking up my test device to PuTTY with full debug captures the following errors. This message repeats until data recovers, and correlates to the time when DDP data temporarily stops transmitting.
[E][WiFiUdp.cpp:183] endPacket(): could not send data: 12
WiFiUDP.endPacket returned an error
[E][WiFiUdp.cpp:183] endPacket(): could not send data: 12
WiFiUDP.endPacket returned an error
etc...
From LWIP, error result 12
correlates to ENOMEM
condition, which indicates an "out of memory" issue.
https://github.com/lwip-tcpip/lwip/blob/b0e347158d8db640c6891f9f31f4e6d19dca200b/src/include/lwip/errno.h#L59
...
#define EAGAIN 11 /* Try again */
#define ENOMEM 12 /* Out of memory */
#define EACCES 13 /* Permission denied */
...
Finally, leading to Espressif documentation on lwIP which calls out this limitation when sending data repeatedly on a UDP socket.
https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/lwip.html#limitations
Limitations
Calling
send()
orsendto()
repeatedly on a UDP socket may eventually fail witherrno
equal toENOMEM
. This is a limitation of buffer sizes in the lower layer network interface drivers. If all driver transmit buffers are full then UDP transmission will fail. Applications sending a high volume of UDP datagrams who don’t wish for any to be dropped by the sender should check for this error code and re-send the datagram after a short delay.Increasing the number of TX buffers in the Wi-Fi or Ethernet project configuration (as applicable) may also help.
This can be avoided by rate limiting UDP traffic. Additional consideration is needed for the number of pixels being driven, since DDP only allows 1440 bytes (480 pixels) per packet.
After benchmarking various data rates, I was able to implement simple rate limiter based on millis() targeting 1 packet per 12 ms which seems to prevent the issue from being shown. Although 11 ms / frame mostly worked, there were still occasional ENOMEM
errors thrown with my particular test device. Note, this limits displays to 960 pixels (or 2 packets) for us to be able to yield roughly 42 fps. This seems to be required even for small displays due to presumably variable data "burst" if too many frames are processed at once.
For comparison, when using Ethernet the ESP32 is able to easily drive 4096 pixels to the maximum framerate (~42 FPS) over DDP without issue. The same on WiFi with this delay introduced would drop to only ~10 FPS.
- 480 pixels = 42 FPS
- 960 pixels =~ 42 FPS
- 1440 pixels =~ 28 FPS
- 2880 pixels =~ 14 FPS
- 4096 pixels =~ 10 FPS
In summary, this error is related to specifically UDP buffers in the WiFi driver which are non-blocking and cause underlying code to run into out of memory issues. ESP-IDF could be recompiled with increased TX buffers, however that may only extend the time between glitches instead of avoiding them all together.
Excellent find and diagnose! Thank you. Do you have a recommendation on where to put rate limiting? Should it be bus (higher level) or network output (lower level)? Or should we just skip a packet if buffers are full?
I am tagging @pbolduc who originally developed network code to be aware of the issue (and I wrote the bus code).
EDIT: I see you already submitted a PR. Thanks, will have a look..
If I am reading the diagnosis correctly, the sender is exhausting the available UDP TX buffers. I am trying to piece together why a reboot of the receiving device works to clear the issue for me (at least temporarily).
Is there some condition where if the target device becomes unavailable on the network that UDP sends fail completely and the buffer is cleared? I will attempt instead to reboot the transmitting device and see if it also helps to clear the issue up. Again thanks for the fantastic work here in tracking this down @guardmedia!
I am wondering if there is a way to make the RX end more loss-resilient. Could it notify the TX side when it sees large gaps in data caused by packet loss?
UDP uses send and forget principle. Does not deal with acknowledgements from client.
Very interesting and thank you @guardmedia for digging into the cause of the issue. I wonder at what rate limit would make it stable and NOT have people complain about the performance or compare it to Falcon Player/xLights. I dont have time today look into this further. But perhaps tomorrow after work when my vacation time starts.
UDP uses send and forget principle. Does not deal with acknowledgements from client.
I understand, I was making the assumption that there would be some sequence/identifier on the receiver so it could identify drops. Ie. "last animation frame was 10, next frame received was 1000 ..." but maybe that's not the case in the DDP protocol.
[EDIT] I Googled and found that packets have a timecode, this could be used, right?
Timecode Usage
Use and support of timecodes is optional.
The Timecode is the 32 middle bits of 64-bit NTP time: 16-bits for seconds and 16-bits for fraction of seconds.
Full 64-bit NTP time is not used because DDP is a real-time protocol, and 16 bits for seconds (1092 minutes) is plenty to tell if the packet is early or late. In addition, 16 bits for fractions of a second (15 microsecond resolution) is more than enough needed for display use.
To use timecode when writing data, set Timecode flag. Append 4 bytes (MSB first) of timecode at end of packet header.
Timecode specifies the time the Display should output data at (only when used with Push flag!). Data received after specified time has past should still be placed into device frame buffer, but receiver can decide whether or not to display it, and/or how long late means. The parameters could be defined by JSON config.
I wonder at what rate limit would make it stable and NOT have people complain about the performance or compare it to Falcon Player/xLights
@pbolduc, Assuming all ESP32 boards are created equal, I found this to be stable at 12ms per UDP packet. At 11ms, there were still occasional ENOMEM
errors being thrown however it was able to recover fairly quickly. At 10ms, these errors became much more frequent, and it would take longer for the TX buffer to clear before any new UDP packets could be sent.
This delay could be further minimized by wrapping specifically the ddpUdp.endPacket()
function with logic to retry X times if it failed, but I couldn't think of a safe way to implement this without potentially causing other side effects for non ENOMEM
errors since the errno
value is not exposed outside of WiFiUDP.cpp
in the Arduino layer.
int WiFiUDP::endPacket(){
struct sockaddr_in recipient;
recipient.sin_addr.s_addr = (uint32_t)remote_ip;
recipient.sin_family = AF_INET;
recipient.sin_port = htons(remote_port);
int sent = sendto(udp_server, tx_buffer, tx_buffer_len, 0, (struct sockaddr*) &recipient, sizeof(recipient));
if(sent < 0){
log_e("could not send data: %d", errno);
return 0;
}
return 1;
}
It should also be noted that Espressif only mentions an issue with the send()
and sendto()
functions when using a UDP socket, and has no mention on RX buffers. When testing from xLights -> ESP32 (WLED) at maximum bounds (e.g 32,768 pixels for Falcon F16V4), although no errors are thrown in debug, it does seem to be rate limited by either hardware or underlying libraries to roughly 1 packet / 3 milliseconds (benchmarked at 9703 packets received over 30021 milliseconds).
numPackets:5076, millis():128188
numPackets:6623, millis():133190
numPackets:8391, millis():138192
numPackets:10251, millis():143197
numPackets:12166, millis():148204
numPackets:13485, millis():153207
numPackets:14779, millis():158209
Given that rate, WLED should be able to safely consume up to 3,693 pixels at a rate of 42 FPS, or 5,171 pixels at a rate of 30 FPS. Both are much faster than what seems possible when transmitting data outbound.
... I was making the assumption that there would be some sequence/identifier on the receiver so it could identify drops. Ie. "last animation frame was 10, next frame received was 1000 ..."
What would be the point of that except if packets arrive in incorrect order (which is unlikely but still)? Lost UDP packet is lost forever.
What would be the point of that except if packets arrive in incorrect order (which is unlikely but still)? Lost UDP packet is lost forever.
To identify unusual gaps between packets on the receiving side. Just a thought. Seems like a valuable diagnostic to me.
Useful? Diagnostic: yes (not implemented on either side though) Operations: only if having packet buffer and use packets with latency/delay. But nobody wants latency. There was a "UDP smoothing" PR a while ago that implemented packet buffering.
FYI, I was able to replace my DDP sending WLED with a wired one. The DDP receiver strip is still wireless (/w ext antenna). Performance is much better, while I sometimes saw choppiness on the receiver it was only momentary.
I believe this backs up @guardmedia's research, that it is overflow on the sending side that is causing this behaviour.
So we can close this issue? The base problem will be resolved elsewhere.
I spoke too soon, having the master (DDP sender) wired and receiver wireless, I still am seeing the issue.
Close up the issue if it's being dealt with elsewhere. Is there some other issue I can follow to track this? Thanks!
As discussed above this is ESP core issue (TCP stack), not WLED. It may be mitigated in WLED if we find appropriate solution (not the one proposed above ATM).
I seem to have a similar issue while receiving sACN at around every 3 to 5 sec the output freezes for a sec and then continues. But it only happens on v0.13, the v0.12 works well with eth and bit choppy sometimes on wifi, but that could be related to signal.
Hey! This issue has been open for quite some time without any new comments now. It will be closed automatically in a week if no further activity occurs. Thank you for using WLED!