WLED icon indicating copy to clipboard operation
WLED copied to clipboard

DDP receiving strip effect spontaneous effect problems (animation lagged and choppy)

Open blark opened this issue 3 years ago • 19 comments

What happened?

I am using two ESP32s with 0.13.0-b6. My wifi connection is good, I am using external antennas and have good RSSI.

Strip 1 has ~550 LEDs Strip 2 is connected via DDP and has ~200 LEDs

Effects run fine when I first select them, but then after some time (this is not consistent) they become choppy and lagged on the second DDP strip. Stopping the animation and restarting it has no effect, the 2nd controller must be rebooted -- then it immediately begins working fine again. Eventually the effect will degrade again and exhibit the same choppiness.

To Reproduce Bug

Set up a similar setup to the one described above, and wait some time while an effect is active.

I noticed that when I scroll through the list of effects and change them somewhat quickly (to see what they look like) it seems to speed up the onset of this behaviour.

Expected Behavior

Effects should work properly on both strips.

Install Method

Binary from WLED.me

What version of WLED?

WLED 0.13.0-b6

Which microcontroller/board are you seeing the problem on?

ESP32

Relevant log/trace output

No response

Anything else?

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

blark avatar Dec 14 '21 15:12 blark

What is the time-frame for choppiness? I am running this on Ethernet enabled ESP32 with 1000 client (network) pixels on 3 ESPs (ESP01, Wemos D1 mini, ESP32) and not seeing any choppiness or stutter or other issues.

blazoncek avatar Dec 14 '21 16:12 blazoncek

@blark, I had the same issue when using WiFi, and ultimately resolved it by switching to Ethernet connected instead.

Like you, I know my WiFi is good (enterprise equipment, 10g lan, isolated RF, etc..), but couldn't figure out why there were occasional glitches or freezing. Eventually I setup Wireshark, and confirmed occasional delays between packets which aligned to the display issues seen. Tracing this back to the access point, I confirmed the delay was actually coming from the esp32. No connection or packet errors - it just wasn't sending data every once and a while.

After enabling debug in WLED, I noticed errors thrown by WifiUDP code in Arduino-ESP32. Although it recovers, every time the error is printed this correlated to packet loss (omission?) from esp32. After hours of searching through GitHub, setting up debug output in IDF layers, and testing multiple esp32 boards, I eventually gave up and switched to Ethernet. It has been butter smooth since then, which leads me to believe there's something in the underlying ESP-IDF code.

Perhaps you are seeing the same. Try enabling debug while connected to serial console. Would be interesting to see if you are getting similar errors.

--edit--

Here is the line number to Arduino-ESP32 in WiFiUdp.cpp.

https://github.com/espressif/arduino-esp32/blob/c2c8d189928386c872aa6cd7ba7a87c8019c5663/libraries/WiFi/src/WiFiUdp.cpp#L183

Other tickets for reference ..

  • https://github.com/espressif/arduino-esp32/issues/5455
  • https://github.com/espressif/arduino-esp32/issues/2768
  • https://github.com/espressif/arduino-esp32/issues/1237
  • https://github.com/espressif/arduino-esp32/issues/845

fastbytes avatar Dec 15 '21 03:12 fastbytes

@guardmedia thank you for the amazingly thorough response. I will dig into your links a bit more and do some research!

I already have 2 ethernet boards on order, unfortunately it's going to be a somewhat serious pain in the butt to get one of the ESP's wired up.

Since this seems to be an ESP library issue, should I close the bug?

blark avatar Dec 16 '21 02:12 blark

Digging into it deeper, this looks like a buffer overflow in the WiFi drivers thrown by LWIP. Rate limiting may avoid the issue when using WiFi. Final analysis below for anyone else who stumbles into this issue.

Hooking up my test device to PuTTY with full debug captures the following errors. This message repeats until data recovers, and correlates to the time when DDP data temporarily stops transmitting.

[E][WiFiUdp.cpp:183] endPacket(): could not send data: 12
WiFiUDP.endPacket returned an error
[E][WiFiUdp.cpp:183] endPacket(): could not send data: 12
WiFiUDP.endPacket returned an error
etc...

From LWIP, error result 12 correlates to ENOMEM condition, which indicates an "out of memory" issue.

https://github.com/lwip-tcpip/lwip/blob/b0e347158d8db640c6891f9f31f4e6d19dca200b/src/include/lwip/errno.h#L59

...
#define  EAGAIN          11  /* Try again */
#define  ENOMEM          12  /* Out of memory */
#define  EACCES          13  /* Permission denied */
...

Finally, leading to Espressif documentation on lwIP which calls out this limitation when sending data repeatedly on a UDP socket.

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/lwip.html#limitations

Limitations

Calling send() or sendto() repeatedly on a UDP socket may eventually fail with errno equal to ENOMEM. This is a limitation of buffer sizes in the lower layer network interface drivers. If all driver transmit buffers are full then UDP transmission will fail. Applications sending a high volume of UDP datagrams who don’t wish for any to be dropped by the sender should check for this error code and re-send the datagram after a short delay.

Increasing the number of TX buffers in the Wi-Fi or Ethernet project configuration (as applicable) may also help.

This can be avoided by rate limiting UDP traffic. Additional consideration is needed for the number of pixels being driven, since DDP only allows 1440 bytes (480 pixels) per packet.

After benchmarking various data rates, I was able to implement simple rate limiter based on millis() targeting 1 packet per 12 ms which seems to prevent the issue from being shown. Although 11 ms / frame mostly worked, there were still occasional ENOMEM errors thrown with my particular test device. Note, this limits displays to 960 pixels (or 2 packets) for us to be able to yield roughly 42 fps. This seems to be required even for small displays due to presumably variable data "burst" if too many frames are processed at once.

For comparison, when using Ethernet the ESP32 is able to easily drive 4096 pixels to the maximum framerate (~42 FPS) over DDP without issue. The same on WiFi with this delay introduced would drop to only ~10 FPS.

  • 480 pixels = 42 FPS
  • 960 pixels =~ 42 FPS
  • 1440 pixels =~ 28 FPS
  • 2880 pixels =~ 14 FPS
  • 4096 pixels =~ 10 FPS

In summary, this error is related to specifically UDP buffers in the WiFi driver which are non-blocking and cause underlying code to run into out of memory issues. ESP-IDF could be recompiled with increased TX buffers, however that may only extend the time between glitches instead of avoiding them all together.

fastbytes avatar Dec 16 '21 06:12 fastbytes

Excellent find and diagnose! Thank you. Do you have a recommendation on where to put rate limiting? Should it be bus (higher level) or network output (lower level)? Or should we just skip a packet if buffers are full?

I am tagging @pbolduc who originally developed network code to be aware of the issue (and I wrote the bus code).

EDIT: I see you already submitted a PR. Thanks, will have a look..

blazoncek avatar Dec 16 '21 07:12 blazoncek

If I am reading the diagnosis correctly, the sender is exhausting the available UDP TX buffers. I am trying to piece together why a reboot of the receiving device works to clear the issue for me (at least temporarily).

Is there some condition where if the target device becomes unavailable on the network that UDP sends fail completely and the buffer is cleared? I will attempt instead to reboot the transmitting device and see if it also helps to clear the issue up. Again thanks for the fantastic work here in tracking this down @guardmedia!

I am wondering if there is a way to make the RX end more loss-resilient. Could it notify the TX side when it sees large gaps in data caused by packet loss?

blark avatar Dec 16 '21 15:12 blark

UDP uses send and forget principle. Does not deal with acknowledgements from client.

blazoncek avatar Dec 16 '21 16:12 blazoncek

Very interesting and thank you @guardmedia for digging into the cause of the issue. I wonder at what rate limit would make it stable and NOT have people complain about the performance or compare it to Falcon Player/xLights. I dont have time today look into this further. But perhaps tomorrow after work when my vacation time starts.

pbolduc avatar Dec 16 '21 16:12 pbolduc

UDP uses send and forget principle. Does not deal with acknowledgements from client.

I understand, I was making the assumption that there would be some sequence/identifier on the receiver so it could identify drops. Ie. "last animation frame was 10, next frame received was 1000 ..." but maybe that's not the case in the DDP protocol.

[EDIT] I Googled and found that packets have a timecode, this could be used, right?

Timecode Usage

Use and support of timecodes is optional.
The Timecode is the 32 middle bits of 64-bit NTP time: 16-bits for seconds and 16-bits for fraction of seconds.

Full 64-bit NTP time is not used because DDP is a real-time protocol, and 16 bits for seconds (1092 minutes) is plenty to tell if the packet is early or late. In addition, 16 bits for fractions of a second (15 microsecond resolution) is more than enough needed for display use.

To use timecode when writing data, set Timecode flag. Append 4 bytes (MSB first) of timecode at end of packet header.

Timecode specifies the time the Display should output data at (only when used with Push flag!). Data received after specified time has past should still be placed into device frame buffer, but receiver can decide whether or not to display it, and/or how long late means. The parameters could be defined by JSON config.

blark avatar Dec 16 '21 21:12 blark

I wonder at what rate limit would make it stable and NOT have people complain about the performance or compare it to Falcon Player/xLights

@pbolduc, Assuming all ESP32 boards are created equal, I found this to be stable at 12ms per UDP packet. At 11ms, there were still occasional ENOMEM errors being thrown however it was able to recover fairly quickly. At 10ms, these errors became much more frequent, and it would take longer for the TX buffer to clear before any new UDP packets could be sent.

This delay could be further minimized by wrapping specifically the ddpUdp.endPacket() function with logic to retry X times if it failed, but I couldn't think of a safe way to implement this without potentially causing other side effects for non ENOMEM errors since the errno value is not exposed outside of WiFiUDP.cpp in the Arduino layer.

int WiFiUDP::endPacket(){
  struct sockaddr_in recipient;
  recipient.sin_addr.s_addr = (uint32_t)remote_ip;
  recipient.sin_family = AF_INET;
  recipient.sin_port = htons(remote_port);
  int sent = sendto(udp_server, tx_buffer, tx_buffer_len, 0, (struct sockaddr*) &recipient, sizeof(recipient));
  if(sent < 0){
    log_e("could not send data: %d", errno);
    return 0;
  }
  return 1;
}

It should also be noted that Espressif only mentions an issue with the send() and sendto() functions when using a UDP socket, and has no mention on RX buffers. When testing from xLights -> ESP32 (WLED) at maximum bounds (e.g 32,768 pixels for Falcon F16V4), although no errors are thrown in debug, it does seem to be rate limited by either hardware or underlying libraries to roughly 1 packet / 3 milliseconds (benchmarked at 9703 packets received over 30021 milliseconds).

numPackets:5076, millis():128188
numPackets:6623, millis():133190
numPackets:8391, millis():138192
numPackets:10251, millis():143197
numPackets:12166, millis():148204
numPackets:13485, millis():153207
numPackets:14779, millis():158209

Given that rate, WLED should be able to safely consume up to 3,693 pixels at a rate of 42 FPS, or 5,171 pixels at a rate of 30 FPS. Both are much faster than what seems possible when transmitting data outbound.

fastbytes avatar Dec 17 '21 05:12 fastbytes

... I was making the assumption that there would be some sequence/identifier on the receiver so it could identify drops. Ie. "last animation frame was 10, next frame received was 1000 ..."

What would be the point of that except if packets arrive in incorrect order (which is unlikely but still)? Lost UDP packet is lost forever.

blazoncek avatar Dec 17 '21 05:12 blazoncek

What would be the point of that except if packets arrive in incorrect order (which is unlikely but still)? Lost UDP packet is lost forever.

To identify unusual gaps between packets on the receiving side. Just a thought. Seems like a valuable diagnostic to me.

blark avatar Dec 17 '21 07:12 blark

Useful? Diagnostic: yes (not implemented on either side though) Operations: only if having packet buffer and use packets with latency/delay. But nobody wants latency. There was a "UDP smoothing" PR a while ago that implemented packet buffering.

blazoncek avatar Dec 17 '21 08:12 blazoncek

FYI, I was able to replace my DDP sending WLED with a wired one. The DDP receiver strip is still wireless (/w ext antenna). Performance is much better, while I sometimes saw choppiness on the receiver it was only momentary.

I believe this backs up @guardmedia's research, that it is overflow on the sending side that is causing this behaviour.

blark avatar Dec 21 '21 17:12 blark

So we can close this issue? The base problem will be resolved elsewhere.

blazoncek avatar Dec 21 '21 17:12 blazoncek

I spoke too soon, having the master (DDP sender) wired and receiver wireless, I still am seeing the issue.

Close up the issue if it's being dealt with elsewhere. Is there some other issue I can follow to track this? Thanks!

blark avatar Dec 22 '21 01:12 blark

As discussed above this is ESP core issue (TCP stack), not WLED. It may be mitigated in WLED if we find appropriate solution (not the one proposed above ATM).

blazoncek avatar Dec 22 '21 06:12 blazoncek

I seem to have a similar issue while receiving sACN at around every 3 to 5 sec the output freezes for a sec and then continues. But it only happens on v0.13, the v0.12 works well with eth and bit choppy sometimes on wifi, but that could be related to signal.

NormenL avatar Jan 19 '22 10:01 NormenL

Hey! This issue has been open for quite some time without any new comments now. It will be closed automatically in a week if no further activity occurs. Thank you for using WLED!

stale[bot] avatar Jul 10 '22 22:07 stale[bot]