8266: WLED keeps rebooting after 0.14.1 update.

Open Trevo525 opened this issue 1 year ago • 157 comments

What happened?

I have two instances of WLED running on two separate ESP-12F (I believe they are 8266 based?) modules. To be specific, it's this module (not the esp32, obviously). They are wired with different types of LEDs. One is with a WS2812B LED Strip and the other is a more generic LED string that has R|G|B|12V as the inputs, as opposed to 5V|Data|Ground that the first has. I'm not sure that will make a difference. But, I included it as it might be important to note. I just got them both running a week or two ago with WLED 0.14.0 and added them to Home Assistant. Everything worked as expected, I have been using presets and playing with the effects and colors on both. I even have a

However, I updated to 0.14.1 today, and the ESP connected to the generic LED strip started turning off whenever I changed the color. It goes dark for a split second, and then the light switches back to the default orange color. I kept testing and it kept happening. Then I noticed that right after this happens, the web interface is unresponsive for a moment. This leads me to believe the device is restarting.

I have been able to fix this for now by going to the update section and uploading the 0.14.0 firmware. But if I can give any assistance in finding this issue, feel free to reach out, and I will put 0.14.1 back on it if there are any logs or anything else I can provide.

To Reproduce Bug

  • Update to 0.14.1
  • Press most any button in the interface.

Expected Behavior

I would have expected it not to crash.

Install Method

Binary from WLED.me

What version of WLED?

WLED 0.14.1

Which microcontroller/board are you seeing the problem on?

ESP8266

Relevant log/trace output

No response

Anything else?

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

Trevo525 avatar Jan 14 '24 23:01 Trevo525

In the FWIW department, I'm also seeing this same behavior in Athom bulbs as well. (I'm using the recommended ESP02 image, happens across all bulb models.) In case it helps, I noticed this issue started in 0.14.1-B3 and did not occur in 0.14.1-B2, at least in my case. I figured this might have been related to the JSON buffer lock issue, but it looks like not. I can trigger it by changing profiles, either via the web interface or via Home Assistant. I don't believe it's configuration related as I tried a full factory reset in B3.

AKHwyJunkie avatar Jan 15 '24 01:01 AKHwyJunkie

Same with 8266. Continuously goes to Unavailable

Screenshot_20240115-064150_Home Assistant

chertvl avatar Jan 15 '24 04:01 chertvl

Have the same problem. Just updated through Home Assistant, and have the same symptoms as OP.

AngusMcT avatar Jan 15 '24 04:01 AngusMcT

Please remove Home Assistant integration and see if the problems persist. If they don't you may want to upgrade to ESP32 or get a special build without various features to get more free RAM on ESP8266.

BTW one way to see if WLED restarted is in Info dialog, Uptime field.
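The Uptime check can also be automated by comparing two readings taken some time apart. The following is only a sketch: `rebootedBetweenPolls` and its parameters are hypothetical names, and it assumes the uptime is available in seconds, however it is obtained.

```cpp
#include <cstdint>

// Hypothetical helper, not part of WLED.
// prevUptimeSec / currUptimeSec: uptime (seconds) read at two consecutive polls.
// elapsedSec: wall-clock seconds between the polls, measured by the observer.
// toleranceSec: slack for polling jitter.
bool rebootedBetweenPolls(uint32_t prevUptimeSec, uint32_t currUptimeSec,
                          uint32_t elapsedSec, uint32_t toleranceSec = 5) {
  // Without a restart, uptime grows by roughly elapsedSec.
  // If it falls well short of that, the counter was reset.
  return currUptimeSec + toleranceSec < prevUptimeSec + elapsedSec;
}
```

For example, a device that reported 3600 s and then 20 s one minute later would be flagged as restarted, while 3600 s followed by 3660 s would not.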

blazoncek avatar Jan 15 '24 06:01 blazoncek

I do not use ESP8266 (4MB, 2MB or 1MB) in my production setup, but I do have a lot of them around to replicate such issues. If cfg.json and presets.json are provided, then we could do so.

I have flashed two 4MB ESP8266 units since the first hour of the 0.14.1 release and kept them on debug bins. I did not notice anything strange, nor did I see any disconnection, reboot or crash in the log.

As of 1 hour ago I have added one of them to HA with a simple automation (it only sends an alert if the unit is on/off), and I can see the unit disconnecting from WiFi (ping is lost), but I could not get it to behave the same way consistently.

I blame the HA integration but cannot confirm it.

dosipod avatar Jan 15 '24 06:01 dosipod

@chertvl down-voting will not help resolving the issue.

blazoncek avatar Jan 15 '24 07:01 blazoncek

Running fine on ESP32 S2 mini, will test on a esp8266 device later when I can.

Doyle4 avatar Jan 15 '24 12:01 Doyle4

@chertvl down-voting will not help resolving the issue.

Never mind. I already downgraded to 0.14.0 and that works perfectly.

About "not helping to resolve the issue", this is what doesn't help:

  • Advising to change the electronic component of the device, without considering that this is a ready-made factory device where that is impossible.
  • Blaming the integration, which was the first thing I deactivated during debugging.
  • Advising not to use usermods. There aren't any anyway. In my case, this is a regular clean 0.14.1, which was updated via HA, and HA does not know how to update firmware with usermods. If I'm not mistaken...

I now have more time to describe the symptoms. After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:

  • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
  • APIs do not work, including HA integration.
  • The device visibly reboots every few minutes and cannot start up normally. It's missing something, maybe memory.
  • The router reports that the device is connected, the uptime is stable, there are no reconnections.

chertvl avatar Jan 15 '24 16:01 chertvl

Same here, updated 3 8266-based devices. They can’t be accessed via Web.

mxilievski avatar Jan 15 '24 21:01 mxilievski

How many LEDs are you guys using? I flashed a couple of ESP8266s from B3 to the released 0.14.1, with no more than 100 LEDs, and they are working fine. BUT I don't use H.A. at all, so I can't help on that side, sorry.

Doyle4 avatar Jan 16 '24 02:01 Doyle4

Same problem on 4 instances, with between 80 and 278 LEDs on WEMOS D1 Mini (8266). Even OTA updates no longer work reliably; I had to flash 3 instances via USB. Apparently, the update runs into a timeout.

photobix avatar Jan 16 '24 07:01 photobix

Same problem on Atom Matrix. I use Home Assistant and a RESTful command. Since updating to version 0.14.1, I receive this error.

Logger: homeassistant.components.rest_command
Source: components/rest_command/__init__.py:166
Integration: RESTful Command ([documentation](https://www.home-assistant.io/integrations/rest_command), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+rest_command%22))
First occurred: 06:19:45 (13 occurrences)
Last logged: 10:33:48

Client error. Url: http://192.168.1.xx/json/state. Error: Server disconnected

I reverted to version 0.14.0 and I no longer have errors.

WarC0zes avatar Jan 16 '24 10:01 WarC0zes

I reverted to version 0.14.0 and I no longer have errors.

How did you revert?

mxilievski avatar Jan 16 '24 10:01 mxilievski

How did you revert?

I downloaded the 0.14.0 firmware (.bin). Then connect to the ESP through the browser, go to Settings / Security & Updates, and click Manual OTA Update. Select the firmware file and update.

WarC0zes avatar Jan 16 '24 10:01 WarC0zes

@blazoncek a few thoughts on commonalities in user reports

  • It seems to only affect 8266 ("Running fine on ESP32 S2")
  • the only real change for 0.14.1 is the modified locking mechanism for WebSocket API
  • some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS
  • some problems include WDT reset (watchdog = potential infinite loop)
  • also web responses are sometimes affected ("takes ages")

We have to remember that WS responses are not running in Arduino context; on esp32 they run inside the async_tcp task, and I'm not sure how it's implemented on 8266.

I think there are a few dangerous lines in the code to lock the JSON buffer

https://github.com/Aircoookie/WLED/blob/a4a8e2614ea2b8479bb33fc53ac8ca2912f9df2c/wled00/util.cpp#L205

  • delay() does work on esp32, however is dangerous on 8266 when not in arduino context
  • on 8266, millis() does not advance outside of arduino context

@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it with

    if (jsonBufferLock) return false;

It's a temporary hack and not a proper solution, but it should help to understand whether using delay() and millis() on 8266 is the problem. If this hack helps, then I'll take some time over the next few days to implement a proper solution for requestJSONBufferLock() without busy-waiting.
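To make the idea concrete, here is a minimal standalone sketch of a lock request that fails immediately instead of waiting. It is not the actual WLED implementation: `tryRequestJSONBufferLock`, the local `jsonBufferLock` variable and the convention of storing a module id (with 255 as a fallback) are assumptions made for illustration only.

```cpp
#include <cstdint>

// Hypothetical stand-in for the global lock flag: 0 means "free",
// any other value identifies the module currently holding the buffer.
static volatile uint8_t jsonBufferLock = 0;

// Non-blocking request: never calls delay() or millis(), so it does not
// depend on those behaving the way they do inside loop().
bool tryRequestJSONBufferLock(uint8_t module) {
  if (jsonBufferLock) return false;        // already held: give up right away
  jsonBufferLock = module ? module : 255;  // 0 is reserved for "free" (assumed convention)
  return true;
}

void releaseJSONBufferLock() {
  jsonBufferLock = 0;
}
```

The check-then-set above is not atomic, so this only illustrates the control flow. The caller also has to handle a `false` result, for example by rejecting the request or retrying later from the main loop.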

softhack007 avatar Jan 16 '24 10:01 softhack007

🔺 On a different topic that goes to all who commented and contribute to this thread:

Please stop this thumbs-up thumbs-down BS. We are trying to analyse a problem and need you as users who must help us. It does not really help if you just express fuzzy feelings with thumbs.

We are trying to do engineering work here, not to entertain fans in the roman circus.

  • In case you want to add your few cents, please write a sentence in English, following basic rules of grammar.
  • If someone wants to say that he cannot even disable HA integration for a test, please write that.
  • a written "same here, too" is a lot easier to understand, instead of giving a thumbs-up to "same here".

I'm really tired of playing guessing games with emoji.

Use words, instead of throwing tags onto the wall. please.

softhack007 avatar Jan 16 '24 12:01 softhack007

I noticed this same behavior on my athom rgbw controller which is paired to home assistant.

After upgrading earlier in the afternoon everything seemed fine, but when I went to turn my lights off I noticed the wled controller wasn't responding. I tried a few times to turn them off via home assistant, and somehow got it stuck in a reboot loop that caused the leds to blink off every 30 seconds or so.

I was able to stop this by turning them off via the web UI and reverted to 0.14.0 and it's working again.

asolochek avatar Jan 16 '24 16:01 asolochek

@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it

Thanks for the detailed explanation. I tried to compile the firmware for the first time using the instructions at https://kno.wled.ge/advanced/compiling-wled/

I followed your steps, commented out the required line, and added a new one. It seemed like I did everything right, but, unfortunately, it didn’t help. The web interface still cannot load properly, or does not load at all. Sometimes it’s possible to view the status via JSON. The physical button control on the board works. The behavior has not changed. ps: HA integration was disabled before all of these.

Below are some screenshots:

[6 screenshots]

chertvl avatar Jan 16 '24 16:01 chertvl

unfortunately, it didn’t help.

It may have gotten worse. Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused. Last time I miraculously succeeded, but now I don’t.

Unfortunately, my device doesn't have a UART, and I don't have one at home either. So continue the tests without me until I find a UART to restore the device... Thanks for understanding.

chertvl avatar Jan 16 '24 17:01 chertvl

Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused. So continue the tests without me until I find a UART to restore the device... Thanks for understanding.

Thanks for helping as much as you could 🥇 and sorry about making it worse for you.

About the UART: if gpio 1 and 3 are accessible on your board, then a standard "USB-to-TTL" adapter is all you need. Like this one that's using a CH340G: https://amzn.eu/d/fZChiyZ

... or this one that's specifically made for "ESP-01S" https://amzn.eu/d/2CEAFUb

You'll also find them for cheap on ali.

softhack007 avatar Jan 16 '24 17:01 softhack007

* the only real change for 0.14.1 is the modified locking mechanism for WebSocket API

There were more changes than this. And it is not for websockets but for HTTP requests. Foremost we added PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48 to circumvent full IRAM condition. This may cause slowness in non LED display functions. Mode blending was introduced in 0.14.1-a1. It can use a lot of memory and CPU on its own.

IMO, and my own testing confirmed this, the new locking mechanism only improved stability and reduced memory corruption.

* some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS

WebSockets need plenty of heap. Constantly. Disabling them can only improve things, at the expense of a stale UI.

* some problems include WDT reset (watchdog = potential infinite loop)

I've seen WDT in non-WLED code. How to avoid it? Have no clue. Async* stuff (web server and TCP and UDP) are interrupt driven on ESP8266.

* also web responses are sometimes affected ("takes ages")

This may be attributed to a more susceptible WiFi code in newer Arduino core we use with 0.14 (I've posted my own experience in another issue detailing the resolution).

All in all, IMO if you want to run 0.14.x on ESP8266 you need to make a few compromises. Why? Because with only 16kB of RAM available (after boot) it can get crowded rather quickly in the heap.

I am going to post my own ESP8266 configuration I use on ESP01 devices which I have plenty in daily use. Unfortunately that configuration may not work for some people as it strips quite a few features out, but produces reliable and working ESP8266 environment.

[env:esp01_4m]
extends = env:esp01_1m_full
board_build.filesystem = littlefs
board_build.ldscript = ${common.ldscript_4m1m}
board_build.f_cpu = 160000000L
build_flags = ${common.build_flags_esp8266}
  -DPIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
  -D LED_BUILTIN=2
  -D WLED_DISABLE_ALEXA
  -D WLED_DISABLE_HUESYNC
  -D WLED_DISABLE_LOXONE
  -D WLED_DISABLE_ADALIGHT
  -D WLED_DISABLE_MQTT
  -D WLED_DISABLE_2D
  -D WLED_DISABLE_PXMAGIC
  -D WLED_USE_UNREAL_MATH
  -D WLED_MAX_BUSSES=2
  -D LEDPIN=2
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=3
  -D PIR_SENSOR_OFF_SEC=60
  -UWLED_USE_MY_CONFIG

My ESP01s use 4MB flash so they can be updated OTA.

If we explore the possibility to swap ESP8266 (in Wemos D1 mini format) with alternate (cheap) device (which I also did) I would recommend Lolin ESP32-S2 D1 mini with 4MB flash and 2MB PSRAM. I've also posted build environments for that elsewhere but the stock WLED doesn't differ much.

And for clarification, I will not pursue resolving this issue any further, since the ESP8266 just does not have enough resources to smoothly run everything 0.14 offers. If anybody insists on running a fully built 0.14 with an external system like Home Assistant, Alexa, Hue or MQTT, I would urge them to reconsider and build a special version with other features stripped away.

blazoncek avatar Jan 16 '24 21:01 blazoncek

@blazoncek thanks for your thoughts, and I completely forgot about "Mode blending" and other additions that really increase RAM and CPU needs.

It seems my idea about requestJSONBufferLock() did not improve it. So agreed, it could be a general issue with low RAM. Even when users see free RAM, it might be heavily fragmented. I've seen examples where the largest available block was less than 10% of total free space.
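A rough way to put a number on that is to compare the largest free block with the total free heap. This is only a sketch with a hypothetical helper name; on a device the two inputs could come from something like ESP.getFreeHeap() and ESP.getMaxFreeBlockSize() in the ESP8266 Arduino core, and the core's own fragmentation metric may be calculated differently.

```cpp
#include <cstdint>

// Hypothetical helper: percentage of the free heap that is NOT usable as
// one contiguous allocation. 0 = all free memory is a single block,
// values near 100 = badly fragmented.
uint32_t fragmentationPercent(uint32_t freeBytes, uint32_t largestBlock) {
  if (freeBytes == 0) return 0;                            // nothing free to measure
  if (largestBlock > freeBytes) largestBlock = freeBytes;  // guard against inconsistent input
  return 100 - (largestBlock * 100) / freeBytes;
}
```

With 16000 bytes free and a largest block of 1600 bytes (10% of free space), this gives 90, which is the kind of situation described above: plenty of RAM reported free, yet a larger buffer can still fail to allocate.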

Guess that we need serial monitor logs from debug builds, to find out if something can be done to improve 8266 performances - or maybe nothing can be done, and we'll soon declare 8266 as "half-dead" 😉 aka deprecated....

Edit: a few more "disable" flags to try out:

  • -D WLED_DISABLE_ESPNOW
  • -D WLED_DISABLE_WEBSOCKETS
  • -D WLED_DISABLE_MODE_BLEND

.... and a simple one: go to LEDs settings, uncheck "Use global LED buffer"

softhack007 avatar Jan 16 '24 21:01 softhack007

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

If you are not using GPIO1 or GPIO2 or GPIO3 for digital led output then CPU has to keep feeding LEDs. This in turn reduces performance for everything else.

If you use PWM LEDs make sure you only use GPIO4 or GPIO12 or GPIO14 or GPIO15 (as specified by Espressif technical documentation, https://www.espressif.com/sites/default/files/documentation/esp8266-technical_reference_en.pdf). Do not forget PWM signal requires NMI to be driven, hence uses CPU.
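As a quick reference, the pin advice above can be written down as a small check. The pin sets are copied from this comment only, not from WLED or NeoPixelBus source, and all names below are hypothetical.

```cpp
#include <cstdint>

enum class OutputKind { DigitalLed, Pwm };

// Digital LED output pins that, per the comment above, do not require the
// CPU to keep feeding the LEDs.
bool isOffloadedDigitalPin(uint8_t gpio) {
  return gpio == 1 || gpio == 2 || gpio == 3;
}

// PWM pins recommended in the comment above.
bool isRecommendedPwmPin(uint8_t gpio) {
  return gpio == 4 || gpio == 12 || gpio == 14 || gpio == 15;
}

// True when the chosen pin matches the recommendation for its output type.
bool pinChoiceLooksGood(OutputKind kind, uint8_t gpio) {
  return kind == OutputKind::DigitalLed ? isOffloadedDigitalPin(gpio)
                                        : isRecommendedPwmPin(gpio);
}
```

A digital strip on GPIO2 passes this check, while the same strip on GPIO5 would fall back to CPU-driven output under the reasoning above.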

blazoncek avatar Jan 17 '24 06:01 blazoncek

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

My test case here is a single strip of 110 WS2812Bs, using a 0_15 branch derived build. Bit-banging for this many LEDs can take several milliseconds with interrupts disabled, which I believe can overflow some of the wifi hardware queues, depending on the amount of traffic on the network. I'm working on hacking some of the interrupt tolerance ideas from FastLED in to NeoPixelBus to see if I can mitigate it.

If a setup has more LEDs on a bit-banging pin, or a busier network, it might trip problems sooner. Sometimes this might manifest as hard reboots like I'm seeing; it's also possible it manifests as a WiFi disconnect. (I'm actually rather surprised I haven't seen that in my testing, to be honest.)
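The "several milliseconds" figure follows from the WS2812B data rate. As a back-of-the-envelope sketch (constant and function names are hypothetical), assuming the standard 800 kHz timing of 1.25 µs per bit and 24 bits per RGB LED:

```cpp
#include <cstdint>

// WS2812B at 800 kHz: 1.25 us per bit, expressed in nanoseconds.
constexpr uint32_t kBitNanos = 1250;
// 8 bits each for green, red and blue (an RGBW strip would use 32).
constexpr uint32_t kBitsPerLed = 24;

// Time the data line is actively clocked for one frame, in microseconds.
// Excludes the reset/latch gap between frames.
constexpr uint32_t ws2812FrameMicros(uint32_t ledCount) {
  return (ledCount * kBitsPerLed * kBitNanos) / 1000;
}
```

For the 110-LED strip this works out to 3300 µs, so about 3.3 ms per frame during which a bit-banged output cannot be interrupted. The 278-LED setup reported earlier in the thread would take 8340 µs if driven the same way.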

I will try a 0.14.1 build tonight and see if it behaves differently for me than the 0_15 development branch. It's quite possible this is a different issue than the one I've been chasing.

willmmiles avatar Jan 17 '24 15:01 willmmiles

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

FWIW, I'm seeing occasional resets on 8266 with 0.14.1 and use LPD8806, so no bitbanging involved. (But it's way rarer than what people are reporting here, I have 48h uptime right now)

afflux avatar Jan 18 '24 07:01 afflux

use LPD8806, so no bitbanging involved

how do you know it is not? If you are using GPIO13 & GPIO14 then yes it uses HW to accelerate output otherwise you are using SW (CPU) to drive clock and data.

blazoncek avatar Jan 18 '24 08:01 blazoncek

how do you know it is not?

Because I explicitly checked the source when I set it up, and therefore assigned data to GPIO13 and clk to GPIO14.

afflux avatar Jan 18 '24 17:01 afflux

Because I explicitly checked the source when I set it up, and therefore assigned data to GPIO13 and clk to GPIO14.

Good. Now try to catch crash dump on serial if you can. Please.

blazoncek avatar Jan 18 '24 17:01 blazoncek

I'll have a look. Do I need to compile a debug build for this, or will the normal esp8266 crashdump suffice?

afflux avatar Jan 18 '24 17:01 afflux

That would be best. And please add exception decoder to build environment.

blazoncek avatar Jan 18 '24 17:01 blazoncek