WLED ESP-NOW sync (including audioreactive) with slave search for master's channel

Attempt to resolve #4063 by forcing slave unit to scan WiFi channels for master's beacon/heartbeat.

Sync master will broadcast a beacon/heartbeat every 5s on the channel it is connected to, or on which it has its AP open.

Sync slave will listen on a channel for the beacon/heartbeat for 10s, then select another channel and listen there, repeating until master is heard. Each time master is heard, it will postpone switching channels for another 10s.
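The listen-then-hop behaviour described above can be sketched as a small state machine. This is only an illustration, not the PR's code: the class name `MasterScanner`, the 13-channel range and the method names are assumptions, and the actual radio calls are left out so the logic can be compiled on a host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper; names and the 1..13 channel range are assumptions.
class MasterScanner {
public:
  static constexpr uint32_t kListenMs   = 10000;  // dwell time per channel (10s, from the description)
  static constexpr uint8_t  kMaxChannel = 13;     // assumed 2.4 GHz channel range 1..13

  MasterScanner(uint8_t startChannel, uint32_t nowMs)
    : channel_(startChannel), deadlineMs_(nowMs + kListenMs) {
    assert(startChannel >= 1 && startChannel <= kMaxChannel);
  }

  // Call when a beacon/heartbeat from the master arrives:
  // stay on the current channel for another full listen window.
  void onHeartbeat(uint32_t nowMs) { deadlineMs_ = nowMs + kListenMs; }

  // Call from the main loop; returns the channel the radio should be on.
  uint8_t tick(uint32_t nowMs) {
    // Signed difference keeps the comparison valid across millis() wrap-around.
    if (static_cast<int32_t>(nowMs - deadlineMs_) >= 0) {
      channel_ = static_cast<uint8_t>((channel_ % kMaxChannel) + 1);  // 13 wraps to 1
      deadlineMs_ = nowMs + kListenMs;
    }
    return channel_;
  }

private:
  uint8_t  channel_;
  uint32_t deadlineMs_;
};
```

On a device, the value returned by `tick(millis())` would be handed to whatever call sets the ESP-NOW channel, and `onHeartbeat(millis())` would be called from the receive callback.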

Jul 21 '24 19:07 blazoncek

@dedehai & @steveeisner you are welcome to review.

Jul 21 '24 19:07 blazoncek

ESPNow is really thrifty with bytes, it doesn't use the antenna for very long and there's no danger of flooding available bandwidth - especially with a feature someone has to turn on. Why not set your beacon ping to 1 per sec, or even less? Then the remotes can scan channels more quickly than one per 10sec. At 10sec, it could take 2+ minutes to find & sync (worst case.)

Also, when the beacon disappears (device crashes or leaves the vicinity) the most likely scenario is that it's going to return on the same channel, right? Once a remote detects the beacon it should be very biased to listen on that channel more than any other. Imagine the scenario where someone unplugs or reflashes the master and plugs it back in a bit more than 10 seconds later. We wouldn't want all the remotes to go hunting through channels for 130 seconds before they sync again. We can probably design this to explore other channels while mostly paying attention to the same one they were using, in the case where a sync had been established.
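One possible way to bias the search toward the previously used channel is to revisit it between every other channel in the sweep. The sketch below is an assumption about how that could look, not something taken from the PR; `buildScanCycle` and the 13-channel range are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint8_t kChannels = 13;  // assumed channel range 1..13

// Build one full scan cycle.
// lastKnown == 0 means "no history": plain sweep 1..13.
// Otherwise the last-known channel is listened on before each of the other channels.
std::vector<uint8_t> buildScanCycle(uint8_t lastKnown) {
  assert(lastKnown <= kChannels);
  std::vector<uint8_t> cycle;
  for (uint8_t ch = 1; ch <= kChannels; ++ch) {
    if (ch == lastKnown) continue;                    // already interleaved below
    if (lastKnown != 0) cycle.push_back(lastKnown);   // revisit the old channel first
    cycle.push_back(ch);
  }
  return cycle;
}
```

With `lastKnown = 11` the cycle starts 11, 1, 11, 2, ... so a master that comes back on its old channel is at most two listen windows away, at the cost of a sweep over the other channels taking roughly twice as long.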

Jul 22 '24 05:07 SteveEisner

@SteveEisner thank you for thorough review and useful feedback.

I can only agree with you on every point you provided. This POC was a product of 2h work so it is bound to be flawed. Still, it should show the basis of how to implement "searching for master" that does not require a lot of new code. 😄

Indeed I was thinking of using slaves as relays but that is a bit more involved as current sync message may span several ESP-NOW packets and each needs to be re-broadcast (possibly on several channels). In such case each packet would also need to carry master's signature so that slaves could recognise it.

As for the 5s heartbeat, it was chosen at random as I really have no feeling for how much stress ESP-NOW causes. If it is safe to broadcast a heartbeat every second then it can be done so.

Having master's channel embedded in heartbeat would only make sense if slaves would retransmit heartbeat on all channels IMO. I see no real benefit for now. I do agree, though, that additional info should be added into heartbeat packet (including channel).

FYI the version of QuickEspNow I did check with had a limitation of only 2 messages in the queue (one transmitting and one waiting). If you used send() while the second message was still in queue it would have been overwritten. I do not know if this is still the case.

Jul 22 '24 08:07 blazoncek

> This POC was a product of 2h work so it is bound to be flawed. Still, it should show the basis of how to implement "searching for master" that does not require a lot of new code. 😄

No worries! I expected it to be simple given your comments in discord. Reviewing it as a prototype :)

> Indeed I was thinking of using slaves as relays

I was only thinking about them possibly being relays for the beacon, rather than the full sync data. Since it is "expensive" for the master to change channels (might kill wifi, etc) and "cheap" for the remotes, it seems like once they're connected, they could every once in a while switch to another channel and broadcast the beacon, on behalf of the master. Then switch back to listen for sync packets. Having more beacons get sent would mean other remotes would find the beacon faster, but they would have to follow the beacon to the proper channel (not just stop scanning)

The downside to this is that they might miss sync packets while they're switched away. I'm imagining this is a pretty quick operation though (espnow is very fast)

> As for the 5s heartbeat, it was chosen at random as I really have no feeling for how much stress ESP-NOW causes. If it is safe to broadcast a heartbeat every second then it can be done so.

I think it is - ESPNow has a 1 Mbit/s theoretical limit, and these packets will be < 30 bytes. So no real impact on the network. And in my branch I don't detect any lag or framerate drag while doing multiple ESPNow sends per second. Channel switching, I'm not sure about.
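As a rough back-of-the-envelope check of those numbers (payload bits only; preamble, 802.11 headers and ESP-NOW framing are ignored, so real airtime is higher), the calculation could be written as:

```cpp
#include <cstdint>

constexpr uint32_t kBitRate          = 1000000;  // 1 Mbit/s, the figure quoted above
constexpr uint32_t kPayloadBytes     = 30;       // upper bound quoted above
constexpr uint32_t kBeaconsPerSecond = 1;        // the suggested beacon rate

// Airtime of the payload alone, in microseconds.
constexpr uint32_t payloadAirtimeUs(uint32_t bytes) {
  return static_cast<uint32_t>(static_cast<uint64_t>(bytes) * 8u * 1000000u / kBitRate);
}

constexpr uint32_t kAirtimeUs = payloadAirtimeUs(kPayloadBytes);
// Microseconds of airtime per second of wall time, i.e. parts per million.
constexpr uint32_t kDutyPpm = kAirtimeUs * kBeaconsPerSecond;

static_assert(kAirtimeUs == 240, "30 bytes at 1 Mbit/s take 240 us");
static_assert(kDutyPpm == 240, "one beacon per second is about 0.024% of airtime");
```

Even allowing a generous multiple for the ignored overhead, a one-per-second beacon stays far below one percent of channel time under these assumptions.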

> Having master's channel embedded in heartbeat would only make sense if slaves would retransmit heartbeat on all channels IMO. I see no real benefit for now. I do agree, though, that additional info should be added into heartbeat packet (including channel).

My various ideas about the channel had to do with slightly more complex operations. For instance the relay I mentioned above. Or when the master wants to send its remotes to another channel - it could send a beacon on its current channel, directing them to the new channel. This would be useful at times when the master is changing its setting & doesn't want to lose the remotes.

> FYI the version of QuickEspNow I did check with had a limitation of only 2 messages in the queue (one transmitting and one waiting). If you used send() while the second message was still in queue it would have been overwritten. I do not know if this is still the case.

I believe this, but sending is very fast and you're not likely to overwrite the queue unless you're sending very quickly or the network is very noisy. But since ESPNow is an un-acknowledged send, you'll need to assume that the message might not make it to the remotes for a variety of reasons. In my branch I have the followers rebroadcast the leader at random times. This makes them all more noisy but has eliminated message loss.

Unacknowledged isn't any different than UDP, but the effect of a missed beacon is worse than a single missed UDP sync packet. If a remote misses the beacon it might have to cycle through 13 more channels before it listens again. So I figured the best thing to do is to "flood" the channel with beacons, preferably with the help of remotes.

Jul 22 '24 09:07 SteveEisner

To cool off from legal issues, I've spent yesterday chasing the idea presented by @ChuckMash on Discord: "Use slave to ping master (unicast) on every channel and use send callback function to check for successful delivery and so determine master's channel."

While the above should work in theory I was unsuccessful in receiving any acknowledgement that master received a ping. While investigating why I noticed that WiFi channel used by slave changed very often and had to be reset on each unicast. Thinking of the reasons it may be that while the slave is in STA mode and is periodically trying to connect to WiFi it scans all channels where a possible known WiFi may be.

This renders the unicast method unusable and fixing channel in STA mode impossible.

I noticed another oddity while slave was in (temporary) AP mode. Even though the AP channel selected was 1 it did occasionally receive a beacon/heartbeat from master (mine used channel 11) so I tried to set WiFi.setMode(WIFI_IF_AP) (and same for QuickEspNow) between re-connection attempts (had to prolong them to 60s). In this case slave was successful in detecting master's heartbeat and set channel accordingly. The side effect was that in such case slave was not able to rejoin WiFi once it moved in range.

I will push new code later for you to try out and possibly spot an error on my part.

BTW connection (and re-connection) logic is convoluted beyond sanity. It would benefit if someone could make it simpler (I have difficulty following execution even after 2 years of trying to understand it; most comments in that part of code are mine while I was trying to decipher the logic behind).

Jul 25 '24 05:07 blazoncek

It may be worth pursuing raw ESP-NOW API instead of QuickEspNow library as WLED's use of ESP-NOW is limited to broadcast and known slave-to-master unicast.

Jul 25 '24 17:07 blazoncek

> I noticed another oddity while slave was in (temporary) AP mode. Even though the AP channel selected was 1 it did occasionally receive a beacon/heartbeat from master (mine used channel 11) so I tried to set WiFi.setMode(WIFI_IF_AP) (and same for QuickEspNow) between re-connection attempts (had to prolong them to 60s). In this case slave was successful in detecting master's heartbeat and set channel accordingly. The side effect was that in such case slave was not able to rejoin WiFi once it moved in range.

For this part, this could be the dreaded ESP-NOW channel cross talk, perhaps have the heartbeat itself contain the channel. That way even if the message was received on the wrong channel, it contains the correct channel to use.
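A minimal sketch of that idea, assuming a made-up 5-byte heartbeat layout (4 marker bytes plus the channel). The actual packet format used by the PR is not shown in this thread, so `kMagic`, the offsets and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical layout: bytes 0..3 = marker, byte 4 = master's channel.
constexpr uint8_t kMagic[4]     = {'W', 'L', 'H', 'B'};  // made-up marker value
constexpr size_t  kHeartbeatLen = 5;

// Writes a heartbeat into out; returns bytes written, or 0 if the buffer is too small.
size_t encodeHeartbeat(uint8_t channel, uint8_t* out, size_t outLen) {
  if (outLen < kHeartbeatLen) return 0;
  std::memcpy(out, kMagic, sizeof(kMagic));
  out[4] = channel;
  return kHeartbeatLen;
}

// Returns the channel to lock to, or 0 if the data is not a valid heartbeat.
// The channel the packet happened to arrive on is deliberately not consulted.
uint8_t channelFromHeartbeat(const uint8_t* data, size_t len) {
  if (len < kHeartbeatLen) return 0;
  if (std::memcmp(data, kMagic, sizeof(kMagic)) != 0) return 0;
  const uint8_t ch = data[4];
  return (ch >= 1 && ch <= 14) ? ch : 0;  // 1..14 are the valid 2.4 GHz channel numbers
}
```

A receiver built this way would switch to the channel named in the payload, so a heartbeat that leaks across from a neighbouring channel still points at the right one.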

Jul 25 '24 19:07 ChuckMash

> perhaps have the heartbeat itself contain the channel

That's how it is made now and no, there was no crosstalk encountered on my system. The WiFi.channel() reported 11 and master.channel was also 11. Somehow slave switched to channel 11 without any call to setChannel() and then reverted back or to a different channel.

Jul 25 '24 19:07 blazoncek

In order to 'judge' different approaches and get everyone on the same page: @blazoncek how about writing down some basic requirements you have in mind? Or one or two scenarios where ESP-NOW sync would be used? From what I understand, the idea is to sync devices that are not connected to the 'master WiFi', i.e. (some) slaves are in AP mode. You mentioned installations at festivals like 'Burning Man', which I think is a good example. Suggestions for requirements off the top of my head:
- how are devices linked to the master (MAC address, random magic key, ...)?
- what is an acceptable delay to initial sync of slaves (what are realistic scenarios where it matters)?
- is the (last known) master channel saved, and how is it prioritized in a sync search (assuming slaves channel hop) after bootup?

Jul 26 '24 07:07 DedeHai

Ok, a brief description how ESP-NOW sync should work and what features should it include:

first, it has to do everything regular UDP sync does (without the need for WiFi)
it has to work even if master is connected to a WiFi while slaves are not connected to WiFi (slaves connected to WiFi will use UDP sync; if they are connected to unrelated WiFi then no sync will be performed for them)
slaves need to be configured to open temporary AP after boot if no WiFi found but can (optionally) have WiFi configured
slaves need to find/scan for master's channel on their own (as it may be connected to WiFi and hence unable to choose channel at will)
once the channel is found slave should remain locked to this channel until master's beacon/heartbeat is not heard for prolonged time
if master is not heard of, scanning for its beacon/heartbeat should resume (possibly with the last known channel)
scanning should not take more than a minute
beacon should not flood the channel
there should be no impact on WLED performance or it should be minimal (hiccups are to be avoided but drop in FPS is allowable if it is acceptable; >20FPS)
if slave joins WiFi, ESP-NOW sync should cease and resume if WiFi is lost again
ESP-NOW sync should follow packet retransmission set for UDP
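The one-minute scan limit has a direct consequence for dwell time and beacon interval. A rough budget, assuming a 13-channel sweep and ignoring packet loss and channel-switch time (both assumptions, as are the helper names):

```cpp
#include <cstdint>

constexpr uint32_t kChannels     = 13;     // assumed sweep over channels 1..13
constexpr uint32_t kScanBudgetMs = 60000;  // "scanning should not take more than a minute"

// Longest time a slave may stay on each channel and still finish a sweep in budget.
constexpr uint32_t maxDwellMs(uint32_t channels, uint32_t budgetMs) {
  return budgetMs / channels;
}

// A listen window is only guaranteed to overlap a beacon if it lasts
// at least one full beacon interval.
constexpr bool dwellCatchesBeacon(uint32_t dwellMs, uint32_t beaconIntervalMs) {
  return dwellMs >= beaconIntervalMs;
}

static_assert(maxDwellMs(kChannels, kScanBudgetMs) == 4615, "about 4.6 s per channel");
static_assert(!dwellCatchesBeacon(4615, 5000), "a 5 s beacon does not fit the budget");
static_assert(dwellCatchesBeacon(4615, 1000), "a 1 s beacon does");
static_assert(kChannels * 10000u > kScanBudgetMs, "10 s per channel needs 130 s");
```

Under these assumptions the original 5s beacon with a 10s dwell cannot meet the one-minute requirement, while a beacon every second with a dwell of a few seconds can.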

Jul 26 '24 07:07 blazoncek

One more note regarding ESP-NOW and device in STA mode.

As noted in Espressif documentation (FAQ) once device switches to STA mode it can no longer change channels (looks like even if unconnected as STA mode searches for a configured SSID across channels). WLED's WiFi connection logic periodically switches to STA mode and retries connecting (by default every 18s, with this PR 30s). This behaviour will break ESP-NOW locked channel and will cause re-scan for master.

There are two possibilities: a) do not configure WiFi b) configure WiFi (and temporary AP) but prolong reconnection attempts and use AP mode for ESP-NOW

Jul 28 '24 13:07 blazoncek

@SteveEisner @ChuckMash @DedeHai this code is now test-worthy.

To enable ESP-NOW sync (with WiFi configured) do:

configure slave units to open Temporary AP when no WiFi in range
enable ESP-NOW in WiFi settings and enter master's MAC address (you can use FFFFFFFFFFFF to allow any ESP to be master) on slave units
enable ESP-NOW sync in Sync settings
configure Sync groups so that master and slaves match
enable sync events (direct change, button press, IR remote, Alexa, etc) on master
start Sync on master
move slaves out of WiFi range and start them up
wait until temporary AP disappears and within 2 minutes sync should start
slave will remain unconnected from WiFi even if brought into range until master is no longer heard; in such case WiFi search is restarted and whole process starts again

Current implementation will not work if WiFi is un-configured.

Jul 29 '24 18:07 blazoncek

Quick update ... I spent an hour tonight working with two devices and was never able to get them to sync via ESPNow.

I'm sure this is due to my own errors. I tried to follow your directions for syncing, but I don't have the ability to move my devices out of the range of wifi [my wife is using the wifi, I can't turn it off], so I tried to go into the code and get it to work without having to do that.

I can attach to the master in debug mode, and see it sending its beacon on channel 11, which is the AP channel I set in WiFi config. But I can't get the remote to begin the channel search, even by hacking the code to bypass some of the if conditions. I tried for quite a while and apparently was never able to get it in exactly the right scenario.

Even though I wasn't able to test it properly, I've spent a lot of time in this code now. TBH, I think the conditions for enabling ESPnow modes vs. WiFi are too complex as written now. I'd like to suggest a new pattern. (The following is just my opinion)

First, we should move all ESPNow handling into a singleton-instance class in its own file, in the way that WiFi does. Its use of QuickEspNow can be an internal implementation detail (perhaps to be replaced with direct espnow calls later) and the global wled variables should move into that class as public instance members that can be inspected from outside if needed.
It has a function enable(bool) that tells it whether it should be running. For now this could even be a #define (if this code exists, it is running), because this is somewhat redundant with simply setting the MAC to 00..00. But it's nice to not wipe the mac address if all you want to do is turn off sync receiving. Configuration loads and saves this value instead of a global.
EspNow upon init always sets QuickEspNow's send and receive callbacks to its own methods - even if they're never going to be invoked - because there's no harm in setting some function pointers.
When it receives a callback message, it only handles it if enabled=true. When it handles it, it invokes a usermod manager method that prioritizes calling udpsync and wizmote first and then each usermod after that. If a usermod handler method returns an indicator that it consumed the message, it stops calling subsequent registered usermods.
It has a function listen(mac) that sets its internal master variable. It's always in one of two modes: either it believes it is the master (listen=00..00), or it is listening for the master (listen=anything else) and as per current code, if listen=FF...FF it'll obey any sync packets, etc. Config code calls this method instead of setting a global variable.
It has a send-message method that udp sync will -always- (for now) call if Sync is enabled on the device. EspNow itself will figure out whether or how it's going to actually broadcast the message (ie. if it's the master and enabled and the wifi mode is valid and so on). Later this same method can be used to broadcast audio-data messages.
It has an on_wifi_event(event) event handler method. Its parameters could be roughly modeled on the available IDF wifi events or just create a simplified enum. The three initialization locations WLED::initAP, WLED::initInterfaces and WLED::handleConnection should call that method and indicate what happened (WiFi turned on/off, AP activated/deactivated, etc). Based on that, it updates its internal espnow handling. Soon we can update the class to listen to real WiFi events and delete it entirely from those callsites.
It has a tick() method that's called in the main WLED loop. Depending on its internal state, it channel hops, pauses, etc. as per your diff's code.

The idea of this pattern is to make most other code unaware of all the details of ESPNow except for the one part that they care about. (UDP)Sync just knows that it's a sync engine, and tries to use it. Config just stores its values for it. Usermods just know that they'll sometimes get ESPnow packets and should handle them as a listener would. And the rest of the logic is all in a single place where it can be reviewed & updated.
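To make the proposal more concrete, here is a hedged skeleton of the receive side only: a singleton with `enable()` and a handler chain where the first handler that consumes a message stops the dispatch. The class name `EspNowManager`, the `Handler` signature and `addHandler` are illustrative guesses, and all radio and QuickEspNow calls are omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical skeleton; not existing WLED code.
class EspNowManager {
public:
  // A handler returns true if it consumed the message.
  using Handler = std::function<bool(const uint8_t* mac, const uint8_t* data, size_t len)>;

  static EspNowManager& instance() {
    static EspNowManager self;  // single shared instance
    return self;
  }

  void enable(bool on) { enabled_ = on; }
  bool enabled() const { return enabled_; }

  // Handlers run in registration order, so sync and remote handling
  // would be registered first and usermods after them.
  void addHandler(Handler h) { handlers_.push_back(std::move(h)); }

  // Would be wired to the library's receive callback.
  // Returns true if some handler consumed the message.
  bool onReceive(const uint8_t* mac, const uint8_t* data, size_t len) {
    if (!enabled_) return false;           // callbacks stay installed, messages are dropped
    for (auto& h : handlers_) {
      if (h(mac, data, len)) return true;  // stop at the first consumer
    }
    return false;
  }

private:
  EspNowManager() = default;
  bool enabled_ = false;
  std::vector<Handler> handlers_;
};
```

The remaining pieces of the proposal (`listen(mac)`, the send method, `on_wifi_event` and `tick()`) would hang off the same class, which keeps the channel-hopping state out of the rest of the code.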

if you want I can help write this....

Jul 31 '24 09:07 SteveEisner

@SteveEisner or anyone else trying to test the PR without moving devices out of WiFi range. Add this to initConnection() just before call to WiFi.begin(...);:

#ifdef WLED_DEBUG // or something else, could remove altogether
  if (millis() < 600000) // or any other value greater than 2*WLED_AP_TIMEOUT
    WiFi.begin("unknown");
  else
#endif

This will open a 10 min window during which the device will try to connect to the (most likely unavailable) "unknown" SSID. During this time the temporary AP will open and close. After the AP closes, the device will search for master until the 10 min pass; then it will try to connect to the configured SSID unless master has been found (so you can work with it as usual).

As for your proposal IDK if I understand it properly.

What I've learned during development is that channels will change without notice (unless connected to WiFi but even then a roaming device may connect to AP with different channel at any time without any notification). This will make ESP-NOW unreliable unless the channel is constantly adjusted.

Aug 07 '24 10:08 blazoncek

There seems to be an issue with WiFi re-connection after prolonged uptime. WLED will not reconnect to WiFi unless restarted if WiFi connection is lost for some time. I cannot find the cause.

Oct 28 '24 05:10 blazoncek

FYI I am continuing development of ESP-NOW sync in my own fork. I will not be updating this PR any further. The task is much more complex than initially thought as ESP-NOW and WiFi are more intertwined than what is available to read from documentation and implementing it will most likely require re-writing WiFi handling in WLED (in regards to multiple WiFi configuration).

Dec 16 '24 18:12 blazoncek

I wrote code for an ESPNow remote for WLED and noticed similar things. On top, there are issues in the core libraries: I started out with arduino, switched to IDF in expectation to gain more control but then had to switch back to Arduino as there are just too many undocumented and convoluted features, each one working on its own but if you start combining, everything starts to break or be erratic. I have it working now but I can imagine running ESPNow along wifi in STA and AP mode is a difficult task to balance, especially with the very limited and often vague documentation.

Dec 16 '24 18:12 DedeHai

Working in STA for slave is impossible as WiFi stack (if enabled to try to reconnect to SSID) will switch channels periodically when not able to connect. For master it is ok as it will switch channel only when roaming APs. Disabling auto-reconnect is IMO undesirable (at least for the purpose I want). Auto-reconnect should be as fast as possible without interfering with ESP-NOW sync and this is where things get complicated.

Working in AP mode works (kind of) but is not what I desire (as it defeats the purpose I need).

My fork has a working solution but may not be stable as I messed with WiFi reconnection logic substantially.

Dec 16 '24 19:12 blazoncek