esp-mesh-lite

Mesh nodes stuck on channel(s) stopping mesh fusion from happening (AEGHB-1059)

Open BR-Coding-cmd opened this issue 7 months ago • 7 comments

Checklist

  • [x] Checked the issue tracker for similar issues to ensure this is not a duplicate.
  • [x] Provided a clear description of your suggestion.
  • [x] Included any relevant context or examples.

Issue or Suggestion Description

Using IDF v5.3.2 and mesh_lite v1.0.2.

Using the local_control example.

I'm expecting that after changing from the default config of mesh_lite to:

    esp_mesh_lite_config_t mesh_lite_config = ESP_MESH_LITE_DEFAULT_INIT();
    esp_mesh_lite_init(&mesh_lite_config);
    mesh_lite_config.join_mesh_without_configured_wifi = true;
    mesh_lite_config.max_level = 4;
    mesh_lite_config.max_node_number = 0;
    mesh_lite_config.max_connect_number = 6;
    mesh_lite_config.max_router_number = 1;
    mesh_lite_config.mesh_id = 42;
    mesh_lite_config.device_category = "esp32s3_mesh_node";
    esp_mesh_lite_fusion_config_t fusion_config = {
        .fusion_frequency_sec = 60,
        .fusion_rssi_threshold = -90,
    };
    esp_mesh_lite_set_fusion_config(&fusion_config);

    esp_mesh_lite_core_log_enable(true);

    esp_mesh_lite_start();

I should see only 1 master, and any orphan node (a node without a master/mesh) should start the fusion process 60 seconds after gaining an IP from the AP.

What I currently see is 2 of 4 nodes connected in a mesh and 2 trace master nodes that are constantly going through the fusion process. The orphaned nodes print the Test Log 2882 and send join_me_request, followed by ...fusion_time_stop and then ...fusion_time_start, which shows that they are trying to join but failing; the log output shows that they never connect.

Any suggestions?

Thanks, BR

BR-Coding-cmd avatar Apr 24 '25 11:04 BR-Coding-cmd

Update:

I've reprogrammed the boards and found that 3 of the 4 nodes now connect.

The final node is stuck on channel 11, the rest of the nodes are on channel 6. My use case ideally cannot have the channel locked.

If this were to happen in the field where turning it off and on again is not possible, how will the stuck node ever join the mesh?

BR-Coding-cmd avatar Apr 24 '25 12:04 BR-Coding-cmd

Update:

After leaving the 4 nodes active for approximately 40 minutes, the master had a sys_evt overflow.

Image

Edit: the same issue was seen on a child node at startup.

BR-Coding-cmd avatar Apr 24 '25 12:04 BR-Coding-cmd

For those that want to look through the logs, I've popped them below.

orphaned_node.txt meshed_node_layer-1.txt meshed_node_layer-2.txt meshed_node_layer-2-2.txt

BR-Coding-cmd avatar Apr 24 '25 13:04 BR-Coding-cmd

> after leaving the 4 nodes active for approximately 40 minutes, the master had a sys_evt overflow

I think you can change the stack size here.

CONFIG_ESP_SYSTEM_EVENT_TASK_STACK_SIZE=2304

nopnop2002 avatar May 11 '25 22:05 nopnop2002

I wouldn't have expected to need to alter the task stack size in a basic example, but I have since changed the stack size and, you're correct, I haven't seen that error since.

My remaining worry is nodes being abandoned on different channels. My only suggestion here is to perhaps restart the scan over the Wi-Fi channels, or to keep a variable in the background that tracks how many nodes should be in the mesh.
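The second idea could look roughly like the sketch below. It is only an illustration under assumptions: the application is assumed to know how many nodes the mesh should contain and to have some way of counting the nodes it currently sees. All names (`mesh_watchdog_t`, `mesh_watchdog_tick`) are hypothetical and are not part of esp-mesh-lite; how the count is obtained and how a channel rescan is actually triggered are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state; none of these names come from esp-mesh-lite. */
typedef struct {
    uint32_t expected_nodes;   /* how many nodes should be in the mesh */
    uint32_t grace_period_sec; /* how long a shortfall is tolerated */
    uint32_t shortfall_sec;    /* how long the mesh has been short of nodes */
} mesh_watchdog_t;

/*
 * Call periodically with the number of nodes currently seen and the time
 * elapsed since the previous call. Returns true when the mesh has been
 * short of nodes for at least the grace period, meaning the caller should
 * restart a scan over all Wi-Fi channels (that action is not shown here).
 */
static bool mesh_watchdog_tick(mesh_watchdog_t *wd,
                               uint32_t seen_nodes,
                               uint32_t elapsed_sec)
{
    if (seen_nodes >= wd->expected_nodes) {
        wd->shortfall_sec = 0; /* mesh is complete, nothing to do */
        return false;
    }

    wd->shortfall_sec += elapsed_sec;
    if (wd->shortfall_sec < wd->grace_period_sec) {
        return false; /* still within the tolerated window */
    }

    wd->shortfall_sec = 0; /* re-arm so a rescan is not requested on every tick */
    return true;
}
```

With `expected_nodes = 4` and `grace_period_sec = 120`, calling this once a minute would request a rescan after the mesh has been missing a node for two consecutive minutes, then wait another two minutes before asking again.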

BR-Coding-cmd avatar May 12 '25 07:05 BR-Coding-cmd

Is the SSID hidden for the device in orphaned_node.txt?

The root cause of this issue is: orphaned_node, as a root node device with a stronger RSSI, wants meshed_node_layer-1 to join it. meshed_node_layer-1 detected orphaned_node but may have failed to retrieve its SSID, resulting in the final connection failure.

tswen avatar May 12 '25 07:05 tswen

> Is the SSID hidden for the device in orphaned_node.txt?

Only in the text file. The SSID and password are the same across all 4 devices, and the SSID is not hidden via config.

All nodes were turned on at the same time, as would happen in the field, where power-cycling the devices isn't an option (or should be the last resort) should this happen during operation.

All devices were within 1 m of each other and almost equidistant from the AP, so it confuses me that the node chose the AP over the existing mesh network. Then again, a node 1 m away could have an RSSI of -41 dBm and another the same distance away could have -45 dBm.

If the issue revolves around a "failed to retrieve packet", would the best solution be to improve the hardware antenna in an effort to boost the signal? For this test, an ESP32-S3 dev kit with a PCB antenna was used.

BR-Coding-cmd avatar May 12 '25 07:05 BR-Coding-cmd

Can this issue be stably reproduced?

tswen avatar May 23 '25 09:05 tswen

@BR-Coding-cmd

ESP-MESH-LITE reads some information from NVS and uses it to build the network. When the local_control example is executed, a log like this is recorded.

E (812) NVS: Failed to read IP info from NVS

E (1139) [vendor_ie]: Error Get!
W (1141) [vendor_ie]: Mesh ID is not saved in flash
I (1146) [vendor_ie]: Mesh ID: 77
W (1149) [vendor_ie]: Error Get[4354]
W (1152) [vendor_ie]: Error Get[4354]
W (1156) wifi:Haven't to connect to a suitable AP now!

E (5637) [ESP_Mesh_Lite_Comm]: Error Get!
W (5640) [ESP_Mesh_Lite_Comm]: argot is not saved in flash

So, try erasing the entire flash.

idf.py erase-flash

nopnop2002 avatar May 23 '25 09:05 nopnop2002

> Can this issue be stably reproduced?

I've not been able to reproduce it recently or gather logs.

Thread now stale; to be revisited when logs can be gathered.

BR-Coding-cmd avatar Jun 03 '25 09:06 BR-Coding-cmd