thingsboard-edge [Question] ThingsBoard Edge PE disconnects from cloud

Component

ThingsBoard Edge PE

Description

I am using a ThingsBoard PE Perpetual license with ThingsBoard Cloud Maker. The problem is that the Edge status is shown as offline on the Cloud. These are the symptoms I have observed so far:

On the Edge, the status shows "connected".
On the Cloud, the Edge "active" status is "false".
The Edge is able to send telemetry data to the Cloud, which can be seen under Devices and Dashboards.
Updates made on the Cloud are not synchronized to the Edge. For example, if we update the Edge dashboard from the Cloud, the dashboard is not updated locally on the Edge.
The Cloud is unable to send RPC requests to the Edge to control Devices, while the Edge can send RPC requests to Devices directly.
When we try the "Sync Edge" option, the Cloud reports the error "Edge is not connected".
The Edge does not show its activity status on the Cloud unless we manually restart the ThingsBoard Edge instance.
The following failure message appears in the Edge PE Docker container logs:

tb-edge | 2023-05-18 10:16:52,867 [tb-rule-engine-consumer-47-thread-7 | QK(Main,TB_RULE_ENGINE,system)-2] INFO o.t.s.s.q.DefaultTbRuleEngineConsumerService - Failed to process [2] messages
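To see how often the rule engine reports failed batches, the container log can be filtered for that message. The following is a minimal sketch, assuming the log has been saved to a file first (for example with `docker compose logs tb-edge > tb-edge.log`; the service name `tb-edge` is inferred from the log prefix and may differ in other setups). The helper name `count_failed_batches` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count rule-engine "Failed to process" lines in a log.
# Reads the file given as the first argument, or stdin when none is given.
count_failed_batches() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the helper itself from reporting a failure.
    grep -c 'DefaultTbRuleEngineConsumerService - Failed to process' "$@" || true
}
```

For example, `count_failed_batches tb-edge.log` prints the number of matching lines in that file.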

Below are the screenshots of Edge activity status from Cloud and Edge.

Edge activity status from cloud
Edge activity status from Edge

Questions

Does anyone know about the possible causes of this issue?
How can we make sure to avoid this issue in a production/real-time environment?

Environment

OS: Ubuntu 22.04
ThingsBoard: 3.4.3PAAS
ThingsBoard Edge PE: 3.4.3EDGEPE
Docker Engine: V23.0.1
Docker Compose: v2.16.0

May 18 '23 09:05 akseerali

Hello @akseerali,

To fully understand the issue you're experiencing, we would need some additional information. Could you please provide the complete log from your ThingsBoard Edge container? Additionally, if you could attach your docker-compose.yml file, it would be very helpful.

This additional information is crucial because, without a comprehensive log analysis, determining the root cause of your problem is challenging. Thank you in advance for your cooperation!

May 22 '23 06:05 volodymyr-babak

Hi @volodymyr-babak

Please find attached the docker-compose configuration and edge log file.

tb-edge.log docker-compose.txt

May 22 '23 10:05 akseerali

Hi @volodymyr-babak, I have observed another issue that might be related to this problem. Today the Cloud is unable to send the RPC requests to Devices connected to Edge even though the edge is connected. The rule chain message shows "NO_ACTIVE_CONNECTION". Please see the screenshot below.

I have tried to unassign and then assign all the users to Edge, but the issue persists. Link

May 23 '23 19:05 akseerali

@akseerali

Could you please check whether an RPC Call event appears in the Downlinks tab of the Edge entity?

May 24 '23 12:05 volodymyr-babak

Hi @volodymyr-babak

No, it's not showing RPC call in the Downlinks section.

May 24 '23 12:05 akseerali

@akseerali

It seems like you're using the cloud version of ThingsBoard along with a ThingsBoard PE Edge license. As such, you should have access to our ThingsBoard Customer Portal, available at https://thingsboard-portal.atlassian.net/browse/CP.

As the troubleshooting of this issue may require additional private information from you, I would suggest continuing our investigation on this closed portal to ensure your data privacy.

Please note, if the root of the issue turns out to be a bug within our platform, we will ensure to update this GitHub ticket with that information. This way, our broader user community can also benefit from the findings of our investigation.

Looking forward to assisting you further on the ThingsBoard Customer Portal.

May 24 '23 12:05 volodymyr-babak

Thanks a lot @volodymyr-babak for the support. Our team will now continue on the Customer Portal. Please note that in the docker compose file attached earlier, I included the additional volumes section by mistake. Please find attached the docker-compose configuration actually used for the setup.

Extra configuration volumes: /media/iiotedge/sshd/tb-edge/.mytb-edge-logs:

docker compose updated.txt

May 24 '23 12:05 akseerali

@akseerali

Thank you for providing the updated docker-compose file and the previous logs. I've reviewed the information, but the root cause of the disconnection issue is not immediately clear to me.

However, it's possible that the disconnections may be related to an issue that we've recently addressed and fixed in our latest release: https://github.com/thingsboard/thingsboard/pull/8346

We just updated our cloud to the 3.5 release yesterday, and the 3.5 Edge version will be publicly available today. We'll also update the documentation on our website accordingly.

Once these updates are live, I would kindly ask you to upgrade your version to 3.5.0 and monitor the behavior. If my assumption is correct, this upgrade should resolve the disconnection issues and you should no longer see the disconnects in your logs.

Please let us know if you continue to experience problems after this update. We are committed to ensuring the smooth operation of our service for your needs.

May 24 '23 13:05 volodymyr-babak

Hi @volodymyr-babak,

We upgraded TB Edge to version 3.5; however, this did not resolve the issue of sending RPC requests from the Cloud to the Edge. After this, we re-assigned the Devices group to the Edge and it worked. I think the upgrade of the Edge instance also played its part, because I had already tried the same method with Edge version 3.4.3.

Regarding the disconnection/synchronization issue of edge with cloud, we'll continue to observe it for more days.

Many thanks.

May 26 '23 13:05 akseerali

Hi @volodymyr-babak,

The NO_ACTIVE_CONNECTION error on RPC calls to the Device appeared again when we tried to send server RPC requests to the Edge today. The issue cleared once again after re-assigning the Devices group to the Edge.

May 31 '23 09:05 akseerali

Hello @akseerali,

I appreciate your patience as we work to resolve your issue.

To aid in our troubleshooting, could you please verify whether you can observe the RPC Call event under the Downlinks tab of the Edge entity? I'm currently trying to ascertain whether the issue originates from the Edge or if it lies within the cloud's capability to send the RPC Call event to the Edge.

For further investigation, I'll be running my own Edge demo overnight in an attempt to replicate the issue locally. I'm currently hypothesizing that the problem might be associated with the device session timeout. After a certain period, the cloud may begin to send RPC requests under the assumption that the device is directly connected to the cloud and not interfacing via the Edge.

I will share my findings and any potential solutions as soon as I have more information. In the meantime, I encourage you to check for the RPC Call event, as mentioned earlier, and report any findings.

Thank you for your understanding, and I look forward to resolving this issue promptly.

May 31 '23 09:05 volodymyr-babak

Hi @volodymyr-babak,

Thanks for the information and your efforts. I have double-checked the Downlinks tab under the Edge details, and while the error was present no RPC Call event was shown there until the Devices group was re-assigned to the Edge instance. You may be right that the issue is related to the session.

Please let me know in case of any findings. Many thanks

May 31 '23 10:05 akseerali

Hi @akseerali,

I have a few clarifying questions that could help us diagnose this issue more effectively.

Firstly, do you have a single Edge entity in your system, or are there multiple ones? If there are multiple Edge entities, could you please verify if your device belongs to a group that is assigned exclusively to a single Edge entity? Additionally, it would be beneficial to ensure that this device doesn't belong to any other group that could potentially be assigned to another Edge.

These steps will help us isolate the problem more accurately. Looking forward to your response.

Jun 06 '23 12:06 volodymyr-babak

Hi @volodymyr-babak,

We have only one Edge entity in our system and the device is assigned only to this Edge. I have a few other observations regarding the error.

In our case, one Device is directly connected to the Edge. The RPC NO_ACTIVE_CONNECTION error appeared while the Device Profile of type Default was assigned to that Device. This is probably due to the session timeout.

I changed the Device Type to MQTT 2-3 days ago, and so far no RPC error has appeared. Please see the attached diagram of the system architecture. One more thing: this issue only appeared after the update of the Cloud version. I'll continue to observe it after these changes. Many thanks

Jun 06 '23 14:06 akseerali

Hello @volodymyr-babak,

Today the postgres container is showing an error after updating and upgrading some file in the Ubuntu system. Could you please mention how to clear this issue? I have restarted the container, but the issue persists. Please see the logs below.

PostgreSQL Database directory appears to contain a database; Skipping initialization

2023-06-21 12:43:05.832 IST [1] LOG: starting PostgreSQL 12.14 (Debian 12.14-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-06-21 12:43:05.832 IST [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-06-21 12:43:05.833 IST [1] LOG: listening on IPv6 address "::", port 5432
2023-06-21 12:43:05.834 IST [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-21 12:43:05.845 IST [27] LOG: database system shutdown was interrupted; last known up at 2023-06-21 11:35:41 IST
2023-06-21 12:43:05.939 IST [27] LOG: invalid primary checkpoint record
2023-06-21 12:43:05.939 IST [27] PANIC: could not locate a valid checkpoint record
2023-06-21 12:43:06.031 IST [1] LOG: startup process (PID 27) was terminated by signal 6: Aborted
2023-06-21 12:43:06.031 IST [1] LOG: aborting startup due to startup process failure
2023-06-21 12:43:06.032 IST [1] LOG: database system is shut down

Jun 21 '23 11:06 akseerali

Hello @akseerali,

Are these the complete logs for the PostgreSQL container? If not, could you please provide the full logs for a more comprehensive overview?

Additionally, could you clarify the exact steps you've undertaken when you refer to 'updating and upgrading some file in the Ubuntu system'? Providing these details will allow for a more accurate analysis and assist in identifying the issue at hand.

Thank you.

Jun 21 '23 11:06 volodymyr-babak

Hi @volodymyr-babak,

Please find attached the postgres container logs. postgres-container-logs.txt

I have noticed that an old postgres container (used for upgrading the PE Edge from 3.4 to 3.5) was somehow started. I have now stopped the container. Below are the commands used in Ubuntu system.

sudo apt-get update
sudo apt-get upgrade

Jun 21 '23 13:06 akseerali

Hi @volodymyr-babak,

Please let me know if the use of backup database can fix this issue. The backup was saved during the upgrade of Edge instance.

Jun 21 '23 20:06 akseerali

Hello @akseerali

According to the Postgres container logs, the checkpoint record is corrupted and Postgres cannot start because of this.

https://sysopspro.com/fix-postgresql-error-panic-could-not-locate-a-valid-checkpoint-record/

According to this article, you will need to log in to the Postgres container and reset the write-ahead log by executing the following command (pg_resetxlog was renamed to pg_resetwal in PostgreSQL 10, and your container runs 12.14):

pg_resetwal -f /path/to/pg/data/directory

Please try this and let me know your results.
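If the Postgres container keeps restarting, the reset can be attempted from a throwaway container that mounts the same data directory. The following is only a sketch under assumptions the thread does not confirm: the official `postgres:12` image and its default data path `/var/lib/postgresql/data`. The helper `pg_reset_cmd` is hypothetical and only prints the command so that it can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: print (not run) a docker command that calls pg_resetwal on a
# PostgreSQL data directory from a throwaway container.
# Assumptions, not confirmed by the thread: the official postgres:12 image
# and its default data path /var/lib/postgresql/data.
pg_reset_cmd() {
    host_dir=$1             # host directory that holds the data files
    mode=${2:-dry-run}      # "dry-run" only reports, "force" rewrites the WAL
    if [ "$mode" = "force" ]; then
        flag=-f
    else
        flag=-n
    fi
    # pg_resetwal refuses to run as root, hence "-u postgres".
    printf 'docker run --rm -u postgres -v %s:/var/lib/postgresql/data postgres:12 pg_resetwal %s /var/lib/postgresql/data\n' \
        "$host_dir" "$flag"
}
```

Calling `pg_reset_cmd "$HOME/.mytb-edge-data/db"` prints the dry-run variant; passing `force` as the second argument switches to `-f`. Because pg_resetwal discards write-ahead log data and is documented as a last resort, copying the data directory beforehand is advisable.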

Jun 22 '23 04:06 volodymyr-babak

Hi @volodymyr-babak

Since the Postgres container was restarting every few seconds, logging in to the container was not possible. I created a temporary container to reset the logs, following the steps in the figure below. This did not resolve the issue.

From this topic, I have found a way to reset the Postgres database log file in a docker container. Please see the steps below.

The above method cleared the log error, but now there are some other errors observed in the Postgres container. The Edge is also not working properly. Please see the attached Edge and Postgres container logs. edge-logs.txt postgres-logs.txt

I think there is an issue with the database. Please let me know whether I can simply use the previous backup or create a new database to clear the issue. The PE Edge is newly deployed, so losing the old data is not a problem. Thanks

Jun 23 '23 00:06 akseerali

Hello @akseerali

Indeed, this looks like a database issue, and some files or permissions are corrupted. Could you tell me how you backed up your database before upgrading? What was the command? What does your backup look like in terms of folders, and what is inside those folders? Thanks.

Jun 23 '23 05:06 volodymyr-babak

Hi @volodymyr-babak

I followed these instructions to back up the database, using the command below.

sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP

Below is the screenshot of the database folder.

Jun 23 '23 07:06 akseerali

Thanks for the information.

In this case, you can try the following:

Back up your current broken folder, just in case:

sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP-BROKEN

Remove your current data folder:

sudo rm -rf ~/.mytb-edge-data/db

Copy your previous backup into the data folder:

sudo cp -r ~/.mytb-edge-db-BACKUP ~/.mytb-edge-data/db

Modify your docker-compose.yml and set the Edge version to the one that worked with the backup folder before the update, then recreate the containers:

docker compose stop
docker compose rm
docker compose up -d
docker compose logs

Once you have completed these steps, please let me know the results. Please be careful during these steps not to remove the working backup that is currently in place.
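The copy and remove steps above can be wrapped in one function that refuses to proceed when the working backup is missing, which guards against the mistake mentioned in the warning. This is a minimal sketch; the function name `restore_db_backup` is hypothetical, and it assumes the containers are stopped and the caller has sufficient privileges (the thread uses sudo).

```shell
#!/bin/sh
# Sketch of the restore steps as one function (hypothetical name).
# Arguments: data dir, known-good backup dir, destination for the broken copy.
restore_db_backup() {
    data_dir=$1
    good_backup=$2
    broken_copy=$3

    # Refuse to continue if the known-good backup is missing or empty.
    if [ ! -d "$good_backup" ] || [ -z "$(ls -A "$good_backup")" ]; then
        echo "backup $good_backup is missing or empty; aborting" >&2
        return 1
    fi
    # Never overwrite an earlier copy of the broken folder.
    if [ -e "$broken_copy" ]; then
        echo "$broken_copy already exists; aborting" >&2
        return 1
    fi

    cp -r "$data_dir" "$broken_copy" || return 1   # keep the broken state
    rm -rf "$data_dir" || return 1                 # remove the current data folder
    cp -r "$good_backup" "$data_dir"               # put the backup in its place
}
```

With the paths used in this thread, the call would be `restore_db_backup ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP ~/.mytb-edge-db-BACKUP-BROKEN`. The known-good backup is only read, never modified.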

Jun 23 '23 07:06 volodymyr-babak

Hi @volodymyr-babak

Thanks for the detailed steps. Using the backup database solves the error; however, when I upgrade the Edge from 3.4.3EDGEPE to 3.5.0EDGEPE or 3.5.1EDGEPE, the upgrade process fails with an error. Please find attached the Edge container logs from my attempt to upgrade from 3.4.3EDGEPE to 3.5.0EDGEPE. tbedge upgrade logs.txt

With version 3.4.3EDGEPE, the instance runs as it did before the upgrade. I think the only stable way forward now is to start with a new database and use the latest EDGEPE version.

Jun 23 '23 14:06 akseerali

Hello @akseerali,

Based on the logs, it seems the system is not upgrading from version 3.4.3 to 3.5.0 as expected. Could you please check the contents of the following file in the edge container: /data/.upgradeversion

If it's not set to 3.4.3, please adjust it to reflect 3.4.3 and initiate the upgrade procedure following the steps provided here: https://thingsboard.io/docs/user-guide/install/pe/edge/upgrade-instructions/#docker-linux-mac-35
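The check described above can be scripted. The following is a minimal sketch with a hypothetical helper name, `ensure_upgrade_version`, and it assumes the file holds nothing but the version string.

```shell
#!/bin/sh
# Hypothetical helper: make sure the upgrade marker file holds the version
# the database is actually at. Assumes the file contains only the version.
ensure_upgrade_version() {
    file=$1
    expected=$2
    if [ ! -f "$file" ]; then
        echo "$file not found" >&2
        return 1
    fi
    # Strip whitespace so a trailing newline does not affect the comparison.
    current=$(tr -d '[:space:]' < "$file")
    if [ "$current" = "$expected" ]; then
        echo "unchanged: $expected"
    else
        printf '%s\n' "$expected" > "$file"
        echo "changed: '$current' -> '$expected'"
    fi
}
```

Inside the Edge container the call would be `ensure_upgrade_version /data/.upgradeversion 3.4.3`. On the host, the file sits in whichever directory the compose file mounts at `/data`; that mapping is not shown in this thread, so the host path has to be checked against your own docker-compose.yml.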

Jun 26 '23 10:06 volodymyr-babak

Hi @volodymyr-babak

Thanks. After changing 3.5.0 to 3.4.3 in the /data/.upgradeversion file inside the Edge container, the Edge was finally upgraded to the new version. The new version also solves the Edge connectivity problem, so I am closing this issue.

Thanks again

Jun 26 '23 13:06 akseerali

Hi @volodymyr-babak

A TB Edge PE synchronization issue was observed on 01/07/2023.

On the Cloud, the Edge is shown as connected, but no updates appear in the Downlinks section. I have also tried to send RPC requests, but no message shows up in this section.
Furthermore, three Devices are sending data from the Edge to the Cloud; however, only one Device is shown as Active while the other two are shown as Inactive, even though they are also sending telemetry data to the Cloud.

On the Edge, the "status" section shows "connected", but a widget that monitors the Edge connection on an Edge dashboard shows "disconnected". This is probably because the Cloud is not sending any updates to the Edge instance.
After I restarted the Edge Docker container, the Edge connectivity issue was resolved; however, I was still seeing only one Device active on the Cloud, and there was no RPC response either. To resolve this, I had to re-assign the Devices group from the Cloud and restart the Edge container again.

Question: How can we avoid this kind of issue in a production environment in the future? Please find attached the Edge container logs. Thanks

Edge version: 3.5.1EDGEPE

tb-edge.log
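Before reading a long container log line by line, it can help to summarize which loggers emitted WARN or ERROR entries. The following is a sketch with a hypothetical helper name, `summarize_problems`, assuming the line layout seen in the Edge log excerpt earlier in this thread (timestamp, thread in square brackets, level, logger, then " - " and the message).

```shell
#!/bin/sh
# Hypothetical helper: count WARN and ERROR entries per logger.
# Reads the files given as arguments, or stdin when none are given.
summarize_problems() {
    # 1. keep only lines whose level (right after the "[thread]" part) is WARN or ERROR
    # 2. reduce each line to "LEVEL logger"
    # 3. count identical pairs and normalize the spacing of the counts
    grep -E '\] +(WARN|ERROR) ' "$@" \
        | sed -E 's/.*\] +(WARN|ERROR) +([^ ]+) - .*/\1 \2/' \
        | sort | uniq -c \
        | awk '{print $1, $2, $3}'
}
```

Running `summarize_problems tb-edge.log` prints one line per level and logger pair, with the count first, which shows quickly where the failures cluster.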

Jul 01 '23 19:07 akseerali

Please note that, after some time, TB Cloud is again showing only one Device as active despite receiving telemetry data from the other Devices through the Edge instance.

Jul 01 '23 19:07 akseerali

Hey @akseerali ,

I noticed errors in the logs that could point to a major communication bug in the most recent release:

2023-07-01 17:00:08,479 [cloud-manager-71-thread-1] INFO  o.t.s.s.cloud.CloudManagerService - Resetting seqIdOffset - new cycle started
2023-07-01 17:00:08,482 [cloud-manager-71-thread-1] WARN  o.t.s.s.cloud.CloudManagerService - Failed to process messages handling!
java.lang.IndexOutOfBoundsException: Index: -1
	at java.base/java.util.Collections$EmptyList.get(Collections.java:4483)

and

2023-07-01 16:58:46,293 [grpc-default-executor-66] ERROR o.t.s.s.cloud.CloudManagerService - [dd8f4df7-bcbe-b548-70f5-bc2b400fb8a4] Msg processing failed! Error msg: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition.


I plan to investigate these issues and prepare a hotfix for the 3.5.1 release. I'll update this ticket as soon as the hotfix is ready. My goal is to have the hotfix released by tomorrow.

Jul 03 '23 13:07 volodymyr-babak

The hotfix for the Community Edition, CE 3.5.1.1, has been completed and released. You can find it at this link: https://github.com/thingsboard/thingsboard-edge/releases/tag/v3.5.1.1

The specific commit that addresses the IndexOutOfBoundsException issue can be found here: https://github.com/thingsboard/thingsboard-edge/commit/db947eccd63e4b9d498d37c213c95d9f73a2124c

The Professional Edition hotfix, PE 3.5.1.1, is on its way and will be available soon.

Jul 04 '23 12:07 volodymyr-babak
