thingsboard-edge
[Question] ThingsBoard Edge PE disconnects from cloud
Component
- ThingsBoard Edge PE
Description
I am using a ThingsBoard PE Perpetual license with ThingsBoard Cloud Maker. The issue is that the Edge status is shown as offline on the Cloud. Below are the symptoms I have observed so far:
- On Edge, the status is showing "connected".
- On Cloud, Edge "active" status is "false".
- Edge is able to send telemetry data to Cloud which can be seen under Devices and Dashboards.
- Any update done on Cloud is not synchronizing with Edge. For example, if we update the Edge dashboard from Cloud, it will not update dashboard locally at Edge.
- Cloud is unable to send RPC requests to Edge to control Devices, while Edge can send RPC requests to Devices directly.
- When we try to use "Sync Edge" option, it gives the error at Cloud with message "Edge is not connected".
- Edge does not show activity status at Cloud unless we manually restart the ThingsBoard Edge instance.
- A failure message is observed in Edge PE Docker Container logs mentioned below.
tb-edge | 2023-05-18 10:16:52,867 [tb-rule-engine-consumer-47-thread-7 | QK(Main,TB_RULE_ENGINE,system)-2] INFO o.t.s.s.q.DefaultTbRuleEngineConsumerService - Failed to process [2] messages
Below are the screenshots of Edge activity status from Cloud and Edge.
- Edge activity status from cloud
- Edge activity status from Edge
Questions
- Does anyone know about the possible causes of this issue?
- How can we make sure to avoid this issue in a production/real-time environment?
Environment
- OS: Ubuntu 22.04
- ThingsBoard: 3.4.3PAAS
- ThingsBoard Edge PE: 3.4.3EDGEPE
- Docker Engine: V23.0.1
- Docker Compose: v2.16.0
Hello @akseerali,
To fully understand the issue you're experiencing, we would need some additional information. Could you please provide the complete log from your ThingsBoard Edge container? Additionally, if you could attach your docker-compose.yml file, it would be very helpful.
This additional information is crucial because, without a comprehensive log analysis, determining the root cause of your problem is challenging. Thank you in advance for your cooperation!
Hi @volodymyr-babak
Please find attached the docker-compose configuration and edge log file.
Hi @volodymyr-babak, I have observed another issue that might be related to this problem. Today the Cloud is unable to send the RPC requests to Devices connected to Edge even though the edge is connected. The rule chain message shows "NO_ACTIVE_CONNECTION". Please see the screenshot below.
I have tried to unassign and then assign all the users to Edge, but the issue persists. Link
@akseerali
could you please check if you see RPC Call event in the Downlinks tab of the edge entity:
Hi @volodymyr-babak
No, it's not showing RPC call in the Downlinks section.
@akseerali
It seems like you're using the cloud version of ThingsBoard along with a ThingsBoard PE Edge license. As such, you should have access to our ThingsBoard Customer Portal, available at https://thingsboard-portal.atlassian.net/browse/CP.
As the troubleshooting of this issue may require additional private information from you, I would suggest continuing our investigation on this closed portal to ensure your data privacy.
Please note, if the root of the issue turns out to be a bug within our platform, we will ensure to update this GitHub ticket with that information. This way, our broader user community can also benefit from the findings of our investigation.
Looking forward to assisting you further on the ThingsBoard Customer Portal.
Thanks a lot @volodymyr-babak for the support. Our team will now go with the Customer Portal. Please note that in the docker compose file attached in the link, I included the additional volumes part by mistake. Please find attached the docker-compose file configuration used for the setup.
Extra configuration
volumes: /media/iiotedge/sshd/tb-edge/.mytb-edge-logs:
@akseerali
Thank you for providing the updated docker-compose file and the previous logs. I've reviewed the information, but the root cause of the disconnection issue is not immediately clear to me.
However, it's possible that the disconnections may be related to an issue that we've recently addressed and fixed in our latest release: https://github.com/thingsboard/thingsboard/pull/8346
We just updated our cloud to the 3.5 release yesterday, and the 3.5 Edge version will be publicly available today. We'll also update the documentation on our website accordingly.
Once these updates are live, I would kindly ask you to upgrade your version to 3.5.0 and monitor the behavior. If my assumption is correct, this upgrade should resolve the disconnection issues and you should no longer see the disconnects in your logs.
Please let us know if you continue to experience problems after this update. We are committed to ensuring the smooth operation of our service for your needs.
Hi @volodymyr-babak,
We upgraded the TB Edge to version 3.5; however, this did not resolve the issue of sending the RPC request to Edge from Cloud. After this, we re-assigned the Devices group to edge and it worked. I think the upgrade of Edge instance also played its part because I had tried the same method with Edge version 3.4.3.
Regarding the disconnection/synchronization issue of edge with cloud, we'll continue to observe it for more days.
Many thanks.
Hi @volodymyr-babak,
The NO_ACTIVE_CONNECTION RPC call to Device error appeared again when we tried to send the server RPC requests to Edge today. The issue is once again cleared after re-assigning the Devices group to Edge.
Hello @akseerali,
I appreciate your patience as we work to resolve your issue.
To aid in our troubleshooting, could you please verify whether you can observe the RPC Call event under the Downlinks tab of the Edge entity? I'm currently trying to ascertain whether the issue originates from the Edge or if it lies within the cloud's capability to send the RPC Call event to the Edge.
For further investigation, I'll be running my own Edge demo overnight in an attempt to replicate the issue locally. I'm currently hypothesizing that the problem might be associated with the device session timeout. After a certain period, the cloud may begin to send RPC requests under the assumption that the device is directly connected to the cloud and not interfacing via the Edge.
I will share my findings and any potential solutions as soon as I have more information. In the meantime, I encourage you to check for the RPC Call event, as mentioned earlier, and report any findings.
Thank you for your understanding, and I look forward to resolving this issue promptly.
Hi @volodymyr-babak,
Thanks for the information and efforts. I have double-checked the Downlinks tab under the Edge details option, and no RPC Call Event action was observed due to this error until the Devices group was re-assigned to Edge instance. You may be right, the issue can be related with session.
Please let me know in case of any findings. Many thanks
Hi @akseerali,
I have a few clarifying questions that could help us diagnose this issue more effectively.
Firstly, do you have a single Edge entity in your system, or are there multiple ones? If there are multiple Edge entities, could you please verify if your device belongs to a group that is assigned exclusively to a single Edge entity? Additionally, it would be beneficial to ensure that this device doesn't belong to any other group that could potentially be assigned to another Edge.
These steps will help us isolate the problem more accurately. Looking forward to your response.
Hi @volodymyr-babak,
We have only one edge entity in our system and the device is only assigned to this edge. I have a few other observations regarding the error.
In our case, one Device is directly connected to Edge. The RPC NO_ACTIVE_CONNECTION error appeared when the Device Profile of type Default was assigned to that Device. This is probably due to the session timeout.
I changed the Device Type to MQTT 2-3 days ago and so far no RPC error has appeared. Please see the attached diagram of the system architecture. One more thing: this issue only appeared after the update of the Cloud version. I'll continue to observe it after the changes. Many thanks
Hello @volodymyr-babak,
Today the postgres container is showing an error after updating and upgrading some file in the Ubuntu system. Could you please mention how to clear this issue? I have restarted the container, but the issue persists. Please see the logs below.
`PostgreSQL Database directory appears to contain a database; Skipping initialization
2023-06-21 12:43:05.832 IST [1] LOG: starting PostgreSQL 12.14 (Debian 12.14-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-06-21 12:43:05.832 IST [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-06-21 12:43:05.833 IST [1] LOG: listening on IPv6 address "::", port 5432
2023-06-21 12:43:05.834 IST [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-21 12:43:05.845 IST [27] LOG: database system shutdown was interrupted; last known up at 2023-06-21 11:35:41 IST
2023-06-21 12:43:05.939 IST [27] LOG: invalid primary checkpoint record
2023-06-21 12:43:05.939 IST [27] PANIC: could not locate a valid checkpoint record
2023-06-21 12:43:06.031 IST [1] LOG: startup process (PID 27) was terminated by signal 6: Aborted
2023-06-21 12:43:06.031 IST [1] LOG: aborting startup due to startup process failure
2023-06-21 12:43:06.032 IST [1] LOG: database system is shut down
iiotedge@OptiPlex-5060-C125:~$
`
Hello @akseerali,
Are these the complete logs for the PostgreSQL container? If not, could you please provide the full logs for a more comprehensive overview?
Additionally, could you clarify the exact steps you've undertaken when you refer to 'updating and upgrading some file in the Ubuntu system'? Providing these details will allow for a more accurate analysis and assist in identifying the issue at hand.
Thank you.
Hi @volodymyr-babak,
Please find attached the postgres container logs. postgres-container-logs.txt
I have noticed that an old postgres container (used for upgrading the PE Edge from 3.4 to 3.5) was somehow started. I have now stopped the container. Below are the commands used in Ubuntu system.
sudo apt-get update
sudo apt-get upgrade
Hi @volodymyr-babak,
Please let me know if the use of backup database can fix this issue. The backup was saved during the upgrade of Edge instance.
Hello @akseerali
According to the Postgres container logs, the checkpoint record is corrupted and Postgres is unable to start because of this.
https://sysopspro.com/fix-postgresql-error-panic-could-not-locate-a-valid-checkpoint-record/
According to this article, you will need to log into the Postgres container and reset the log file by executing the command below (on PostgreSQL 10 and later, including the 12.14 shown in your logs, the tool is named pg_resetwal instead of pg_resetxlog):
/usr/bin/pg_resetxlog -f /path/to/pg/data/directory
Please try this and let me know your results.
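When the Postgres container keeps restarting, one option is to run the reset from a one-off container that mounts the same data directory. The sketch below only prints such a command so it can be reviewed before running. It rests on several assumptions that are not confirmed in this thread: the host path `~/.mytb-edge-data/db` is mounted at `/var/lib/postgresql/data`, the image tag is `postgres:12`, and the function name `build_reset_cmd` is purely illustrative. Since `pg_resetwal -f` can discard recent transactions, the Postgres service should be stopped and the data folder copied before trying it.

```shell
#!/usr/bin/env bash
# Sketch: print (do not execute) a one-off docker command that resets the WAL.
# pg_resetwal is the PostgreSQL 10+ name of pg_resetxlog.

build_reset_cmd() {
  local host_db_dir="$1"            # host folder holding the Postgres data (assumed)
  local image="${2:-postgres:12}"   # assumed tag; match your docker-compose.yml
  local pgdata="/var/lib/postgresql/data"   # assumed mount point inside the container

  # --user postgres: pg_resetwal refuses to run as root.
  printf 'docker run --rm --user postgres -v %s:%s %s pg_resetwal -f %s\n' \
    "$host_db_dir" "$pgdata" "$image" "$pgdata"
}

# Review the printed command, then run it manually if it matches your setup.
build_reset_cmd "$HOME/.mytb-edge-data/db"
```

Printing instead of executing is a deliberate choice here, because a forced WAL reset is destructive and the paths differ between installations.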
Hi @volodymyr-babak
Since the postgres container was restarting every few seconds, logging into the container was not possible.
I created a temporary container to reset the logs, following the steps in the figure below. This did not resolve the issue.
From this topic, I have found a way to reset the Postgres database log file in a docker container. Please see the steps below.
The above method cleared the log error, but now there are some other errors observed in the Postgres container. The Edge is also not working properly. Please see the attached Edge and Postgres container logs. edge-logs.txt postgres-logs.txt
I think there is an issue with database. Please let me know if I can just use previous backup or create a new database to clear the issue. The PE Edge is newly deployed, so the old data is not an issue. Thanks
Hello @akseerali
Indeed, this looks like a database issue, and some files or permissions are corrupted. Please let me know how you backed up your database before upgrading. What was the command? What does your backup look like in terms of folders, and what is inside those folders? Thanks.
Hi @volodymyr-babak
I have followed these instructions to backup the database and the command used is mentioned below.
sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP
Below is the screenshot of database folder.
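A folder copy like the one above can be checked right after it is made. The following is a minimal sketch, not part of the official instructions: `backup_db` is a hypothetical helper that copies the folder and then compares source and copy with `diff -r`. It assumes the containers are stopped while copying, since copying a live PostgreSQL data directory can produce an inconsistent backup, and it may need to be run with `sudo` as in the command above.

```shell
#!/usr/bin/env bash
# Sketch: copy a database folder and verify the copy before upgrading.

# backup_db SRC DEST
backup_db() {
  local src="$1" dest="$2"

  [ -d "$src" ] || { echo "source not found: $src" >&2; return 1; }
  # Never overwrite an existing backup.
  [ ! -e "$dest" ] || { echo "destination already exists: $dest" >&2; return 1; }

  cp -r "$src" "$dest"

  # Compare file contents recursively; any difference means the copy is suspect.
  if diff -r "$src" "$dest" >/dev/null; then
    echo "backup verified: $dest"
  else
    echo "backup differs from source: $dest" >&2
    return 1
  fi
}

# Example (paths taken from the thread):
# backup_db ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP
```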
Thanks for the provided information.
In this case, you can try the following:
- Back up your current broken folder, just in case:
sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP-BROKEN
- Remove your current data folder:
sudo rm -rf ~/.mytb-edge-data/db
- Copy your previous backup into the data folder:
sudo cp -r ~/.mytb-edge-db-BACKUP ~/.mytb-edge-data/db
- Modify your docker-compose.yml and set the Edge version to the one that worked with the backup folder before the update.
- docker compose stop
- docker compose rm
- docker compose up -d
- docker compose logs
Once you have completed these steps, please let me know the results. Please be careful during these steps not to remove the working backup that is currently in place.
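The three folder operations above can be wrapped in one guarded function so that the working backup is never touched and an earlier broken copy is never overwritten. This is only a sketch: `restore_db` is a hypothetical helper, it assumes the containers are stopped before it runs, and it may need `sudo` as in the commands above. The docker compose steps are left as manual commands.

```shell
#!/usr/bin/env bash
# Sketch: keep a copy of the broken data folder, then restore a known-good backup.

# restore_db DATA_DIR BACKUP_DIR
restore_db() {
  local data_dir="$1" backup_dir="$2"
  local broken_copy="${backup_dir}-BROKEN"   # e.g. ~/.mytb-edge-db-BACKUP-BROKEN

  [ -d "$backup_dir" ] || { echo "backup not found: $backup_dir" >&2; return 1; }
  # Refuse to clobber a broken copy saved by a previous attempt.
  [ ! -e "$broken_copy" ] || { echo "already exists: $broken_copy" >&2; return 1; }

  if [ -d "$data_dir" ]; then
    cp -r "$data_dir" "$broken_copy"   # step 1: keep the broken folder, just in case
    rm -rf "$data_dir"                 # step 2: remove the current data folder
  fi
  cp -r "$backup_dir" "$data_dir"      # step 3: copy the backup into place
}

# Example (paths taken from the thread):
# restore_db ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP
# Then: set the Edge image version in docker-compose.yml and run
#   docker compose stop && docker compose rm && docker compose up -d
```

The backup itself is only ever read, which matches the warning above about not removing the working backup.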
Hi @volodymyr-babak
Thanks for the detailed steps. Using the backup database solves the error; however, when I upgrade the edge from 3.4.3EDGEPE to 3.5.0EDGEPE or 3.5.1EDGEPE, the upgrade process shows an error. Please find attached the edge container logs from when I tried to upgrade from 3.4.3EDGEPE to 3.5.0EDGEPE. tbedge upgrade logs.txt
With 3.4.3EDGEPE version, the instance is running like pre-upgrade time. I think the only stable way now is to use a new database and use the latest EDGEPE version.
Hello @akseerali,
Based on the logs, it seems the system is not upgrading from version 3.4.3 to 3.5.0 as expected. Could you please check the contents of the following file in the edge container: /data/.upgradeversion
If it's not set to 3.4.3, please adjust it to reflect 3.4.3 and initiate the upgrade procedure following the steps provided here: https://thingsboard.io/docs/user-guide/install/pe/edge/upgrade-instructions/#docker-linux-mac-35
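The check described above can be scripted. The sketch below assumes that the file contains nothing but the version string (which the thread suggests but does not state outright) and that the container's `/data` directory is reachable from the host, for example through a bind mount such as `~/.mytb-edge-data`; both the host path and the helper name `ensure_upgrade_version` are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: make sure the upgrade marker file holds the version being upgraded FROM.

# ensure_upgrade_version FILE EXPECTED_VERSION
ensure_upgrade_version() {
  local file="$1" expected="$2" current

  [ -f "$file" ] || { echo "file not found: $file" >&2; return 1; }

  # Strip whitespace and newlines so "3.4.3\n" compares equal to "3.4.3".
  current="$(tr -d '[:space:]' < "$file")"

  if [ "$current" = "$expected" ]; then
    echo "ok: $file already set to $expected"
  else
    printf '%s\n' "$expected" > "$file"
    echo "changed: '$current' -> '$expected'"
  fi
}

# Example (host path is an assumption about the volume mapping):
# ensure_upgrade_version ~/.mytb-edge-data/.upgradeversion 3.4.3
```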
Hi @volodymyr-babak
Thanks. After changing 3.5.0 to 3.4.3 in the /data/.upgradeversion file inside the Edge container, edge is finally upgraded with new version. The new version also solves the edge connectivity problem, so I am closing this issue.
Thanks again
Hi @volodymyr-babak
A TB Edge PE synchronization issue is observed on 01/07/2023.
- At cloud, the edge is shown as connected, but no updates appear in the Downlinks section. I have also tried to send RPC requests, but no message shows up in this section.
- Furthermore, three Devices are sending data from the edge to cloud; however, only one Device is shown as Active while the other two are shown as Inactive, even though these Devices are also sending telemetry data to the Cloud.
- At edge, it's showing "connected" under the "status" section, but it's showing "disconnected" when I use a widget to monitor the edge connection on an edge dashboard. This is probably because the cloud is not sending any updates to the edge instance.
- After I restarted the Edge docker container, the edge connectivity issue was resolved; however, I am still seeing only one Device active at Cloud and there is no RPC response either. To resolve this, I had to re-assign the Devices group from Cloud and restart the edge container again.
Question: How to avoid this kind of issue in a production environment in future? Please find attached the edge container logs. Thanks
Edge version 3.5.1EDGEPE tb-edge.log
Please note that after some time, TB Cloud again shows only one Device as active despite receiving telemetry data from the other Devices via the Edge instance.
Hey @akseerali ,
I noticed errors in the logs that could be a major communication bug in the most recent release:
```
2023-07-01 17:00:08,479 [cloud-manager-71-thread-1] INFO o.t.s.s.cloud.CloudManagerService - Resetting seqIdOffset - new cycle started
2023-07-01 17:00:08,482 [cloud-manager-71-thread-1] WARN o.t.s.s.cloud.CloudManagerService - Failed to process messages handling!
java.lang.IndexOutOfBoundsException: Index: -1
at java.base/java.util.Collections$EmptyList.get(Collections.java:4483)
```
and
2023-07-01 16:58:46,293 [grpc-default-executor-66] ERROR o.t.s.s.cloud.CloudManagerService - [dd8f4df7-bcbe-b548-70f5-bc2b400fb8a4] Msg processing failed! Error msg: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition.
I plan to investigate these issues and prepare a hotfix for the 3.5.1 release. I'll update this ticket as soon as the hotfix is ready. My goal is to have the hotfix released by tomorrow.
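To see whether a given Edge log contains the same signatures, a small scan can count the matching lines. The patterns below are taken from the log excerpts quoted in this thread; the helper name `scan_edge_log` and the log file name are illustrative, and a zero count does not prove that the instance is healthy.

```shell
#!/usr/bin/env bash
# Sketch: count known error signatures in a ThingsBoard Edge log file.

# scan_edge_log LOG_FILE
# Prints one "<count> <pattern>" line per signature.
scan_edge_log() {
  local log_file="$1" pattern count

  [ -r "$log_file" ] || { echo "cannot read: $log_file" >&2; return 1; }

  for pattern in \
    'Failed to process messages handling' \
    'IndexOutOfBoundsException' \
    'Msg processing failed' \
    'Failed to process \[[0-9]+\] messages'; do
    # grep -c exits with status 1 when the count is 0, so ignore the status.
    count="$(grep -E -c -- "$pattern" "$log_file" || true)"
    printf '%s %s\n' "$count" "$pattern"
  done
}

# Example:
# docker compose logs tb-edge > tb-edge.log && scan_edge_log tb-edge.log
```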
The hotfix for the Community Edition, CE 3.5.1.1, has been completed and released. You can find it at this link: https://github.com/thingsboard/thingsboard-edge/releases/tag/v3.5.1.1
The specific commit that addresses the IndexOutOfBoundsException issue can be found here: https://github.com/thingsboard/thingsboard-edge/commit/db947eccd63e4b9d498d37c213c95d9f73a2124c
The Professional Edition hotfix, PE 3.5.1.1, is on its way and will be available soon.