zos
Error deploying zdb - zdb.sock connection refused
I've been doing some testing with zdb deployments. Node 12 on mainnet was working fine for me yesterday, but today I see this error about connection refused from the zdb socket:
"failed to deploy zdbs on node 12: error waiting deployment: workload testtest0 within deployment 532401 failed with error: failed to create zdb namespace: failed to connect to 0-db: a5b6bd02-***: dial unix /var/run/zdb_a5b6bd02-***/zdb.sock: connect: connection refused"
I checked the node logs, and didn't find any additional messages or errors that seemed like they would be helpful.
Node 2 on testnet also has the same issue.
I've been seeing this more. Just now on mainnet nodes 1 and 12 as well.
The same problem happened on node 11 on devnet.
1- At first, I suspected there might be a problem with how ZDB instances are provisioned. However, the provisioning logic is correct: it looks for a ZDB instance that does not have a namespace matching the one being provisioned. If no suitable instance exists, it tries to find a volume without a ZDB instance and uses that instead.
To validate this, I deleted all existing instances and retried the operation — but failures still happened.
Thus, point one is confirmed not to be the source of the problem.
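For readers following along, the selection rule described in point 1 can be sketched roughly as below. This is only an illustration of the rule as stated in this thread, not the actual zos code: the `Volume` and `Instance` types, the `pick` function, and all names in `main` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Volume is a hypothetical stand-in for a storage volume on the node.
type Volume struct {
	Name   string
	HasZDB bool // true if a ZDB instance already runs on this volume
}

// Instance is a hypothetical stand-in for a running ZDB instance.
type Instance struct {
	ID         string
	Namespaces map[string]bool // namespaces already present on the instance
}

// ErrNoCapacity is returned when neither an instance nor a free volume fits.
var ErrNoCapacity = errors.New("no ZDB instance or free volume available")

// pick first looks for an existing instance that does not already hold the
// namespace. If none exists, it falls back to a volume without a ZDB instance.
// reuse is true when an existing instance was selected.
func pick(instances []Instance, volumes []Volume, ns string) (target string, reuse bool, err error) {
	for _, in := range instances {
		if !in.Namespaces[ns] {
			return in.ID, true, nil
		}
	}
	for _, v := range volumes {
		if !v.HasZDB {
			return v.Name, false, nil
		}
	}
	return "", false, ErrNoCapacity
}

func main() {
	instances := []Instance{{ID: "zdb-a", Namespaces: map[string]bool{"testtest0": true}}}
	volumes := []Volume{{Name: "vol-1", HasZDB: true}, {Name: "vol-2"}}

	target, reuse, err := pick(instances, volumes, "testtest0")
	fmt.Println(target, reuse, err)
}
```

Note that a rule like this only looks at bookkeeping (namespaces and volumes); it does not check whether the selected instance is actually reachable.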
2- Next, I suspected an issue with the flistd service.
I removed the old flistd service and replaced it with one copied from a working device (where ZDB deployment succeeds). However, even with the new flistd, the failure persisted — happening exactly three times (matching the number of available volumes).
After those three attempts, the original error returned.
The flistd logs also look suspicious.
Usually a folder with the same name has a mount like this (screenshot from the working device).
3- I also considered the possibility that the device might not be catching updates properly. However, it appears to be syncing updates correctly:
{"name":"development","target":"tf-autobuilder/tags/0536941","type":"taglink","updated":1744886331,"md5":""}
Thus, the issue is not related to update consistency.
The problem isn't consistent. One observation is that it mostly happens on devices that have been up for a very long time (maybe more than 7 months).
After further investigation with @Omarabdul3ziz, we found that the main problems were:
- The current ZDB mounting process was incorrect and not properly handled in this scenario.
- The logic used to determine whether a ZDB instance can be used was missing some necessary deeper checks.
As a solution:
- Add additional logic to verify that the ZDB mount point is correctly mounted.
- Implement cleanup logic to remove any dummy ZDB instances.
- Introduce zdb destroy logic, to be used if no active zdb instance or service is running:
  - Remove the network namespace and interface related to zdb
  - Delete the flistd mount
  - Clean up dummy zdb instances
Hi @Nabil-Salah, thanks for the work on this. Any update on when we might be able to get the PR completed to fix this?
I tried to reproduce the issue on devnet but was not able to. On node 14, for example, I deployed and removed some instances without hitting it. @scottyeager, can you verify whether this is still happening on devnet? We made some fixes and enhancements in the current milestone, which is now deployed on devnet. The draft PR from Nabil has also been merged, along with other PRs that fix mounting and unmounting of flists. If you are able to reproduce it on devnet, please tell us here.