OpenSearch
OpenSearch copied to clipboard
Adding search node to the cluster breaks snapshot repository
Describe the bug
We have a cluster of ten nodes running for a long time.
We use SMB share to store our snapshot repository. Many years it receives daily snapshots and sometimes some of them got restored successfully.
After testing the searchable snapshot feature at the test installation, we decided to implement it to our production system.
However, it turned out that adding the node.roles: [search]
node makes storage snapshot inoperable. Here is the output:
% curl logs:9200/_snapshot/searchable/_verify -XPOST | jq . { "error": { "root_cause": [ { "type": "repository_verification_exception", "reason": "[searchable] cannot delete test data at " } ], "type": "repository_verification_exception", "reason": "[searchable] cannot delete test data at ", "caused_by": { "type": "directory_not_empty_exception", "reason": "/usr/share/opensearch/searchable/tests-WuWXrwFrTd-BiVX9VfEkTw" } }, "status": 500 }
At the same time, using any node, including new search node:
% ssh <any_node> sudo docker exec <container_name> 'ls -l /usr/share/opensearch/searchable/tests-WuWXrwFrTd-BiVX9VfEkTw' total 0
Just by switching new search node off:
% curl logs:9200/_snapshot/searchable/_verify -XPOST | jq . { "nodes": { <all 8 data/master nodes here> }}
I.e. it is definitely an empty directory.
Related component
Other
To Reproduce
Just add node.roles: [ search ]
node to existing '[data]', '[master]', and '[ ]' nodes and see weird verification error.
Expected behavior
Adding search node should not affect existing cluster.
Additional Details
Plugins Security with Keycloak SAML
Host/Environment (please complete the following information):
- OS: Oracle Enteriprise Linux 8.9
Additional context Master node log at the next message.
Here is the related master node log excerpt:
[2024-04-02T17:12:40,341][WARN ][o.o.r.RepositoriesService] [v4-master.company.com] [searchable] failed to finish repository verification org.opensearch.repositories.RepositoryVerificationException: [searchable] cannot delete test data at at org.opensearch.repositories.blobstore.BlobStoreRepository.endVerification(BlobStoreRepository.java:1986) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.repositories.RepositoriesService$3.lambda$doRun$1(RepositoriesService.java:383) [opensearch-2.12.0.jar:2.12.0] at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] at java.base/java.lang.Thread.run(Thread.java:840) [?:?] Caused by: java.nio.file.DirectoryNotEmptyException: /usr/share/opensearch/searchable/tests-evLsfSTJQAOl5JGzamyFwQ at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246) ~[?:?] at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105) ~[?:?] at java.base/java.nio.file.Files.delete(Files.java:1152) ~[?:?] at org.opensearch.common.blobstore.fs.FsBlobContainer$1.postVisitDirectory(FsBlobContainer.java:141) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.common.blobstore.fs.FsBlobContainer$1.postVisitDirectory(FsBlobContainer.java:137) ~[opensearch-2.12.0.jar:2.12.0] at java.base/java.nio.file.Files.walkFileTree(Files.java:2828) ~[?:?] at java.base/java.nio.file.Files.walkFileTree(Files.java:2882) ~[?:?] at org.opensearch.common.blobstore.fs.FsBlobContainer.delete(FsBlobContainer.java:137) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.repositories.blobstore.BlobStoreRepository.endVerification(BlobStoreRepository.java:1984) ~[opensearch-2.12.0.jar:2.12.0] ... 5 more
To collect more information, I've set logger.level: DEBUG
at all four master-eligible nodes and restated them.
And now my repository got successfully verified...
UPD: few minutes later, it failed to verify again
UPD2: it seems that it verified successfully sometimes, 5 to 10 per cent
I wrote a simple Python program to check the contents of the snapshot directory every 0.01 seconds.
Here is it's output when search node stopped and repo verified successfully: dir-ok.txt
Here is it's output when search node connected and this error returned
"type": "directory_not_empty_exception", "reason": "/usr/share/opensearch/searchable/tests-s4d0kZgWQL6dXHjTnIcMiw" dir-fail.txt
I.e. two file are sitting here extra couple of seconds.
In case somebody interested, I managed to conquer this issue.
Our cluster is running Oracle Linux Server 8.9 with ol8_UEKR6 kernel (5.4.17). New node is running Oracle Linux Server 8.9 with ol8_UEKR7 kernel (5.15.0).
We've upgraded all our nodes to 5.15 and got backup broken (repository verify failed).
Then I've restarted all nodes at 5.4 kernel and found repo verification is working properly again.
Then downgraded new node kernel to 5.4, add it to cluster and it works properly now.
Thanks for narrowing it down! Still, what is the root cause of the verification failure?
I was unable to identify the exact cause. What I've noticed that with 5.4 kernel, we see some time discrepancy (sometimes up to 2 seconds) between system clock and hardware clock of VMs (as per 'timedatectl' ouptput). With 5.15 kernel, time is the same at system clock and hardware clock. However, with 5.4 kernel snapshot just works as it supposed to works and we are now successfully implementing the searchable snapshot feature.