backup to GCS and S3 never completes
I have a GKE Autopilot cluster with a relatively vanilla setup: TLS termination at the Ingress and 3 pods. Everything works except for the backup feature.
I use Solr Operator 0.6.0, Solr 8.11.0, Zookeeper 0.2.14.
Following are the relevant parts of my setup:
The backup repository section of my main yaml (GCS and S3 locations):
backupRepositories:
  - name: "gcs-backups-1"
    gcs:
      bucket: "backupbx"
      gcsCredentialSecret:
        name: "gcssecret1"
        key: "service-account-key.json"
      baseLocation: "d"
  - name: "s3-backups-1"
    s3:
      region: "us-east-1" # Required
      bucket: "solr-xsearch" # Required
      credentials:
        accessKeyIdSecret: # Optional
          name: aws-secrets
          key: access-key-id
        secretAccessKeySecret: # Optional
          name: aws-secrets
          key: secret-access-key
My backup yaml:
apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: local-backup14
  namespace: sop030
spec:
  repositoryName: "s3-backups-1"
  solrCloud: explore
  collections:
    - dsearch
The backup actually starts and both locations (S3 and GCS) receive files, but after a while the backup process stalls. There are no Solr error messages; this is the relevant portion of the logs from the pod that performs the backup:
2023-04-08 14:42:00.734 INFO (parallelCoreAdminExecutor-19-thread-3-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr x:dsearch_shard1_replica_n1 local-backup14-dsearch8519612781285 BACKUPCORE) [c:dsearch s:shard1 r:core_node2 x:dsearch_shard1_replica_n1] o.a.s.h.IncrementalShardBackup Done creating backup snapshot at s3:/local-backup14-dsearch/dsearch/ shardBackupMetadataFile:org.apache.solr.core.backup.ShardBackupId@46768e0a
2023-04-08 14:42:01.396 INFO (qtp1306246648-19) [ ] o.a.s.h.a.CoreAdminOperation Checking request status for : local-backup14-dsearch8519612781285
2023-04-08 14:42:01.396 INFO (qtp1306246648-19) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&requestid=local-backup14-dsearch8519612781285&action=REQUESTSTATUS&wt=javabin&version=2} status=0 QTime=0
2023-04-08 14:42:01.398 INFO (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch ] o.a.s.c.a.c.BackupCmd Starting to backup ZK data for backupName=local-backup14-dsearch
2023-04-08 14:42:04.358 INFO (qtp1306246648-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:04.358 INFO (qtp1306246648-21) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:09.357 INFO (qtp1306246648-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:14.358 INFO (qtp1306246648-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:14.361 INFO (qtp1306246648-21) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=5
2023-04-08 14:42:16.585 INFO (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch ] o.a.s.c.a.c.BackupCmd Completed backing up ZK data for backupName=local-backup14-dsearch
2023-04-08 14:42:16.655 INFO (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
2023-04-08 14:42:18.400 INFO (OverseerCollectionConfigSetProcessor-144115198735482887-explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr-n_0000000002) [ ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
2023-04-08 14:42:19.358 INFO (qtp1306246648-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:24.359 INFO (qtp1306246648-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:24.360 INFO (qtp1306246648-22) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:29.359 INFO (qtp1306246648-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:34.360 INFO (qtp1306246648-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=6
2023-04-08 14:42:34.360 INFO (qtp1306246648-24) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=6
2023-04-08 14:42:39.358 INFO (qtp1306246648-21) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:42.779 INFO (qtp1306246648-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=3
2023-04-08 14:42:42.797 INFO (qtp1306246648-22) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=2
And this is where it seems to die:
2023-04-08 14:42:18.400 INFO (OverseerCollectionConfigSetProcessor-144115198735482887-explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr-n_0000000002) [ ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
From there the log only has endless calls to check the requeststatus with no results:
2023-04-08 14:43:27.884 INFO (qtp1306246648-23) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=2
Additionally, through normal API calls the backup and restore functions work perfectly. Specifically, the following runs without a hitch:
https://<solr-server>/solr/admin/collections?action=BACKUP&name=local-backup15&collection=dsearch&repository=s3-backups-1&location=/
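For reference, the restore direction I mentioned is driven the same way; roughly like this (the target collection name below is just an illustrative placeholder):
https://<solr-server>/solr/admin/collections?action=RESTORE&name=local-backup15&collection=dsearch-restored&repository=s3-backups-1&location=/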
And my main yaml in full:
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: explore
spec:
  backupRepositories:
    - name: "gcs-backups-1"
      gcs:
        bucket: "backupbx"
        gcsCredentialSecret:
          name: "gcssecret1"
          key: "service-account-key.json"
        baseLocation: "d"
    - name: "s3-backups-1"
      s3:
        region: "us-east-1" # Required
        bucket: "solr-xsearch" # Required
        credentials:
          accessKeyIdSecret: # Optional
            name: aws-secrets
            key: access-key-id
          secretAccessKeySecret: # Optional
            name: aws-secrets
            key: secret-access-key
  customSolrKubeOptions:
    ingressOptions:
      ingressClassName: nginx
    podOptions:
      resources:
        limits:
          memory: 3Gi
        requests:
          cpu: 700m
          memory: 3Gi
  dataStorage:
    persistent:
      pvcTemplate:
        spec:
          resources:
            requests:
              storage: 2Gi
      reclaimPolicy: Delete
  solrSecurity:
    authenticationType: Basic
  replicas: 2
  solrImage:
    repository: solr
    tag: 8.11.0
  solrJavaMem: -Xms500M -Xmx500M
  updateStrategy:
    managed:
      maxPodsUnavailable: 2
      maxShardReplicasUnavailable: 2
    method: Managed
  zookeeperRef:
    provided:
      chroot: /explore
      image:
        pullPolicy: IfNotPresent
        repository: pravega/zookeeper
        tag: 0.2.15
      persistence:
        reclaimPolicy: Delete
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
      replicas: 3
  solrAddressability:
    external:
      domainName: <mydomain redacted>.xyz
      method: Ingress
      hideNodes: true
      useExternalAddress: false
      ingressTLSTermination:
        tlsSecret: explore-solr-tls-letsencrypt
Thanks for providing so much information!
Does this happen after a few backups complete, or does it happen from the first backup?
It happens from the very first backup after a fresh installation, once I have created a collection and added just a few records. And it happens repeatedly: I started from scratch several times and ran into the same problem on a default Autopilot GKE cluster.
So at this line it should have finished almost all of the backup logic, and the files should exist in S3/GCS. The line after that will delete any old incremental backups if necessary (which it shouldn't need to, as it's the first backup...).
2023-04-08 14:42:16.585 INFO (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch ] o.a.s.c.a.c.BackupCmd Completed backing up ZK data for backupName=local-backup14-dsearch
Also, before that error can happen, it should save the state of the completed task, so whenever the REQUESTSTATUS request is sent there should be information to populate the result.
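For reference, a successfully completed async task is normally reported by REQUESTSTATUS roughly like this (a sketch; the exact message text varies by Solr version):
{
  "responseHeader": { "status": 0, "QTime": 1 },
  "status": {
    "state": "completed",
    "msg": "found [local-backup14-dsearch] in completed tasks"
  }
}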
Regarding "From there the log only has endless calls to check the requeststatus with no results":
Can you show me what this means? What does the "no results" JSON look like?
Have you tried running an async backup yourself?
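If not, something along these lines (collection and repository names taken from your config; the backup name and request id are arbitrary) would exercise the same async path the operator drives, and the second call should show where it gets stuck:
https://<solr-server>/solr/admin/collections?action=BACKUP&name=manual-backup-1&collection=dsearch&repository=s3-backups-1&location=/&async=manual-backup-1
https://<solr-server>/solr/admin/collections?action=REQUESTSTATUS&requestid=manual-backup-1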
Hello all. I am observing exactly the same symptoms described above.
Moreover, I see something that might be of interest for the investigation.
- Let's say I have a GCS bucket called mybucket. After trying to run a backup, the files are created in the bucket as described by the OP.
- My SolrCloud is called mycloud-solr.
- My SolrBackup resource is called mycloud-backup, and I configured it with location: mycloud-solr (the location matches the SolrCloud name; this is just a preference). So it's defined like:
apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: mycloud-backup
  namespace: search
spec:
  repositoryName: "gcs-backups"
  solrCloud: mycloud-solr
  location: "mycloud-solr"
  collections:
    - articles
- The collection being backed up is called articles.
- Then I see that the actual backup files have the path mycloud-solr/mycloud-backup-articles/articles (so the path seems to be {location}/{solrBackupName}-{collection}/{collection}).
- Here is the thing: if I then connect to Solr and request http://localhost:8983/solr/admin/collections?action=LISTBACKUP&name=mycloud-backup-articles&location=mycloud-solr&repository=gcs-backups&gcsBucket=mybucket, I do get the backups I created listed there, even though the SolrBackup CRD still reports itself as "in progress" (see the status check after the JSON below). Note that name=mycloud-backup-articles corresponds to the "directory" directly under the location:
{
  "responseHeader": {
    "status": 0,
    "QTime": 347
  },
  "collection": "articles",
  "backups": [
    {
      "indexFileCount": 122,
      "indexSizeMB": 0.748,
      "shardBackupIds": {
        "shard1": "md_shard1_0.json"
      },
      "collection.configName": "articles",
      "backupId": 0,
      "collectionAlias": "articles",
      "startTime": "2023-05-11T10:22:25.339652045Z",
      "indexVersion": "9.4.2",
      "endTime": "2023-05-11T10:22:42.946120826Z"
    },
    {
      "indexFileCount": 122,
      "indexSizeMB": 0.748,
      "shardBackupIds": {
        "shard1": "md_shard1_1.json"
      },
      "collection.configName": "articles",
      "backupId": 1,
      "collectionAlias": "articles",
      "startTime": "2023-05-11T10:35:46.024110473Z",
      "indexVersion": "9.4.2",
      "endTime": "2023-05-11T10:35:50.499108324Z"
    }
  ]
}
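For completeness, the "in progress" state mentioned above is what the SolrBackup resource itself keeps reporting; it can be checked with something like (namespace and name from the manifest above):
kubectl -n search get solrbackup mycloud-backup -o yaml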