backup crashes with exec hook: "E0413: Short write"
Hi!
I have just started trying out Stash, and have some issues with this BackupConfiguration:
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: ss-backup
  namespace: skilnet
spec:
  repository:
    name: gcs-repo
  schedule: "@hourly"
  runtimeSettings:
    container:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
  target:
    ref:
      apiVersion: apps/v1
      kind: StatefulSet
      name: skilnet
    volumeMounts:
      - name: wordpress
        mountPath: /var/www
    paths:
      - /var/www/public
      - /var/www/dumps
  hooks:
    preBackup:
      containerName: php
      exec:
        command:
          - /bin/sh
          - -c
          - >
            mysqldump --verbose --single-transaction=TRUE --create-options db
            --user=db --host=127.0.0.1 --port=3306 --databases db
            > /var/www/dumps/db.sql
            && chown appuser:appuser /var/www/dumps/db.sql
  retentionPolicy:
    name: 'keep-last-1'
    keepHourly: 96
    keepDaily: 31
    keepMonthly: 12
    prune: true
If I comment out the hook, everything works fine. If I execute the hook manually in the php container, it also works fine. But when Stash runs it, I get a weird error: E0413 15:22:16.878331 1 v2.go:163] short write.
I read somewhere that "short write" can mean there is no more free space. The files I'm backing up total about 10 GiB, while the SQL dump is around 350 MB; the PVC used by the target StatefulSet has 8 GiB free.
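To rule out disk space as the cause, here is a minimal free-space check I would run inside the target container before the hook fires. The DUMP_DIR variable is illustrative (it defaults to the current directory here; in the pod it would be /var/www/dumps from the config above):

```shell
# Minimal free-space check for the dump target (run inside the target container).
# DUMP_DIR is an illustrative variable; point it at /var/www/dumps in the pod.
DUMP_DIR="${DUMP_DIR:-.}"

# POSIX `df -Pk`: row 2, column 4 is the available space in kilobytes.
free_kb=$(df -Pk "$DUMP_DIR" | awk 'NR==2 {print $4}')
echo "free space under $DUMP_DIR: ${free_kb} KB"
```

If the reported free space is comfortably above the ~350 MB dump size, the PVC is probably not what triggers the error.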
I0413 15:21:33.221680 1 log.go:184] FLAG: --log_backtrace_at=":0"
I0413 15:21:33.221779 1 log.go:184] FLAG: --log_dir=""
I0413 15:21:33.221863 1 log.go:184] FLAG: --logtostderr="true"
I0413 15:21:33.221951 1 log.go:184] FLAG: --master=""
I0413 15:21:33.222039 1 log.go:184] FLAG: --max-connections="0"
I0413 15:21:33.222120 1 log.go:184] FLAG: --metrics-enabled="true"
I0413 15:21:33.222208 1 log.go:184] FLAG: --pushgateway-url="http://stash-stash-community.kube-system.svc:56789"
I0413 15:21:33.222293 1 log.go:184] FLAG: --secret-dir="***REDACTED***"
I0413 15:21:33.222373 1 log.go:184] FLAG: --service-name="stash-operator"
I0413 15:21:33.222455 1 log.go:184] FLAG: --stderrthreshold="0"
I0413 15:21:33.222536 1 log.go:184] FLAG: --target-kind="StatefulSet"
I0413 15:21:33.222623 1 log.go:184] FLAG: --target-name="skilnet"
I0413 15:21:33.222704 1 log.go:184] FLAG: --use-kubeapiserver-fqdn-for-aks="true"
I0413 15:21:33.222797 1 log.go:184] FLAG: --v="3"
I0413 15:21:33.222879 1 log.go:184] FLAG: --vmodule=""
W0413 15:21:33.307554 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0413 15:21:33.426993 1 backupsession.go:450] BackupSession controller started successfully.
I0413 15:21:33.427096 1 backupsession.go:185] Sync/Add/Update for Backup Session ss-backup-1618327037
I0413 15:21:33.436952 1 backupsession.go:221] Skip processing BackupSession skilnet/ss-backup-1618327037. Reason: BackupSession has been processed already for host "host-0"
I0413 15:22:09.236049 1 backupsession.go:185] Sync/Add/Update for Backup Session ss-backup-1618327329
I0413 15:22:09.243674 1 backupsession.go:226] Skip processing BackupSession skilnet/ss-backup-1618327329. Reason: Backup process is not initiated by the operator
I0413 15:22:09.281314 1 backupsession.go:185] Sync/Add/Update for Backup Session ss-backup-1618327329
I0413 15:22:09.289665 1 backupsession.go:370] Attempting to acquire leadership for backup
I0413 15:22:09.289812 1 leaderelection.go:242] attempting to acquire leader lease skilnet/lock-statefulset-skilnet-backup...
I0413 15:22:09.301294 1 leaderelection.go:252] successfully acquired lease skilnet/lock-statefulset-skilnet-backup
I0413 15:22:09.301594 1 backupsession.go:402] Got leadership, preparing for backup
I0413 15:22:09.324833 1 commands.go:98] Checking whether the backend repository exist or not....
[golang-sh]$ /bin/restic snapshots --json --cache-dir /tmp/restic-cache
I0413 15:22:11.778310 1 commands.go:402] sh-output: [{"time":"2021-04-13T13:10:03.500921135Z","tree":"983f416befc8fa42d44029176144f29aa1ee2f1232b87806eace072fd533cd49","paths":["/var/www"],"hostname":"host-0","id":"606eb47b67e25776786b21ce96aa2644991331a80404e6268982b80d3fb024de","short_id":"606eb47b"},{"time":"2021-04-13T13:59:46.115997943Z","parent":"f56cce2fc2c9cad05838a29ed1bf6452bba501973f652c2075aa4d62c867da6d","tree":"65b7c34fef5c718b34e5d8639eaac26534707f2f8a6208ac2ea57f88dc084e28","paths":["/var/www/public"],"hostname":"host-0","id":"b0e5aaef07f2505788456103bf4e688288410d734527a70f0726611f8f6177a2","short_id":"b0e5aaef"},{"time":"2021-04-13T14:13:47.062059975Z","parent":"f74509eb1e0e7f690b9c55076e1fa34052afb0e2bbd3af3285b8158207e4e508","tree":"c198b93851c4b273eac144b9a93d0aaab9f772f254b516331496663a3a08140a","paths":["/var/www/public"],"hostname":"host-0","id":"b35ead9b3dc797ae9d4ea4f538197a69b3dcf2134f8a7c38d4ca27623cb40e75","short_id":"b35ead9b"},{"time":"2021-04-13T14:14:06.3415482Z","parent":"6b7e24d588090fa5fcfa4e966710af2d34f5f031143ced5e13121f2ef3e2c582","tree":"a5450148c32a49928dd88f5fcd9953b4512133bec6175dd810da1f18c06ec5ff","paths":["/var/www/db.sql"],"hostname":"host-0","id":"d74e67433ab2e3c324f501cebffb50dfa54f5d724b92248c9297ccb15274b385","short_id":"d74e6743"},{"time":"2021-04-13T14:32:23.123368173Z","parent":"606eb47b67e25776786b21ce96aa2644991331a80404e6268982b80d3fb024de","tree":"e52f73eafa1bd26745223344abc6b40364b45a09fdec7e8f4cb7905c0aafc597","paths":["/var/www"],"hostname":"host-0","id":"8d6df00ef619b254ea4f23f9ff6a0269021ae831aea74d2fafc7e948800af138","short_id":"8d6df00e"},{"time":"2021-04-13T15:00:11.721453843Z","parent":"b35ead9b3dc797ae9d4ea4f538197a69b3dcf2134f8a7c38d4ca27623cb40e75","tree":"f495a7211c9f2ca80a40073a978413d67c2dfc3e898234fd100347cc9132ab4a","paths":["/var/www/public"],"hostname":"host-0","id":"bff73bde6063e5da2971c4ab78c07df5a8348a53e17b423a08f3b03f5ddcca7c","short_id":"bff73bde"},{"time":"2021-04-13T15:00:30.974925366Z","tree":"eb0747d3ebe9ff07f20660d4f2a8bd2e1f94525f43a3e702b4aee699256c7505","paths":["/var/www/dumps"],"hostname":"host-0","id":"3063cd19faa31032d0eeb9828009f24644ff897f303d56f63451ed8c3962131c","short_id":"3063cd19"}]
I0413 15:22:11.800092 1 util.go:438] Executing preBackup hooks.........
E0413 15:22:16.878331 1 v2.go:163] short write
Also, nothing related appears in the controller logs:
I0413 15:22:09.234178 1 backup_session.go:118] Sync/Add/Update for BackupSession ss-backup-1618327329
I0413 15:22:09.255716 1 backup_session.go:235] Skipping processing BackupSession skilnet/ss-backup-1618327329 for target StatefulSet skilnet/skilnet. Reason: Backup model is sidecar.Controller inside sidecar will take care of it.
I0413 15:22:09.310601 1 backup_session.go:118] Sync/Add/Update for BackupSession ss-backup-1618327329
I0413 15:22:09.338230 1 backup_session.go:216] Skipping initiating backup for StatefulSet skilnet/skilnet. Reason: Backup has been already initiated for this target.
I0413 15:22:10.288128 1 jobs.go:69] Sync/Add/Update for Job stash-backup-ss-backup-manual-ryr6e
I0413 15:22:10.288244 1 jobs.go:72] Deleting succeeded job stash-backup-ss-backup-manual-ryr6e
I0413 15:22:10.308703 1 jobs.go:83] Deleted stash job: stash-backup-ss-backup-manual-ryr6e
W0413 15:22:10.309086 1 jobs.go:65] Job skilnet/stash-backup-ss-backup-manual-ryr6e does not exist anymore
I0413 15:22:11.794041 1 backup_session.go:118] Sync/Add/Update for BackupSession ss-backup-1618327329
I0413 15:22:11.819965 1 backup_session.go:216] Skipping initiating backup for StatefulSet skilnet/skilnet. Reason: Backup has been already initiated for this target.
I also experience the short write during "Executing preBackup hooks.........". I execute the hook in the database container. However, sometimes the backup succeeds and sometimes it gets stuck after the preBackup hook. When it gets stuck, the Stash sidecar container is blocked forever and doesn't print any new log messages. Deleting the BackupSession does not help; I have to restart the Stash sidecar to unblock it. When I exec into the stuck sidecar container and look at the backup files, the file timestamps suggest the preBackup hook executed successfully, so the preBackup.exec.command itself is not the problem.
I looked at the code and I think the execution gets stuck in the hook execution, which is actually an ExecProbe execution without a timeout. I opened https://github.com/kmodules/prober/issues/16 so that a timeout is enforced, which should fail the hook and unblock the execution.
Thank you @Legion2 . We will look into it.
Thanks for looking in to it!
My backup session for an InfluxDB backup has now been hanging in the Running state for 18 days, and the last logged line is E0611 07:00:15.878676 1 v2.go:147] short write