
backup crashes with exec hook: "E0413: Short write"

Richard87 opened this issue 4 years ago • 5 comments

Hi!

I have just started trying out Stash, and have some issues with this BackupConfiguration:

apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: ss-backup
  namespace: skilnet
spec:
  repository:
    name: gcs-repo
  schedule: "@hourly"
  runtimeSettings:
    container:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
  target:
    ref:
      apiVersion: apps/v1
      kind: StatefulSet
      name: skilnet
    volumeMounts:
      - name: wordpress
        mountPath: /var/www
    paths:
      - /var/www/public
      - /var/www/dumps
  hooks:
    preBackup:
      containerName: php
      exec:
        command:
          - /bin/sh
          - -c
          - >
            mysqldump --verbose --single-transaction=TRUE --create-options db
            --user=db --host=127.0.0.1 --port=3306 --databases db
            > /var/www/dumps/db.sql
            && chown appuser:appuser /var/www/dumps/db.sql
  retentionPolicy:
    name: 'keep-last-1'
    keepHourly: 96
    keepDaily: 31
    keepMonthly: 12
    prune: true

If I comment out the hook, everything works fine. If I execute the hook directly in the php container, it also works fine. But when Stash runs it, I get a weird error: E0413 15:22:16.878331 1 v2.go:163] short write

I read somewhere that a short write could mean there is no free space left. The files I'm backing up total about 10 GiB, while the SQL dump is around 350 MB, and the PVC used by the target StatefulSet has 8 GiB of free space.

I0413 15:21:33.221680 1 log.go:184] FLAG: --log_backtrace_at=":0"
I0413 15:21:33.221779 1 log.go:184] FLAG: --log_dir=""
I0413 15:21:33.221863 1 log.go:184] FLAG: --logtostderr="true"
I0413 15:21:33.221951 1 log.go:184] FLAG: --master=""
I0413 15:21:33.222039 1 log.go:184] FLAG: --max-connections="0"
I0413 15:21:33.222120 1 log.go:184] FLAG: --metrics-enabled="true"
I0413 15:21:33.222208 1 log.go:184] FLAG: --pushgateway-url="http://stash-stash-community.kube-system.svc:56789"
I0413 15:21:33.222293 1 log.go:184] FLAG: --secret-dir="***REDACTED***"
I0413 15:21:33.222373 1 log.go:184] FLAG: --service-name="stash-operator"
I0413 15:21:33.222455 1 log.go:184] FLAG: --stderrthreshold="0"
I0413 15:21:33.222536 1 log.go:184] FLAG: --target-kind="StatefulSet"
I0413 15:21:33.222623 1 log.go:184] FLAG: --target-name="skilnet"
I0413 15:21:33.222704 1 log.go:184] FLAG: --use-kubeapiserver-fqdn-for-aks="true"
I0413 15:21:33.222797 1 log.go:184] FLAG: --v="3"
I0413 15:21:33.222879 1 log.go:184] FLAG: --vmodule=""
W0413 15:21:33.307554 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0413 15:21:33.426993 1 backupsession.go:450] BackupSession controller started successfully.
I0413 15:21:33.427096 1 backupsession.go:185] Sync/Add/Update for Backup Session ss-backup-1618327037
I0413 15:21:33.436952 1 backupsession.go:221] Skip processing BackupSession skilnet/ss-backup-1618327037. Reason: BackupSession has been processed already for host "host-0"
I0413 15:22:09.236049 1 backupsession.go:185] Sync/Add/Update for Backup Session ss-backup-1618327329
I0413 15:22:09.243674 1 backupsession.go:226] Skip processing BackupSession skilnet/ss-backup-1618327329. Reason: Backup process is not initiated by the operator
I0413 15:22:09.281314 1 backupsession.go:185] Sync/Add/Update for Backup Session ss-backup-1618327329
I0413 15:22:09.289665 1 backupsession.go:370] Attempting to acquire leadership for backup
I0413 15:22:09.289812 1 leaderelection.go:242] attempting to acquire leader lease skilnet/lock-statefulset-skilnet-backup...
I0413 15:22:09.301294 1 leaderelection.go:252] successfully acquired lease skilnet/lock-statefulset-skilnet-backup
I0413 15:22:09.301594 1 backupsession.go:402] Got leadership, preparing for backup
I0413 15:22:09.324833 1 commands.go:98] Checking whether the backend repository exist or not....
[golang-sh]$ /bin/restic snapshots --json --cache-dir /tmp/restic-cache
I0413 15:22:11.778310 1 commands.go:402] sh-output: [{"time":"2021-04-13T13:10:03.500921135Z","tree":"983f416befc8fa42d44029176144f29aa1ee2f1232b87806eace072fd533cd49","paths":["/var/www"],"hostname":"host-0","id":"606eb47b67e25776786b21ce96aa2644991331a80404e6268982b80d3fb024de","short_id":"606eb47b"},{"time":"2021-04-13T13:59:46.115997943Z","parent":"f56cce2fc2c9cad05838a29ed1bf6452bba501973f652c2075aa4d62c867da6d","tree":"65b7c34fef5c718b34e5d8639eaac26534707f2f8a6208ac2ea57f88dc084e28","paths":["/var/www/public"],"hostname":"host-0","id":"b0e5aaef07f2505788456103bf4e688288410d734527a70f0726611f8f6177a2","short_id":"b0e5aaef"},{"time":"2021-04-13T14:13:47.062059975Z","parent":"f74509eb1e0e7f690b9c55076e1fa34052afb0e2bbd3af3285b8158207e4e508","tree":"c198b93851c4b273eac144b9a93d0aaab9f772f254b516331496663a3a08140a","paths":["/var/www/public"],"hostname":"host-0","id":"b35ead9b3dc797ae9d4ea4f538197a69b3dcf2134f8a7c38d4ca27623cb40e75","short_id":"b35ead9b"},{"time":"2021-04-13T14:14:06.3415482Z","parent":"6b7e24d588090fa5fcfa4e966710af2d34f5f031143ced5e13121f2ef3e2c582","tree":"a5450148c32a49928dd88f5fcd9953b4512133bec6175dd810da1f18c06ec5ff","paths":["/var/www/db.sql"],"hostname":"host-0","id":"d74e67433ab2e3c324f501cebffb50dfa54f5d724b92248c9297ccb15274b385","short_id":"d74e6743"},{"time":"2021-04-13T14:32:23.123368173Z","parent":"606eb47b67e25776786b21ce96aa2644991331a80404e6268982b80d3fb024de","tree":"e52f73eafa1bd26745223344abc6b40364b45a09fdec7e8f4cb7905c0aafc597","paths":["/var/www"],"hostname":"host-0","id":"8d6df00ef619b254ea4f23f9ff6a0269021ae831aea74d2fafc7e948800af138","short_id":"8d6df00e"},{"time":"2021-04-13T15:00:11.721453843Z","parent":"b35ead9b3dc797ae9d4ea4f538197a69b3dcf2134f8a7c38d4ca27623cb40e75","tree":"f495a7211c9f2ca80a40073a978413d67c2dfc3e898234fd100347cc9132ab4a","paths":["/var/www/public"],"hostname":"host-0","id":"bff73bde6063e5da2971c4ab78c07df5a8348a53e17b423a08f3b03f5ddcca7c","short_id":"bff73bde"},{"time":"2021-04-13T15:00:30.974925366Z","tree":"eb0747d3ebe9ff07f20660d4f2a8bd2e1f94525f43a3e702b4aee699256c7505","paths":["/var/www/dumps"],"hostname":"host-0","id":"3063cd19faa31032d0eeb9828009f24644ff897f303d56f63451ed8c3962131c","short_id":"3063cd19"}]

I0413 15:22:11.800092 1 util.go:438] Executing preBackup hooks.........
E0413 15:22:16.878331 1 v2.go:163] short write

Also, nothing related shows up in the controller logs:

I0413 15:22:09.234178 1 backup_session.go:118] Sync/Add/Update for BackupSession ss-backup-1618327329
I0413 15:22:09.255716 1 backup_session.go:235] Skipping processing BackupSession skilnet/ss-backup-1618327329 for target StatefulSet skilnet/skilnet. Reason: Backup model is sidecar.Controller inside sidecar will take care of it.
I0413 15:22:09.310601 1 backup_session.go:118] Sync/Add/Update for BackupSession ss-backup-1618327329
I0413 15:22:09.338230 1 backup_session.go:216] Skipping initiating backup for StatefulSet skilnet/skilnet. Reason: Backup has been already initiated for this target.
I0413 15:22:10.288128 1 jobs.go:69] Sync/Add/Update for Job stash-backup-ss-backup-manual-ryr6e
I0413 15:22:10.288244 1 jobs.go:72] Deleting succeeded job stash-backup-ss-backup-manual-ryr6e
I0413 15:22:10.308703 1 jobs.go:83] Deleted stash job: stash-backup-ss-backup-manual-ryr6e
W0413 15:22:10.309086 1 jobs.go:65] Job skilnet/stash-backup-ss-backup-manual-ryr6e does not exist anymore
I0413 15:22:11.794041 1 backup_session.go:118] Sync/Add/Update for BackupSession ss-backup-1618327329
I0413 15:22:11.819965 1 backup_session.go:216] Skipping initiating backup for StatefulSet skilnet/skilnet. Reason: Backup has been already initiated for this target.

Richard87 avatar Apr 13 '21 15:04 Richard87

I also experience the short write during "Executing preBackup hooks". I execute the hook in the database container. Sometimes the backup succeeds, and sometimes it gets stuck after the preBackup hook. When it gets stuck, the Stash sidecar container is blocked forever and doesn't print any new log messages. Deleting the BackupSession does not help; I have to restart the Stash sidecar to unblock it. When I exec into the stuck sidecar container and look at the backup files, the file timestamps suggest the preBackup hook executed successfully, so the preBackup.exec.command is not the problem.

Legion2 avatar May 01 '21 09:05 Legion2

I looked at the code, and I think the execution gets stuck in the hook execution, which is actually an ExecProbe execution without a timeout. I opened https://github.com/kmodules/prober/issues/16 so that a timeout is enforced, which should fail the hook and unblock the execution.

Legion2 avatar May 01 '21 10:05 Legion2

Thank you @Legion2. We will look into it.

hossainemruz avatar May 01 '21 11:05 hossainemruz

Thanks for looking into it!

Richard87 avatar May 01 '21 17:05 Richard87

My backup session for an InfluxDB backup has now been hanging in the Running state for 18 days, and the last logged line is E0611 07:00:15.878676 1 v2.go:147] short write

Legion2 avatar Jun 29 '21 15:06 Legion2