Cain stuck file copy
Hi Maor,
I'm trying to restore, but almost every time some file gets stuck during the copy and cain 0.5.1 hangs.
I tried tuning the buffer size and parallelism, but with no success: cain randomly gets stuck while copying some file. After a while in this state, the TCP connection to minio disappears from the netstat output, but the cain process stays alive.
Any idea how to increase the verbosity of the copy process?
Regards
Could it be that these are large files? Can you check in k8s whether the file size changes during the copy?
Well, the schema has only a few records since I'm trying this on a minimal test installation. If I do a du on the minio folder, the total is less than 13M:
du -skh minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/*
3.4M minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0
3.4M minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-1
3.4M minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-2
By the way, I'm running kubernetes v1.11.5 with a flannel host-gw setup.
on cassandra-0 the total data amounts to:
sudo du -skh /mnt/disks/cassandra/data/thingsboard
9.4M /mnt/disks/cassandra/data/thingsboard
same on the other nodes
Let's try to narrow this down. Can you try a copy from minio to k8s using skbn?
Sure, give me a minute.
OK, skbn seems to be working properly.
created a file on minio
sudo dd if=/dev/zero of=/mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.50469 s, 429 MB/s
Run skbn to copy that file into k8s container
kubectl run cassandra-restore --rm --serviceaccount='cassandra-backup' -i --tty --restart=Never --image-pull-policy=IfNotPresent --image nuvo/skbn --env 'AWS_ACCESS_KEY_ID=admin' --env 'AWS_SECRET_ACCESS_KEY=*****' --env 'AWS_S3_NO_SSL=true' --env 'AWS_S3_FORCE_PATH_STYLE=true' --env 'AWS_S3_ENDPOINT=http://minio-svc.cfs.svc.cluster.local:9000' --command -- sh
If you don't see a command prompt, try pressing enter.
~ $
~ $ skbn cp --src s3://db-backup/abigfile --dst k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:02:54 [1/1] copy: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:03:00 [1/1] done: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data
Check the file copied
md5sum /mnt/disks/cassandra/stdin
cd573cfaace07e7949bc0c46028904ff /mnt/disks/cassandra/stdin
md5sum /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile
cd573cfaace07e7949bc0c46028904ff /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile
Maor, looking at your skbn PerformCopy code
https://github.com/nuvo/skbn/blob/42781bdb9d5cd81fcda5a6ac44a17e0480fb0e94/pkg/skbn/skbn.go#L139
I see you are using nio buffers. Maybe the hang is due to a race condition between the pipew and piper goroutines. Converting the piper goroutine to a plain function call could be a good test to see whether that is the cause.
When cain gets stuck I can only see the "copy:" log output; the matching "done:" line never appears.
What do you think?
These routines are running concurrently, allowing copy to be done using a pipe. This has to be 2 goroutines...
See nuvo/skbn#3 for details
Then the hang is in either the Download or the Upload function.
Probably in download. Can you try the same again, but with a file that gets stuck?
Unfortunately it is not a particular file: when running cain, it randomly stops on a different (very small) file every time. Only a couple of times did it finish the job.
The funny thing is that backup runs 2x faster and never gets stuck.
This is a short gif of the hang:

If minio is a pod in the cluster, you can try treating it as k8s://... Give it a shot, as a workaround :)
Cool idea! I will try it, thanks.
No luck, it got stuck here this time :(
cain restore --src 'k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster' -n cfs -k thingsboard -t 20190112212203 --cassandra-data-dir /cassandra_data/data --buffer-size 1 -l app=cassandra
...
2019/01/13 17:37:18 [0372/1674] copy: k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0/event_by_id/manifest.json -> k8s://cfs/cassandra-0/cassandra/cassandra_data/data/thingsboard/event_by_id-42d57b20174511e986ce69f7ad260f0d/manifest.json
I want to assume this is an issue with minio, but can't verify at this time...
Well, using k8s:// gives the same result, so I guess it is something that happens during PerformCopy.
Is this project still active? I seem to be having this same issue writing from cassandra cluster on eks to s3. Tried multiple times and it gets stuck at random parts each time.