cassandra-medusa
Cassandra backup failed because of multiple default profiles in ~/.aws/config
Since we updated our production environment with
- cassandra 3.11.9
- medusa 0.10.0 (was 0.7.1)
we have been running into a strange issue that often prevents backups from completing. By often, I mean that sometimes it runs fine and sometimes it fails (and not always on the same nodes).
For an unknown reason, some nodes fail with the following error:
[1] 12:00:24 [FAILURE] <ip> Exited with error code 1
Stderr: [2021-05-18 12:00:23,717] DEBUG: Loading configuration from /etc/medusa/medusa.ini
[2021-05-18 12:00:23,719] DEBUG: Logging to file options: LoggingConfig(enabled='1', file='/var/log/medusa/medusa.log', format='[%(asctime)s] %(levelname)s: %(message)s', level='INFO', maxBytes='2000000', backupCount='2')
[2021-05-18 12:00:23,720] INFO: Monitoring provider is noop
[2021-05-18 12:00:23,720] DEBUG: Loading storage_provider: s3_eu_west
[2021-05-18 12:00:23,720] DEBUG: Getting IAM Role:
[2021-05-18 12:00:23,726] DEBUG: Reading AWS credentials from IAM Role: cassandra12-20200313151108838000000004
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/botocore/configloader.py", line 149, in raw_config_parse
cp.read([path])
File "/usr/lib64/python3.7/configparser.py", line 696, in read
self._read(fp, filename)
File "/usr/lib64/python3.7/configparser.py", line 1065, in _read
lineno)
configparser.DuplicateSectionError: While reading from '/home/cassandra/.aws/config' [line 4]: section 'default' already exists
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/aws", line 27, in <module>
sys.exit(main())
File "/usr/local/bin/aws", line 23, in main
return awscli.clidriver.main()
File "/usr/local/lib/python3.7/site-packages/awscli/clidriver.py", line 69, in main
driver = create_clidriver()
File "/usr/local/lib/python3.7/site-packages/awscli/clidriver.py", line 78, in create_clidriver
load_plugins(session.full_config.get('plugins', {}),
File "/usr/local/lib/python3.7/site-packages/botocore/session.py", line 368, in full_config
self._config = botocore.configloader.load_config(config_file)
File "/usr/local/lib/python3.7/site-packages/botocore/configloader.py", line 106, in load_config
parsed = raw_config_parse(config_filename)
File "/usr/local/lib/python3.7/site-packages/botocore/configloader.py", line 152, in raw_config_parse
path=_unicode_path(path))
botocore.exceptions.ConfigParseError: Unable to parse config file: /home/cassandra/.aws/config
[2021-05-18 12:00:24,025] ERROR: This error happened during the backup: Command '['aws', 'configure', 'set', 'default.s3.max_bandwidth', '50MB/s']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/medusa/backup_node.py", line 165, in main
storage = Storage(config=config.storage)
File "/usr/local/lib/python3.7/site-packages/medusa/storage/__init__.py", line 72, in __init__
self.storage_driver = self._connect_storage()
File "/usr/local/lib/python3.7/site-packages/medusa/storage/__init__.py", line 92, in _connect_storage
s3_storage = S3Storage(self._config)
File "/usr/local/lib/python3.7/site-packages/medusa/storage/abstract_storage.py", line 39, in __init__
self.driver = self.connect_storage()
File "/usr/local/lib/python3.7/site-packages/medusa/storage/s3_storage.py", line 106, in connect_storage
self.set_upload_bandwidth()
File "/usr/local/lib/python3.7/site-packages/medusa/storage/s3_base_storage.py", line 201, in set_upload_bandwidth
self.config.transfer_max_bandwidth,
File "/usr/local/lib64/python3.7/site-packages/gevent/subprocess.py", line 331, in check_call
raise CalledProcessError(retcode, cmd) # pylint:disable=undefined-variable
subprocess.CalledProcessError: Command '['aws', 'configure', 'set', 'default.s3.max_bandwidth', '50MB/s']' returned non-zero exit status 1.
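For reference, the failing step is Medusa shelling out to the AWS CLI to set the bandwidth cap before connecting to S3, and that CLI call itself re-reads ~/.aws/config. A rough sketch of what the call amounts to, inferred from the traceback above (not the actual Medusa source):

import subprocess

# Inferred from the traceback: set_upload_bandwidth() writes the bandwidth cap
# into ~/.aws/config by invoking the AWS CLI. The CLI parses ~/.aws/config
# before writing, so it fails as soon as the file already contains a
# duplicate [default] section.
subprocess.check_call(['aws', 'configure', 'set', 'default.s3.max_bandwidth', '50MB/s'])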
After some research, it appears that the .aws/config
file is indeed problematic because it contains a duplicate default profile:
[default]
s3 =
max_bandwidth = 50MB/s
[default]
s3 =
max_bandwidth = 50MB/s
NB: this file is located under the cassandra user's home directory; cassandra is also the user Medusa runs as.
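The parse failure itself is easy to reproduce outside of Medusa: Python's configparser, which botocore relies on, rejects duplicate sections by default. A minimal sketch:

import configparser

# The same duplicated [default] profile as in ~/.aws/config above.
sample = """[default]
s3 =
    max_bandwidth = 50MB/s
[default]
s3 =
    max_bandwidth = 50MB/s
"""

parser = configparser.ConfigParser()  # strict=True by default
parser.read_string(sample)            # raises configparser.DuplicateSectionError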
Some possibly useful information:
- Cassandra version: 3.11.9
- Medusa version: 0.10.0 (with s3 profile)
- Medusa concurrency can't be lower :)
; Max number of downloads/uploads. Not used by the GCS backend.
concurrent_transfers = 1
- Medusa backup is run as follows from another server: parallel-ssh to each Cassandra node, then run medusa backup on each node (no use of medusa backup-cluster for now)
- No program using aws-cli is running on this instance (and Medusa is the only thing, apart from Cassandra, running as the cassandra user)
- I tried both completely removing the config file and fixing it (by leaving only one instance of the default profile), but everything failed sooner or later
- Everything runs flawlessly in our QA environment (which has exactly the same component versions; the only differences are the amount of data and the number of Cassandra nodes)
- I tried to audit what causes the .aws/config file modification, so I set up some auditing (-p wa only logs attempts to write or append to the file, so the output is not polluted by read accesses):
auditctl -w /home/cassandra/.aws/config -p wa -k aws_conf_change
Here is what I got. The first audit record is from when I deleted the file.
[root@ip-xxx ~]# ausearch -f /home/cassandra/.aws/config
----
time->Fri May 14 11:56:06 2021
type=PROCTITLE msg=audit(1620993366.766:94772): proctitle=726D002D66002F686F6D652F63617373616E6472612F2E6177732F636F6E666967
type=PATH msg=audit(1620993366.766:94772): item=1 name="/home/cassandra/.aws/config" inode=661988 dev=103:01 mode=0100600 ouid=1002 ogid=1002 rdev=00:00 nametype=DELETE cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1620993366.766:94772): item=0 name="/home/cassandra/.aws/" inode=661680 dev=103:01 mode=040775 ouid=1002 ogid=1002 rdev=00:00 nametype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1620993366.766:94772): cwd="/home/cassandra"
type=SYSCALL msg=audit(1620993366.766:94772): arch=c000003e syscall=263 success=yes exit=0 a0=ffffff9c a1=2239310 a2=0 a3=166 items=2 ppid=18061 pid=18062 auid=1002 uid=1002 gid=1002 euid=1002 suid=1002 fsuid=1002 egid=1002 sgid=1002 fsgid=1002 tty=(none) ses=12902 comm="rm" exe="/usr/bin/rm" key="aws_conf_change"
----
time->Fri May 14 12:00:25 2021
type=PROCTITLE msg=audit(1620993625.449:94938): proctitle=2F7573722F62696E2F707974686F6E33002F7573722F6C6F63616C2F62696E2F61777300636F6E666967757265007365740064656661756C742E73332E6D61785F62616E6477696474680035304D422F73
type=PATH msg=audit(1620993625.449:94938): item=1 name="/home/cassandra/.aws/config" inode=661691 dev=103:01 mode=0100600 ouid=1002 ogid=1002 rdev=00:00 nametype=CREATE cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1620993625.449:94938): item=0 name="/home/cassandra/.aws/" inode=661680 dev=103:01 mode=040775 ouid=1002 ogid=1002 rdev=00:00 nametype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1620993625.449:94938): cwd="/home/cassandra"
type=SYSCALL msg=audit(1620993625.449:94938): arch=c000003e syscall=257 success=yes exit=3 a0=ffffffffffffff9c a1=7f825b25bfd0 a2=80041 a3=180 items=2 ppid=18249 pid=18265 auid=1002 uid=1002 gid=1002 euid=1002 suid=1002 fsuid=1002 egid=1002 sgid=1002 fsgid=1002 tty=(none) ses=12909 comm="aws" exe="/usr/bin/python3.7" key="aws_conf_change"
----
time->Fri May 14 12:00:25 2021
type=PROCTITLE msg=audit(1620993625.449:94939): proctitle=2F7573722F62696E2F707974686F6E33002F7573722F6C6F63616C2F62696E2F61777300636F6E666967757265007365740064656661756C742E73332E6D61785F62616E6477696474680035304D422F73
type=PATH msg=audit(1620993625.449:94939): item=1 name="/home/cassandra/.aws/config" inode=661691 dev=103:01 mode=0100600 ouid=1002 ogid=1002 rdev=00:00 nametype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1620993625.449:94939): item=0 name="/home/cassandra/.aws/" inode=661680 dev=103:01 mode=040775 ouid=1002 ogid=1002 rdev=00:00 nametype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1620993625.449:94939): cwd="/home/cassandra"
type=SYSCALL msg=audit(1620993625.449:94939): arch=c000003e syscall=257 success=yes exit=3 a0=ffffffffffffff9c a1=7fbcbfb6ff50 a2=80041 a3=180 items=2 ppid=18222 pid=18264 auid=1002 uid=1002 gid=1002 euid=1002 suid=1002 fsuid=1002 egid=1002 sgid=1002 fsgid=1002 tty=(none) ses=12908 comm="aws" exe="/usr/bin/python3.7" key="aws_conf_change"
----
time->Fri May 14 12:00:25 2021
type=PROCTITLE msg=audit(1620993625.449:94940): proctitle=2F7573722F62696E2F707974686F6E33002F7573722F6C6F63616C2F62696E2F61777300636F6E666967757265007365740064656661756C742E73332E6D61785F62616E6477696474680035304D422F73
type=PATH msg=audit(1620993625.449:94940): item=1 name="/home/cassandra/.aws/config" inode=661691 dev=103:01 mode=0100600 ouid=1002 ogid=1002 rdev=00:00 nametype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1620993625.449:94940): item=0 name="/home/cassandra/.aws/" inode=661680 dev=103:01 mode=040775 ouid=1002 ogid=1002 rdev=00:00 nametype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1620993625.449:94940): cwd="/home/cassandra"
type=SYSCALL msg=audit(1620993625.449:94940): arch=c000003e syscall=257 success=yes exit=3 a0=ffffffffffffff9c a1=7f825b25bfd0 a2=80441 a3=1b6 items=2 ppid=18249 pid=18265 auid=1002 uid=1002 gid=1002 euid=1002 suid=1002 fsuid=1002 egid=1002 sgid=1002 fsgid=1002 tty=(none) ses=12909 comm="aws" exe="/usr/bin/python3.7" key="aws_conf_change"
----
time->Fri May 14 12:00:25 2021
type=PROCTITLE msg=audit(1620993625.449:94941): proctitle=2F7573722F62696E2F707974686F6E33002F7573722F6C6F63616C2F62696E2F61777300636F6E666967757265007365740064656661756C742E73332E6D61785F62616E6477696474680035304D422F73
type=PATH msg=audit(1620993625.449:94941): item=1 name="/home/cassandra/.aws/config" inode=661691 dev=103:01 mode=0100600 ouid=1002 ogid=1002 rdev=00:00 nametype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1620993625.449:94941): item=0 name="/home/cassandra/.aws/" inode=661680 dev=103:01 mode=040775 ouid=1002 ogid=1002 rdev=00:00 nametype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1620993625.449:94941): cwd="/home/cassandra"
type=SYSCALL msg=audit(1620993625.449:94941): arch=c000003e syscall=257 success=yes exit=3 a0=ffffffffffffff9c a1=7fbcbfb6ff10 a2=80441 a3=1b6 items=2 ppid=18222 pid=18264 auid=1002 uid=1002 gid=1002 euid=1002 suid=1002 fsuid=1002 egid=1002 sgid=1002 fsgid=1002 tty=(none) ses=12908 comm="aws" exe="/usr/bin/python3.7" key="aws_conf_change"
[root@ip-ip ~]# cat /home/cassandra/.aws/config
[default]
s3 =
max_bandwidth = 50MB/s
[default]
s3 =
max_bandwidth = 50MB/s
[root@ip-xxx ~]# grep 1002 /etc/passwd
cassandra:x:1002:1002::/home/cassandra:/bin/bash
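For reference, the PROCTITLE fields in the audit records above are hex-encoded command lines with NUL-separated arguments. A small helper (not part of auditd, just for convenience) decodes them:

def decode_proctitle(hex_string):
    # auditd encodes the process title as hex, with NUL bytes between arguments.
    return bytes.fromhex(hex_string).replace(b'\x00', b' ').decode()

print(decode_proctitle('2F7573722F62696E2F707974686F6E33002F7573722F6C6F63616C2F62696E2F61777300636F6E666967757265007365740064656661756C742E73332E6D61785F62616E6477696474680035304D422F73'))
# -> /usr/bin/python3 /usr/local/bin/aws configure set default.s3.max_bandwidth 50MB/s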
Issue is synchronized with this Jira Task by Unito. Issue Number: K8SSAND-190
I'm experiencing this issue too. Same symptoms: sometimes, but not every time, a second default profile is created in the AWS config file with the max_bandwidth parameter. Medusa then fails with the error "Unable to parse config file".
edit: Medusa 0.11.0
@giom-l Same here, even after upgrading to version 0.12.1. It seems to happen randomly for me as well.
Have you tried erasing the content of ~/.aws/config before Medusa runs the backup, if there are no customized parameters in ~/.aws/config?
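Along those lines, a hypothetical pre-backup step (not part of Medusa) could clear the file whenever it only contains the max_bandwidth setting that Medusa rewrites on every run, for example:

import os

CONFIG_PATH = os.path.expanduser('~/.aws/config')

def reset_aws_config(path=CONFIG_PATH):
    # Hypothetical cleanup run before `medusa backup` on each node: remove
    # ~/.aws/config when it only carries the [default] s3 max_bandwidth lines,
    # so the next `aws configure set` starts from a clean file.
    if not os.path.exists(path):
        return
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    if all(line in ('[default]', 's3 =') or line.startswith('max_bandwidth') for line in lines):
        os.remove(path)

reset_aws_config()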
Looks like Medusa 0.13.4 has yet to resolve this duplicate issue.
The short-term solution is to wipe out all content of .aws/config for Medusa.
-Sheng
We are experiencing this issue as well. It seems to happen randomly, as others have mentioned. Are there plans to put in a fix for this?
One idea: would it be possible for Medusa to allow the duplicate config when it parses it?
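For what it's worth, Python's configparser can already merge duplicate sections when strict mode is disabled; whether awscli/botocore could be run that way is a separate question, but the idea looks roughly like this (an illustration only, not Medusa or botocore code):

import configparser

parser = configparser.ConfigParser(strict=False)
# With strict=False the second [default] section is merged into the first
# instead of raising DuplicateSectionError.
parser.read_string(
    "[default]\ns3 =\n    max_bandwidth = 50MB/s\n"
    "[default]\ns3 =\n    max_bandwidth = 50MB/s\n"
)
print(dict(parser['default']))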
Hello all. Could you please check with Medusa 0.19.1? We changed the way we talk to the storage a while back, and all this might have gotten solved.