Make GCP/(AWS?) configuration optional for a successful test run
User story
As a first-time deployment user, I would like to run the test fixture populate script without GCP (i.e. in an AWS-only shop/setup).
I'm getting the following traceback after setting the variable in my environment:
$ tests/fixtures/populate.py --s3-bucket $DSS_S3_BUCKET_TEST_FIXTURES
Traceback (most recent call last):
File "tests/fixtures/populate.py", line 10, in <module>
from tests.fixtures.populate_lib import populate
File "/home/ubuntu/data-store/tests/__init__.py", line 41, in <module>
"gcp-credentials.json"
File "/home/ubuntu/hca-venv/lib/python3.7/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/ubuntu/hca-venv/lib/python3.7/site-packages/botocore/client.py", line 612, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the GetSecretValue operation: Secrets Manager can't find the specified secret.
In other words, is the GCP section of the README absolutely mandatory? If not, which alternative steps would you advise as a workaround?
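One possible workaround, purely as a sketch: have the test bootstrap tolerate a missing GCP secret instead of failing hard. The `DSS_GCP_ENABLED` variable and the `load_gcp_credentials` helper below are illustrative assumptions, not part of the existing codebase; only the secret name comes from the traceback above.

```python
# Hypothetical sketch: make the GCP credential fetch in the test bootstrap optional.
import os
import boto3

# Assumed toggle; not an existing data-store setting.
GCP_ENABLED = os.environ.get("DSS_GCP_ENABLED", "true").lower() == "true"

def load_gcp_credentials(secret_name="gcp-credentials.json"):
    """Fetch the GCP service account key from AWS Secrets Manager,
    returning None instead of raising when GCP support is disabled
    or the secret does not exist."""
    if not GCP_ENABLED:
        return None
    client = boto3.client("secretsmanager")
    try:
        return client.get_secret_value(SecretId=secret_name)["SecretString"]
    except client.exceptions.ResourceNotFoundException:
        # No GCP secret configured; fall back to the AWS-only test path.
        return None
```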
Definition of done
The GCP configuration section is not required when deploying to AWS only. (Also, cross-cloud data transfer between commercial providers can increase costs unnecessarily.)
From Slack: "You'll need to comment out a few functions and classes and disable dss sync, DSS-event-relay, and a few other things."
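Rather than commenting functions and classes out by hand, one hedged alternative would be to gate GCP-dependent tests on a single switch. The `DSS_GCP_ENABLED` variable and the test class below are assumptions for illustration only, not existing data-store code.

```python
# Illustrative gate for GCP-dependent tests; skips them when GCP is not configured.
import os
import unittest

gcp_configured = os.environ.get("DSS_GCP_ENABLED", "true").lower() == "true"

@unittest.skipUnless(gcp_configured, "GCP is not configured; skipping GCP replica tests")
class TestGSReplicaFixtures(unittest.TestCase):
    def test_gs_bucket_populated(self):
        ...  # placeholder for a GCP-only fixture check
```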
I recently got some fair and reasonable questions about this issue.
Namely, from a design perspective I'm trying to understand the AWS+GCP hard requirement here... each of those clouds has pretty high nines nowadays. Is the rationale for requiring both simply the "multi-cloud is better" mantra?
Can somebody point me to the documented design justification for it? I feel like I'm missing the point here. Thanks in advance!
Related to issue https://github.com/HumanCellAtlas/data-store/issues/570
The design justification for having a replica of the data in both AWS and GCP is not added reliability but rather availability. We would like users to be able to access the data from their own cloud account, regardless of whether they have an AWS or a GCP account. We don't want to dictate which type of account they must have. Our long-term goal is to expand to cover other cloud vendors as well.
Then you seem to be optimizing for neither availability nor reliability.
I guess you meant (cloud/codebase) portability instead? Availability in the context of cloud computing is defined as:
Availability in this context is how much time the service provider guarantees that your data and services are available. This is typically documented as a percent of time per year, e.g. 99.999% (or "five nines") uptime means you will be unable to access resources for no more than about five minutes per year.
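For reference, a quick sanity check of the quoted "five nines" figure:

```python
# Downtime allowed per year at 99.999% availability.
minutes_per_year = 365.25 * 24 * 60            # ~525,960 minutes
allowed_downtime = (1 - 0.99999) * minutes_per_year
print(f"{allowed_downtime:.1f} minutes/year")  # ~5.3 minutes
```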
If that's the case and you are optimizing for multi-cloud, here's a good read and eye opener I went through a while ago:
https://bravenewgeek.com/multi-cloud-is-a-trap/
Not trying to be nitpicky about definitions and such, just sharing my experience of trying to go multi-cloud and spending too much time on the "wrong" problem space. Hope that helps.
Not pursuing this feature at this time
Fine @melainalegaspi, then at least specify in the README.md testing section that two commercial clouds (and which ones) are required to run the test suite.
@brainstorm good point, will do.