datacube-core
Configuration behaviour is confusing and poorly documented.
Expected behaviour
The behaviour of the $DATACUBE_DB_URL environment variable is clearly documented.
Actual behaviour
The $DATACUBE_DB_URL environment variable is not mentioned at all in the docs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Let's expand the scope of this to cover documenting how ODC finds its configuration and determines which database to connect to.
The only existing documentation is Installing & Managing -> Configuring -> Database Setup -> Create Configuration File
I would expect to find information in Data Access ... -> Connecting to ODC, but this page has a broken link to Setting up your environment. The link needs fixing, if that page even exists.
I think this should also be documented in the API docs for datacube.Datacube.
The current behaviour is complicated and confusing, involving many possible environment variables and configuration files, on top of flags to CLI applications and arguments passed to datacube.Datacube() when using the Python API.
The precedence of configuration files is defined in https://github.com/opendatacube/datacube-core/blob/develop/datacube/config.py#L19-L30, and the logic in the rest of config.py determines how they're used.
Configuration file precedence
- /etc/datacube.conf
- File referenced by the $DATACUBE_CONFIG_PATH environment variable
- ~/.datacube.conf
- datacube.conf in the current working directory ($PWD)
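The setting-by-setting merge behaviour of reading a list of files like this can be seen with Python's standard configparser (which datacube's config.py appears to build on). A minimal standalone sketch, using stand-in temp-file paths rather than the real locations:

```python
# Sketch only (not datacube's actual code): configparser merges files in
# the order given, with later files overriding earlier ones at the
# individual-setting level, and silently skips files that don't exist.
import configparser
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    system_conf = os.path.join(tmp, "etc_datacube.conf")  # stands in for /etc/datacube.conf
    user_conf = os.path.join(tmp, "user_datacube.conf")   # stands in for ~/.datacube.conf

    with open(system_conf, "w") as f:
        f.write("[datacube]\ndb_hostname = system-host\ndb_port = 5432\n")
    with open(user_conf, "w") as f:
        f.write("[datacube]\ndb_hostname = user-host\n")

    config = configparser.ConfigParser()
    # read() returns the list of files it actually parsed.
    found = config.read([system_conf, user_conf])

    print(config["datacube"]["db_hostname"])  # user-host (overridden by later file)
    print(config["datacube"]["db_port"])      # 5432 (inherited from earlier file)
```

Note that in this model the "precedence" of a file only matters for settings that appear in more than one file; everything else is merged.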
Environment Variables
- DATACUBE_CONFIG_PATH
- DATACUBE_ENVIRONMENT
- DATACUBE_IAM_AUTHENTICATION
- DATACUBE_DB_URL
- DATACUBE_IAM_TIMEOUT
- DB_HOSTNAME, DB_USERNAME, DB_PASSWORD, DB_DATABASE
These are on top of all the PostgreSQL configuration options, including ~/.pgpass, PGUSER, PGDATABASE, PGHOST, PGPORT, PGPASSWORD.
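For context on why $DATACUBE_DB_URL overlaps with the individual DB_* variables: a database URL bundles the same pieces of information into one string. A quick illustration of that correspondence with the standard library (the URL below is made up):

```python
# Illustration only: a postgresql:// URL carries the same settings that
# DB_USERNAME, DB_PASSWORD, DB_HOSTNAME, DB_PORT and DB_DATABASE carry
# individually.
from urllib.parse import urlparse

url = "postgresql://dc_user:secret@db.example.com:5432/datacube"
parts = urlparse(url)

print(parts.username)              # dc_user         (DB_USERNAME)
print(parts.password)              # secret          (DB_PASSWORD)
print(parts.hostname)              # db.example.com  (DB_HOSTNAME)
print(parts.port)                  # 5432            (DB_PORT)
print(parts.path.lstrip("/"))      # datacube        (DB_DATABASE)
```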
I've found an "interesting" undocumented feature lurking in the ODC configuration code. If you run python -m datacube (or call the function datacube.config.auto_config()), a new configuration file will be written to ~/.datacube.conf based upon the current environment variables. 🤯
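To make the surprise concrete, here is a hypothetical re-creation of an auto_config-style helper. The function write_config_from_env and the variable names it reads are assumptions mirroring this thread; the real datacube.config.auto_config() logic differs in detail:

```python
# Hypothetical sketch: derive a config file from current environment
# variables, as an auto_config-style helper might. NOT datacube's code.
import configparser
import io

def write_config_from_env(environ, stream):
    """Write an INI config to `stream` based on the given environment dict."""
    config = configparser.ConfigParser()
    config["default"] = {
        "db_hostname": environ.get("DB_HOSTNAME", ""),
        "db_database": environ.get("DB_DATABASE", "datacube"),
    }
    config.write(stream)

buf = io.StringIO()
write_config_from_env({"DB_HOSTNAME": "db.internal"}, buf)
print(buf.getvalue())
# [default]
# db_hostname = db.internal
# db_database = datacube
```

The gotcha in the real feature is that such a file, once written, silently becomes part of the precedence chain on every later run.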
There are a bunch of tests on how the configuration settings work in tests/test_config.py, but they don't include enough commentary to be useful as documentation.
The documentation for the configuration file needs improving too. The canonical source in the docs seems to be Database Setup -> Create Configuration File.
It is mostly made up of a commented example configuration file, but doesn't mention db_port, or any of the AWS-specific authentication options which are supported. The section of code responsible appears to be datacube/drivers/postgres/_connections.py.
Oh my, there's more complexity in choosing the "Datacube Environment" to use.
The priority for choosing the environment is (to the best of my understanding):
- The env= value passed as an argument when creating Datacube().
- The environment variable DATACUBE_ENVIRONMENT.
- An undocumented value stored in the configuration file(s):
  [user]
  default_environment=value
- default
- datacube
(See LocalConfig.__init__() in datacube/config.py)
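The priority order above can be sketched as a small resolver. This is my reading of the behaviour, not datacube's API; choose_environment is a hypothetical helper:

```python
# Sketch of the environment-selection priority as described above:
# explicit argument, then $DATACUBE_ENVIRONMENT, then a default_environment
# value from config, then the hard-coded fallbacks "default" and "datacube".
def choose_environment(env_arg=None, environ=None, config_default=None,
                       available=()):
    environ = environ or {}
    candidates = [
        env_arg,                                # 1. argument to Datacube()
        environ.get("DATACUBE_ENVIRONMENT"),    # 2. environment variable
        config_default,                         # 3. [user] default_environment
        "default",                              # 4. fallback name
        "datacube",                             # 5. fallback name
    ]
    for name in candidates:
        if name and name in available:
            return name
    raise ValueError("No suitable environment found")

# Explicit argument wins over everything else.
print(choose_environment(env_arg="dev", available={"dev", "default"}))   # dev
# With no argument, the environment variable is consulted next.
print(choose_environment(environ={"DATACUBE_ENVIRONMENT": "prod"},
                         available={"prod", "default"}))                 # prod
# Finally it falls through to the hard-coded names.
print(choose_environment(available={"datacube"}))                        # datacube
```

Whether the real code falls through when a requested environment is missing, or raises immediately, is one of the things the documentation should pin down.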
Also: I don't think "Datacube Environment" is defined clearly anywhere.
Okay, after thinking I understood this, and tearing my hair out... there's more.
The above logic is only called when creating a datacube.Datacube() in code.
If you're using the datacube CLI tools, they use completely different logic; see datacube.config.LocalConfig.find().
It respects the environment variables DATACUBE_DB_URL and DB_[USERNAME|HOSTNAME|PORT|DATABASE|PASSWORD], but doesn't support DATACUBE_CONFIG_PATH or DATACUBE_ENVIRONMENT!! Those must be passed as command line flags to the datacube command.
Oh wow, I've been through this code fairly recently but didn't spot all of these gotchas.
See also @Kirill888's comments on #1329 re: using postgres with conda.
> Configuration file precedence
> - /etc/datacube.conf
> - File referenced by the $DATACUBE_CONFIG_PATH environment variable
> - ~/.datacube.conf
> - datacube.conf in the current working directory ($PWD)
Note this list is in reverse precedence order - the ones lower in the list override the ones higher in the list. This means that the priority of the environment variable is surprisingly low - it should really be the highest precedence.
> the ones lower in the list override the ones higher in the list.

@SpacemanPaul can you confirm what granularity the override is at?
- Whole file
- Environment section
- Individual settings
Is it fusing multiple configs "somehow" or just picking the last one found?
Also I'm guessing that $DATACUBE_CONFIG_PATH is really meant to be: "an alternative to the global config that is usually in /etc/datacube.conf, but we can't have that on something like NCI, so we add an environment variable for that."
> the ones lower in the list override the ones higher in the list.
>
> @SpacemanPaul can you confirm what granularity the override is at?
> - Whole file
> - Environment section
> - Individual settings
>
> Is it fusing multiple configs "somehow" or just picking the last one found?
I'm just starting to look at that now - but on first glance, yes, it looks like it might be fusing them (and there's even yet another base default config, not mentioned in this list, that is defined statically in the source code). Immediate progress will be slow because I'm about to go on leave, but cleaning this up for 1.9 is my next priority - I'll clean up the 1.8 documentation at the same time.
> The above logic is only called when creating a datacube.Datacube() in code. If you're using the datacube CLI tools, they use completely different logic; see datacube.config.LocalConfig.find().
Looks to me like Datacube() calls LocalConfig.find as well - see the normalise_config function in Datacube.__init__()?
Then there's datacube.ui.task_app, which contains a bunch of utilities and config options that do not appear to be actually used anywhere in core, or in odc-tools. It's not clear to me what other repos might be using these functions, but gee it would be nice if I could just delete the whole file.
I think it's all tied up with the executor/worker framework which, iirc, we had already said we were going to remove. (We already partly removed celery support.)
Honestly, the behaviour in 1.8 is so complex, conditional and inconsistent, I'm not sure it's even possible to write complete, accurate and comprehensible documentation.
This thread sums up nicely our "understanding" - we use everything but the IAM stuff at the moment, but that's because it's completely undocumented and I've not been brave enough to dig into the rabbit hole...
Discussion continues: https://github.com/opendatacube/datacube-core/wiki/ODC-EP-010---Replace-Configuration-Layer
EP-10 ^^ is now a mostly complete design. Thanks to Damien and Rob for their feedback while working through this.
I'm pretty sure the new design provides a way to do everything the current system can do, except merging multiple files. As I discuss there, with the way I have the config loading process laid out, I think it would be possible to support multiple-file merging with only minor tweaks if we decide we can't do without it - but I think some of the other new features I've introduced make it unnecessary.