Update documentation around credentials management
## Description
```yaml
parquet_dataset:
  type: dask.ParquetDataSet
  filepath: "s3://bucket_name/path/to/folder"
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: "YOUR SECRET"
```
This is how Kedro's docs say credentials should be provided. However, fsspec updated its API quite a while ago, and with newer versions of fsspec you should use `key` and `secret` instead of `aws_access_key_id` and `aws_secret_access_key`.
It may only affect s3fs (this is how I ran into the error), but it potentially affects gcsfs and more.
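For newer fsspec versions, the snippet above would presumably look like this (a sketch, assuming `key` and `secret` are passed straight through to `s3fs.S3FileSystem`; values are placeholders):

```yaml
parquet_dataset:
  type: dask.ParquetDataSet
  filepath: "s3://bucket_name/path/to/folder"
  credentials:
    key: YOUR_KEY
    secret: "YOUR SECRET"
```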
## Context
The docs on credentials are out of date and mention the wrong key names. All doc chapters that mention credentials should be updated to use the correct keys.
Today I was helping @ricardopicon-mck and it was not clear how to use Google Cloud credentials. There are excellent examples of how to set up the catalog.yml:
https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-an-excel-file-from-google-cloud-storage
But how does credentials.yml look in that case?
For the record, this did the trick for me:
```yaml
gcp_credentials:
  token: gcp_credentials.json
```
But this only worked with a flat file structure. With a full-fledged Kedro project containing `conf/base` and `conf/local`, I had to specify the absolute path:
```yaml
gcp_credentials:
  token: /Users/juan_cano/Projects/QuantumBlack Labs/tmp/test-credentials/conf/local/gcp_credentials.json
```
I'm sure there is a better way.
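One workaround, purely my own sketch (`resolve_token` is a hypothetical helper, not part of Kedro's API), is to resolve relative token paths against the conf directory before handing them to the catalog:

```python
from pathlib import Path


def resolve_token(conf_dir: str, token: str) -> str:
    """Return an absolute path for a credentials token file.

    Relative tokens are resolved against the given conf directory;
    absolute tokens are passed through unchanged.
    """
    path = Path(token)
    if path.is_absolute():
        return str(path)
    return str(Path(conf_dir) / path)
```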
In general, the credentials page is not very useful: https://docs.kedro.org/en/stable/configuration/credentials.html
It places a lot of emphasis on how to load them from code, but I'd consider that "advanced" or "programmatic" usage, which is not how most users experience Kedro.
(see also https://github.com/fsspec/gcsfs/issues/583)
That's a good point, and this page needs a clean-up to bring it up to the same standard as the recent data catalog updates.
See this for reference https://github.com/kedro-org/kedro/issues/3164
We might also need to document how credentials work during development vs. in production; see this response by @noklam to a Prefect user: https://linen-slack.kedro.org/t/16019525/hi-another-question-is-there-a-way-to-directly-store-the-con#146bb5db-314d-414f-947a-fd9d64f4d223
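For the development-vs-production split, one pattern worth documenting (a sketch assuming the `OmegaConfigLoader`; `oc.env` is OmegaConf's built-in environment-variable resolver, and the entry name is made up) is to keep literal values in `conf/local` for development and read environment variables in production:

```yaml
# credentials.yml — sketch; assumes OmegaConfigLoader with OmegaConf's
# built-in oc.env resolver
prod_s3:
  key: ${oc.env:AWS_ACCESS_KEY_ID}
  secret: ${oc.env:AWS_SECRET_ACCESS_KEY}
```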
There are more problems with the snippet @noklam shared. This is a setup that worked for me:
```yaml
# catalog.yml
executive_summary:
  type: text.TextDataset
  filepath: s3://social-summarizer/executive-summary.txt
  versioned: true
  credentials: minio_fsspec
```

```yaml
# credentials.yml
minio_fsspec:
  endpoint_url: "http://127.0.0.1:9010"
  key: "minioadmin"
  secret: "minioadmin"
```
This worked fine. But if I put `endpoint_url`, `key`, and `secret` inside `client_kwargs`, then I get:
```
DatasetError: Failed while loading data from data set TextDataset(filepath=social-summarizer/executive-summary.txt, protocol=s3,
version=Version(load=None, save='2023-11-25T10.02.34.586Z')).
AioSession._create_client() got an unexpected keyword argument 'key'
```
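A plausible explanation (my reading of the traceback, not verified against s3fs internals): top-level options like `key` and `secret` are consumed by the filesystem object itself, while everything inside `client_kwargs` is forwarded verbatim to aiobotocore's client factory, which has no `key` parameter. A toy model of that forwarding (`fake_create_client` and `open_filesystem` are made-up names, not the real APIs):

```python
def fake_create_client(*, endpoint_url=None, region_name=None, **unknown):
    # The real client factory only accepts botocore parameters; anything
    # else raises a TypeError like the one in the traceback above.
    if unknown:
        bad = next(iter(unknown))
        raise TypeError(
            f"_create_client() got an unexpected keyword argument {bad!r}"
        )
    return {"endpoint_url": endpoint_url, "region_name": region_name}


def open_filesystem(credentials: dict) -> dict:
    # Top-level keys configure the filesystem; client_kwargs is passed
    # straight through to the client factory.
    fs_options = {k: v for k, v in credentials.items() if k != "client_kwargs"}
    client = fake_create_client(**credentials.get("client_kwargs", {}))
    return {"fs_options": fs_options, "client": client}


# key/secret at the top level works; key inside client_kwargs raises
# TypeError, mirroring the DatasetError above.
open_filesystem(
    {"key": "minioadmin", "secret": "minioadmin",
     "client_kwargs": {"endpoint_url": "http://127.0.0.1:9010"}}
)
```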
The fact that our dataset code is so contrived doesn't help:
https://github.com/kedro-org/kedro/blob/e8f1bfd72992336ec12591b49a5fa2654217472f/kedro/extras/datasets/text/text_dataset.py#L84-L94
(the "copy paste" problems mentioned in #1778)
For the record, I'm using fsspec==2023.10.0.
I think we should do this after #3811