filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

Feature Request: single environment variable config for protocol

Open mauvilsa opened this issue 2 years ago • 6 comments

Currently it is possible to provide overrides with environment variables of the style FSSPEC_{protocol}_{kwargname}=value. However, this is limited since it is not possible to provide a dict value. One example where this is needed is s3 with a custom endpoint_url, see https://github.com/fsspec/s3fs/issues/432.

I propose to support a single environment variable FSSPEC_{protocol} expecting its value to be json or yaml (parsed with pyyaml). It is just an extension so there would be no breaking changes. Current individual variables FSSPEC_{protocol}_{kwargname}=value would override the more general FSSPEC_{protocol}.

For example for the s3 protocol, the environment variable name would be FSSPEC_S3. Someone could set shell variables as:

$ FSSPEC_S3='
{
  "key": "my_key",
  "secret": "my_secret",
  "client_kwargs": {
    "endpoint_url": "https://custom-s3-host"
  }
}'
$ FSSPEC_S3_SECRET='my_other_secret'

Then the final arguments for the s3 protocol would be:

key = 'my_key'
secret = 'my_other_secret'
client_kwargs = {'endpoint_url': 'https://custom-s3-host'}

The only possible way to set endpoint_url would be through FSSPEC_S3, since for a more specific variable name such as FSSPEC_S3_CLIENT_KWARGS__ENDPOINT_URL there is no way to distinguish between an item child of a dict in client_kwargs or an argument whose name is client_kwargs__endpoint_url. This is not a limitation since endpoint_url without any other arguments can be set as:

$ FSSPEC_S3='
{
  "client_kwargs": {
    "endpoint_url": "https://custom-s3-host"
  }
}'

mauvilsa avatar Dec 05 '22 14:12 mauvilsa

It is reasonable to have environment variables interpreted as JSON. Either they would need to be labelled "json:{...}" or we adopt a try/except to interpret all variables. It still leaves the problem where we actually only want to set one nested parameter like endpoint_url, but the user wishes to have freedom to change other ones. Right now, they would need to create the whole set of kwargs, but one could imagine wanting to be more clever about merging sources of arguments. Likewise, this does not solve for cases where the argument is not a simple type (although JSON does at least allow us to differentiate between strings, ints and floats).

martindurant avatar Dec 05 '22 17:12 martindurant

A single variable FSSPEC_{protocol} would always be a dict with at least one entry and never a simple string. It would be safe to always parse it as json/yaml without the need for a json:{...} label. It is true that this does not allow to provide a nested arg such as endpoint_url individually like with FSSPEC_{protocol}_{kwargname}. But at least makes it possible to provide endpoint_url via environment variables, which currently is not.

@martindurant can you please give an examples where the argument is not a simple type. Based on real examples we can iterate over this to extend the proposal and support them.

mauvilsa avatar Dec 06 '22 10:12 mauvilsa

If I were to create a pull request implementing this support, how likely is it to be accepted? Any recommendations before thinking about implementing?

mauvilsa avatar Dec 12 '22 10:12 mauvilsa

Sorry, this fell into the holiday abyss.

Yes, I would be happy to consider a PR, but it may take some iteration to get right.

can you please give an examples where the argument is not a simple type

The timeout parameter to HTTP? It should be a aiohttp.ClientTimeout.

martindurant avatar Jan 10 '23 15:01 martindurant

Yes, I would be happy to consider a PR, but it may take some iteration to get right.

Created #1194.

The timeout parameter to HTTP? It should be a aiohttp.ClientTimeout.

This would be easy to do with jsonargparse, see class-type-and-sub-classes. Not great to add a dependency, but could be implemented as an optional feature only available if jsonargparse is installed.

A downside is that currently in fsspec the parsing of environment variables is unaware of protocols, and there would be a need to restrict which classes are allowed for each protocol. Allowing any class would be a potential security risk. How to restrict the classes could be resolved with a bit of discussion.

mauvilsa avatar Feb 23 '23 21:02 mauvilsa

The timeout parameter to HTTP? It should be a aiohttp.ClientTimeout.

This would be easy to do with jsonargparse, see class-type-and-sub-classes. Not great to add a dependency, but could be implemented as an optional feature only available if jsonargparse is installed.

After a bit of thought it could be as simple as fsspec exposing a way for a particular protocol to register a function that parses. This way it can be made specific to each protocol and people can use whichever parsing framework they wish. The function would receive two string parameters (the name of the protocol and the value to parse) and would return a dict. When a parser is registered, the FSSPEC_{protocol} environment variable would be parsed with it instead of just json.

mauvilsa avatar Feb 27 '23 12:02 mauvilsa