Publish JSON schema for airflow.cfg
Description
There is already a good structured YAML file providing metadata about all valid configuration options in airflow.cfg: airflow/config_templates/config.yml.
I think publishing the same data as a JSON schema and eventually to https://www.schemastore.org/json/ could be very useful.
Use case/motivation
- People could use extensions like Even Better TOML with their IDE to benefit from validation and powerful auto-completion while editing airflow.cfg
- It would be easy to leverage pre-commit hooks or other CI tools to catch mistakes in the config file.
Airflow won't complain if the configuration file contains a typo or a non-existent configuration key, which makes mistakes easy to miss. A schema could also make it easier to catch invalid values earlier.
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
The (small) problem is that the airflow.cfg file is not JSON. It's in "ini" format. I am not sure if you can validate such a format easily. Do you know any tools that can do it, and have you tested them with an Airflow .cfg file, @ghjklw?
Also be aware that we are planning (as part of https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components) to migrate the format from ".ini" to ".toml", which is now the de facto standard for configuration in many Python projects. Will that work with it? Any tools that can do it?
Maybe it should be done as part of that move, and maybe you would like to contribute to that effort and actually take part in the .toml conversion and in adding validation for the TOML file, @ghjklw?
BTW, I know you mentioned "Even Better TOML", but I am asking about CLI tools: something that can be used in our pre-commits to validate the schema in CI. The big problem with IDE-only tooling is that we are not able to verify that such a schema is actually "correct", and validating the config files generated automatically during testing would be a good test.
Hi @potiuk
My mistake for assuming airflow.cfg was toml and not ini 🙈
Regarding the tooling for JSON schema with TOML, a fairly easy alternative relying only on widely used, robust projects and the stdlib would be to read the TOML file as a dict using tomllib.load and then validate that dict using jsonschema.validate, which actually validates a mapping/dictionary/object and not a string.
See also: https://python-jsonschema.readthedocs.io/en/stable/faq/#can-jsonschema-be-used-to-validate-yaml-toml-etc
An even more powerful solution, though one that might require more work depending on how the configuration is implemented today, would be to leverage pydantic-settings. We would define the configuration as Pydantic models, and creating the JSON schema from them would be straightforward. Pydantic could itself handle parsing the TOML file through the TomlConfigSettingsSource. An added benefit of that approach is that it would create an abstraction layer between the definition of the settings structure and the format they're stored in and how they're parsed. It would then be quite easy to use YAML, JSON, etc. pydantic-settings can also take care of settings defined through environment variables.
Last but not least, check-jsonschema has support for TOML. It can be used both as a CLI tool and as a pre-commit hook.
Unfortunately, I really do not have the bandwidth nor the experience with Airflow's development to offer my help with the implementation, but if anyone wants to work on it, I'd be happy to be a sparring partner/help with testing.
Marked it as "good first issue" - hopefully someone will pick it up
I can try implementing this, feel free to assign this to me if no one else has started on this, thanks
Assigned :)
Hey @ghjklw, thanks for the detailed feature request, agreed that having validation for airflow.cfg would be very useful.
> Also be aware that we are planning (as part of https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components) to migrate the format from ".ini" format to ".toml" format which is de-facto standard for configuration for many python projects now. Will that work with it? Any tools that can do it?
Question for @potiuk : should I focus my efforts on validating the existing airflow.cfg in its current ".ini" format, or creating a validation for the newly planned ".toml" migration?
Creating the JSON schema from the airflow/config_templates/config.yml and publishing it to https://www.schemastore.org/json/ should be pretty straightforward as well and I can start on that if everyone agrees.
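A rough sketch of what that conversion could look like. It assumes config.yml maps each section name to an `options` mapping whose entries carry `type` and `description` keys; that layout, the `TYPE_MAP` table and the `build_schema` helper are assumptions for illustration, and the metadata dict below is a hand-written stand-in for what a YAML loader would return.

```python
# Assumed mapping from config.yml `type` values to JSON Schema types.
TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "boolean": "boolean",
    "float": "number",
}

# Hand-written stand-in for the parsed config.yml (hypothetical excerpt).
EXAMPLE_META = {
    "core": {
        "description": None,
        "options": {
            "parallelism": {
                "description": "Maximum number of task instances that can run concurrently.\n",
                "type": "integer",
                "default": "32",
            },
            "load_examples": {
                "description": "Whether to load the example DAGs.\n",
                "type": "boolean",
                "default": "True",
            },
        },
    }
}


def build_schema(config_meta: dict) -> dict:
    """Turn config.yml-style metadata into a Draft 07 JSON Schema dict."""
    sections = {}
    for section_name, section in config_meta.items():
        properties = {}
        for option_name, option in (section.get("options") or {}).items():
            # Unknown or missing types fall back to "string".
            prop = {"type": TYPE_MAP.get(option.get("type"), "string")}
            if option.get("description"):
                prop["description"] = option["description"].strip()
            properties[option_name] = prop
        sections[section_name] = {
            "type": "object",
            "properties": properties,
            # Flag misspelled option names inside a known section.
            "additionalProperties": False,
        }
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        # Unknown sections are left allowed at the top level in this sketch.
        "properties": sections,
    }


schema = build_schema(EXAMPLE_META)
```

Whether unknown sections should also be rejected is a design choice; the sketch only rejects unknown options within sections it knows about.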
> Question for @potiuk : should I focus my efforts on validating the existing airflow.cfg in its current ".ini" format, or creating a validation for the newly planned ".toml" migration?
Yes. I think toml might not happen
Hi @potiuk, apologies for the earlier unassignment. I am still interested in this topic and have a question about the JSON schema:
The file config.yml.schema.json already exists and appears to use the JSON Schema Draft 07 specification (published at http://json-schema.org/draft-07/schema#). Is this schema also related to the collection on https://www.schemastore.org/json/?
Would appreciate any clarification. Thank you!
Hi @dannyl1u
Thank you very much for looking into it! The file config.yml.schema.json is a JSON Schema describing the structure of config.yml itself, so not really what we're after 😉
As for schemastore.org, the way it works is that when a JSON Schema has been defined, we can ask them to publish it: https://github.com/SchemaStore/schemastore/blob/master/CONTRIBUTING.md#how-to-add-a-json-schema-thats-self-hostedremoteexternal
The point of doing this is that many tools (including most IDEs) will then automatically match it to airflow.cfg so that you get validation and auto-completion without having to do any manual configuration.
@ghjklw If my understanding is correct:
1. Create JSON Schema file using airflow/config_templates/config.yml and publish to https://www.schemastore.org/json/
2. Use pydantic or some other IDE tool to validate airflow.cfg using the json schema
Regarding (2), do you know of any tools that can be used to validate the .ini format from the json schema? I generated a schema.json file locally using the airflow/config_templates/config.yml and would like to test if my airflow.cfg passes the validation.
> Create JSON Schema file using airflow/config_templates/config.yml and publish to https://www.schemastore.org/json/
We could, yes. There is no need to publish it there; it could likely be pointed at directly from the Airflow repository. Eventually submitting it there might be a good idea, but this should be done with the Airflow PMC as the driving/controlling entity.
> Use pydantic or some other IDE tool to validate airflow.cfg using the json schema
> Regarding (2), do you know of any tools that can be used to validate the .ini format from the json schema? I generated a schema.json file locally using the airflow/config_templates/config.yml and would like to test if my airflow.cfg passes the validation.
Since TOML is a superset of .ini - this likely could work (initially proposed in this issue) https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml - or any other TOML validation solution.
> Since TOML is a superset of .ini - this likely could work (initially proposed in that issue) https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml - or any other toml validation solutions.
TOML expects strings to be wrapped in quotes ("), and there are other differences between .ini and .toml (e.g. True in .ini becomes true in .toml). So I'm not sure that simply using a TOML validator will work.
Any suggestions on bridging this gap?
If you already have a JSON Schema, feel free to share it, I'd love to play with it. Something that would be relatively easy to achieve is building a pre-commit hook by just writing a simple Python script that parses the airflow.cfg file with configparser and then validates it with jsonschema. That's something I'd be happy to contribute if you want.
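To make the idea concrete, here is a hedged sketch of such a script. Because configparser returns every value as a string, it coerces values using the types declared in the schema before validating, which is one possible way around the .ini/.toml quoting differences. The schema fragment, the sample config and the names `cfg_to_dict` and `main` are hypothetical; using `Draft7Validator` is also an assumption about which draft the generated schema targets.

```python
import configparser
import json

# Hypothetical schema fragment; a real one would be generated from config.yml.
SCHEMA = {
    "type": "object",
    "properties": {
        "core": {
            "type": "object",
            "properties": {
                "executor": {"type": "string"},
                "parallelism": {"type": "integer"},
                "load_examples": {"type": "boolean"},
            },
            "additionalProperties": False,
        }
    },
}

EXAMPLE = """\
[core]
executor = LocalExecutor
parallelism = 32
load_examples = False
"""


def cfg_to_dict(text: str, schema: dict) -> dict:
    """Parse ini text, coercing values to the types the schema declares."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    getters = {
        "integer": parser.getint,
        "number": parser.getfloat,
        "boolean": parser.getboolean,
    }
    result = {}
    for section in parser.sections():
        declared = schema.get("properties", {}).get(section, {}).get("properties", {})
        values = {}
        for key in parser.options(section):
            getter = getters.get(declared.get(key, {}).get("type"), parser.get)
            try:
                values[key] = getter(section, key)
            except ValueError:
                # Keep the raw string so the validator reports the type mismatch.
                values[key] = parser.get(section, key)
        result[section] = values
    return result


def main(cfg_path: str, schema_path: str) -> int:
    """Return 0 if the config file matches the schema, 1 otherwise."""
    import jsonschema  # third-party: pip install jsonschema

    with open(schema_path, encoding="utf-8") as handle:
        schema = json.load(handle)
    with open(cfg_path, encoding="utf-8") as handle:
        config = cfg_to_dict(handle.read(), schema)
    errors = list(jsonschema.Draft7Validator(schema).iter_errors(config))
    for error in errors:
        location = "/".join(str(part) for part in error.absolute_path) or "<root>"
        print(f"{cfg_path}: {location}: {error.message}")
    return 1 if errors else 0


config = cfg_to_dict(EXAMPLE, SCHEMA)
```

A pre-commit hook or CI step could then call `sys.exit(main(cfg_path, schema_path))` with whatever paths the project settles on.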
@ghjklw Here's a JSON Schema I generated using the config.yml https://github.com/apache/airflow/blob/fdc14432475e3a34b574caf4a98d3e4102083909/airflow/config_templates/schema.json
Let me know if there are any issues 👍