Publish JSON schema for airflow.cfg

Open ghjklw opened this issue 1 year ago • 16 comments

Description

There is already a good structured YAML file providing metadata about all valid configuration options in airflow.cfg: airflow/config_templates/config.yml.

I think publishing the same data as a JSON schema and eventually to https://www.schemastore.org/json/ could be very useful.

Use case/motivation

  • People could use extensions like Even Better TOML with their IDE to benefit from validation and powerful auto-completion while editing airflow.cfg
  • It would be easy to leverage pre-commit hooks or other CI tools to catch mistakes in the config file.

Airflow won't complain if the configuration file contains a typo or a non-existent configuration key making it easy to make mistakes. It could also make it easier to catch invalid values earlier.

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

ghjklw avatar Oct 09 '24 09:10 ghjklw

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar Oct 09 '24 09:10 boring-cyborg[bot]

The (small) problem is that the airflow.cfg file is not JSON. It's in "ini" format. I am not sure if you can validate such a format easily. Do you know of any tools that can do it, and have you tested them with an Airflow .cfg file, @ghjklw?

Also be aware that we are planning (as part of https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components) to migrate the format from ".ini" to ".toml", which is now the de facto standard for configuration in many Python projects. Will that work with it? Are there any tools that can do it?

Maybe it should be done as part of that move, and maybe you would like to contribute to that effort and actually take part in the .toml conversion and in adding validation for the TOML file, @ghjklw?

potiuk avatar Oct 12 '24 22:10 potiuk

BTW, I know you mentioned "Even Better TOML", but I am asking about CLI tools: something that can be used in our pre-commit hooks to validate the schema in CI. The big problem with IDE-only tooling is that we cannot verify whether such a schema is actually "correct"; validating config files generated automatically during testing would be a good test.

potiuk avatar Oct 12 '24 22:10 potiuk

Hi @potiuk

My mistake for assuming airflow.cfg was toml and not ini 🙈

Regarding tooling for JSON Schema with TOML, a fairly easy option that relies only on widely used, robust projects and the stdlib would be to read the TOML file into a dict using tomllib.load and then validate that dict using jsonschema.validate, which validates a mapping/dictionary/object rather than a string.

See also: https://python-jsonschema.readthedocs.io/en/stable/faq/#can-jsonschema-be-used-to-validate-yaml-toml-etc

An even more powerful solution, though one that might require more work depending on how the configuration is implemented today, would be to leverage pydantic-settings. We would define the configuration as Pydantic models, so creating the JSON Schema would be straightforward. Pydantic could itself handle parsing the TOML file through TomlConfigSettingsSource. An added benefit of that approach is that it would create an abstraction layer between the definition of the settings structure and the format they're stored in or how they're parsed, so it would then be quite easy to use YAML, JSON, etc. pydantic-settings can also take care of settings defined through environment variables.

Last but not least, check-jsonschema has support for TOML. It can be used both as a CLI tool and as a pre-commit hook.

Unfortunately, I really have neither the bandwidth nor the experience with Airflow's development to offer my help with the implementation, but if anyone wants to work on it, I'd be happy to be a sparring partner or help with testing.

ghjklw avatar Oct 15 '24 07:10 ghjklw

Marked it as "good first issue" - hopefully someone will pick it up

potiuk avatar Oct 15 '24 21:10 potiuk

I can try implementing this, feel free to assign this to me if no one else has started on this, thanks

dannyl1u avatar Nov 13 '24 01:11 dannyl1u

Assigned :)

potiuk avatar Nov 13 '24 01:11 potiuk

Hey @ghjklw, thanks for the detailed feature request, agreed that having validation for airflow.cfg would be very useful.

> Also be aware that we are planning (as part of https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components) to migrate the format from ".ini" format to ".toml" format which is de-facto standard for configuration for many python projects now. Will that work with it? Any tools that can do it?

Question for @potiuk : should I focus my efforts on validating the existing airflow.cfg in its current ".ini" format, or creating a validation for the newly planned ".toml" migration?

Creating the JSON schema from the airflow/config_templates/config.yml and publishing it to https://www.schemastore.org/json/ should be pretty straightforward as well and I can start on that if everyone agrees.
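A rough sketch of what that conversion could look like, assuming config.yml maps each section name to an options mapping whose entries carry type, description, and default fields. SAMPLE_CONFIG stands in for the parsed YAML (in practice it would come from something like yaml.safe_load), its values are illustrative, and build_schema and TYPE_MAP are hypothetical names rather than anything that exists in Airflow:

```python
import json

# Stand-in for the parsed contents of config.yml (assumed layout, sample values).
SAMPLE_CONFIG = {
    "core": {
        "description": "Core settings",
        "options": {
            "executor": {"type": "string", "description": "Executor class to use", "default": "LocalExecutor"},
            "parallelism": {"type": "integer", "description": "Max running task instances", "default": "32"},
            "load_examples": {"type": "boolean", "description": "Load example DAGs", "default": "True"},
        },
    },
}

# Assumed mapping from config.yml type names to JSON Schema type names.
TYPE_MAP = {"string": "string", "integer": "integer", "boolean": "boolean", "float": "number"}


def build_schema(config: dict) -> dict:
    """Build a draft-07 JSON Schema with one object per configuration section."""
    sections = {}
    for section_name, section in config.items():
        properties = {}
        for option_name, option in section.get("options", {}).items():
            properties[option_name] = {
                # Unknown or missing types fall back to "string".
                "type": TYPE_MAP.get(option.get("type"), "string"),
                "description": option.get("description") or "",
            }
        sections[section_name] = {
            "type": "object",
            "description": section.get("description") or "",
            "properties": properties,
            # Reject misspelled or non-existent option names.
            "additionalProperties": False,
        }
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": sections,
        # Reject misspelled or non-existent section names.
        "additionalProperties": False,
    }


if __name__ == "__main__":
    print(json.dumps(build_schema(SAMPLE_CONFIG), indent=2))
```

Setting additionalProperties to false is what would catch typos in keys, but it is a design choice: it may be too strict if sections or options can be contributed from outside config.yml.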

dannyl1u avatar Nov 14 '24 04:11 dannyl1u

> Question for @potiuk : should I focus my efforts on validating the existing airflow.cfg in its current ".ini" format, or creating a validation for the newly planned ".toml" migration?

Yes. I think toml might not happen

potiuk avatar Nov 14 '24 11:11 potiuk

Hi @potiuk, apologies for the earlier unassignment. I am still interested in this topic and have a question about the JSON schema:

The file config.yml.schema.json already exists and appears to use the JSON Schema Draft 07 specification (published at http://json-schema.org/draft-07/schema#). Is this schema also related to the collection on https://www.schemastore.org/json/?

Would appreciate any clarification. Thank you!

dannyl1u avatar Dec 20 '24 05:12 dannyl1u

Hi @dannyl1u

Thank you very much for looking into it! The file config.yml.schema.json is a JSON Schema describing the structure of config.yml itself, so not really what we're after 😉

As for schemastore.org, the way it works is that once a JSON Schema has been defined, we can ask them to publish it: https://github.com/SchemaStore/schemastore/blob/master/CONTRIBUTING.md#how-to-add-a-json-schema-thats-self-hostedremoteexternal. The point of doing this is that many tools (including most IDEs) will then automatically match it to airflow.cfg, so that you get validation and auto-completion without having to do any manual configuration.

ghjklw avatar Dec 20 '24 06:12 ghjklw

@ghjklw If my understanding is correct:

  1. Create JSON Schema file using airflow/config_templates/config.yml and publish to https://www.schemastore.org/json/
  2. Use pydantic or some other IDE tool to validate airflow.cfg using the json schema

Regarding (2), do you know of any tools that can be used to validate the .ini format from the json schema? I generated a schema.json file locally using the airflow/config_templates/config.yml and would like to test if my airflow.cfg passes the validation.

dannyl1u avatar Dec 20 '24 07:12 dannyl1u

> Create JSON Schema file using airflow/config_templates/config.yml and publish to https://www.schemastore.org/json/

We could, yes. There is no need to publish it there right away: it could likely be pointed at directly from the Airflow repository. Eventually submitting it there might be a good idea; however, this should be done with the Airflow PMC as the driving/controlling entity.

> Use pydantic or some other IDE tool to validate airflow.cfg using the json schema

> Regarding (2), do you know of any tools that can be used to validate the .ini format from the json schema? I generated a schema.json file locally using the airflow/config_templates/config.yml and would like to test if my airflow.cfg passes the validation.

Since TOML is a superset of .ini, this could likely work (as initially proposed in this issue): https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml - or any other TOML validation solution.

potiuk avatar Dec 20 '24 10:12 potiuk

> Since TOML is a superset of .ini - this likely could work (initially proposed in that issue) https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml - or any other toml validation solutions.

TOML expects strings to be wrapped in quotes ("), and there are other differences between .ini and .toml (e.g. True in .ini becomes true in .toml). So I'm not sure that simply using a TOML validator will work.

Any suggestions on bridging this gap?

dannyl1u avatar Dec 20 '24 23:12 dannyl1u

If you already have a JSON Schema, feel free to share it, I'd love to play with it. Something that would be relatively easy to achieve is building a pre-commit hook by just writing a simple python script that parses the airflow.cfg file with configparser and then validates it with jsonschema. That's something I'd be happy to contribute if you want.
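A possible shape for such a script, as a sketch only: it assumes the schema has one object per section (as in a schema generated from config.yml), that the third-party jsonschema package is installed, and that values need converting because configparser returns every value as a string. The names coerce, load_cfg, and validate_cfg are hypothetical:

```python
import configparser


def coerce(value: str, json_type: str):
    """Convert an INI string to the Python type the schema expects.

    Values that cannot be converted are returned unchanged, so the
    schema validator reports them as type errors.
    """
    try:
        if json_type == "integer":
            return int(value)
        if json_type == "number":
            return float(value)
        if json_type == "boolean":
            # Same spellings configparser itself accepts (true/false, yes/no, on/off, 1/0).
            return configparser.ConfigParser.BOOLEAN_STATES[value.lower()]
    except (ValueError, KeyError):
        pass
    return value


def load_cfg(text: str, schema: dict) -> dict:
    """Parse INI text into a nested dict, coercing values according to the schema."""
    # Interpolation is disabled so that literal % characters are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    data = {}
    for section in parser.sections():
        section_schema = schema.get("properties", {}).get(section, {})
        option_schemas = section_schema.get("properties", {})
        data[section] = {
            key: coerce(value, option_schemas.get(key, {}).get("type", "string"))
            for key, value in parser.items(section)
        }
    return data


def validate_cfg(text: str, schema: dict) -> list[str]:
    """Return one "path: message" string per schema violation (empty if valid)."""
    import jsonschema  # third-party: pip install jsonschema

    validator = jsonschema.Draft7Validator(schema)
    return [
        f"{'.'.join(map(str, error.path))}: {error.message}"
        for error in validator.iter_errors(load_cfg(text, schema))
    ]
```

A pre-commit hook could then read each file path it is given, call validate_cfg on the contents, print the messages, and exit with a non-zero status if any were returned.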

ghjklw avatar Dec 21 '24 07:12 ghjklw

@ghjklw Here's a JSON Schema I generated using the config.yml https://github.com/apache/airflow/blob/fdc14432475e3a34b574caf4a98d3e4102083909/airflow/config_templates/schema.json

Let me know if there are any issues 👍

dannyl1u avatar Dec 21 '24 07:12 dannyl1u