`_airflow_parsing_context_manager` writes to os.environ after program has started
Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After looking at `_airflow_parsing_context_manager` in `airflow/utils/dag_parsing_context.py`, I can see that it modifies `os.environ` when entering the context manager.
Writing to `os.environ` is generally a bad idea, because the lower-level `getenv`/`setenv` functions are not thread-safe. That means that, unless you have 100% control and know that no other threads have been spawned yet, the only way to ensure thread safety is to never write to the environment [1].
In the context where this code executes, I believe that is very hard to guarantee.
From what I can tell, these env vars are never used outside this Python module, so I believe there is no need for them to be environment variables, unless there is an expectation of them being passed through several Airflow invocations.
If there is no such expectation, the solution is simple: instead of environment variables, one can use Python's `contextvars.ContextVar`.
Plain global Python variables would also work equally well here.
[1] https://news.ycombinator.com/item?id=37908655
What you think should happen instead?
No response
How to reproduce
See above
Operating System
NA
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Feel free to raise a PR
Just a comment. I am not sure this is a problem. If you look at where `_airflow_parsing_context_manager` is used, it runs right before `subprocess.Popen`, or right after the process has been forked and `setproctitle` has been called on it — which in fact means that yes, we are 100% sure no other threads are running at that time. And we actually only start threads after that (so even outside of the context manager):
https://github.com/apache/airflow/blob/main/airflow/task/task_runner/base_task_runner.py#L134
https://github.com/apache/airflow/blob/main/airflow/providers/celery/executors/celery_executor_utils.py#L130
https://github.com/apache/airflow/blob/main/airflow/task/task_runner/standard_task_runner.py#L101
And the env variables are used in this case because environment variables are passed to subprocesses without modifying the process parameters (when the child process is started by spawning a new interpreter) and are automatically available to forked processes when start_by_fork is used. None of the "spawn/fork" paths is affected by thread-safety concerns (those are new processes, not threads), so there is no thread-safety problem. Basically, at the moment the context manager is used there is only a single main thread running in the process, so IMHO yes, we are 100% sure that no other threads have been started yet.
And the benefit of using env variables in this case is that passing and reading them works the same regardless of whether the child process is spawned via Popen in a new interpreter or forked.
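As a standalone illustration of that point (this is not Airflow code, and the variable name is made up), a value put into `os.environ` before spawning or forking is visible in the child either way, with no extra plumbing:

```python
# Demo: an env var set before creating children is inherited by both a
# spawned interpreter and a forked process. The variable name is invented.
import os
import subprocess
import sys

os.environ["_DEMO_PARSING_DAG_ID"] = "my_dag"

# 1. Spawned child: a fresh interpreter inherits the parent's environment.
spawned_value = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['_DEMO_PARSING_DAG_ID'])"],
    capture_output=True,
    text=True,
).stdout.strip()

# 2. Forked child: the copied address space includes os.environ.
# os.fork is POSIX-only, so this branch is skipped on Windows.
forked_ok = None
if hasattr(os, "fork"):
    pid = os.fork()
    if pid == 0:
        # Child: report via exit code whether the variable was inherited.
        os._exit(0 if os.environ.get("_DEMO_PARSING_DAG_ID") == "my_dag" else 1)
    _, status = os.waitpid(pid, 0)
    forked_ok = os.waitstatus_to_exitcode(status) == 0
```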
A plain global variable could work for forks, because global variables are available in forked processes, but if you want to implement a replacement you will also have to handle passing the context to subprocesses spawned by Popen, for example via extra arguments or (surprise) via an env variable.
You could probably come up with two different mechanisms to pass and read the values for those two cases, but IMHO it's not needed.
WDYT @hterik? How did you want to implement passing the context to spawned subprocesses? A global Python variable alone is not nearly good enough for that.
Ok, I was not aware of how this is used to pass context through forks and subprocesses.
If it's required and you are sure about where this is called, it should be fine. I don't have a better suggestion than global variables for forks and explicitly setting the env on Popen for base_task_runner. I also haven't seen any real-life issues caused by this; it was just a code review observation.
On second thought, I also think the GIL might reduce those risks in Python, though I'm not sure. Feel free to close this if you agree.
Yeah. You are completely right that the GIL is an extra guard here as well. Even if you use threads in Python, the GIL will absolutely prevent two parallel threads running C functions like getenv/setenv from interfering with each other. This is also why we very rarely use threads in Python in general (and in Airflow): they are not very useful for parallelism, except in situations where most of the threads are waiting for I/O operations.
I think those concerns about thread safety are valid in C/Java, but in Python they are very, very unlikely to cause problems, and passing things around through env variables is a pretty common pattern in Python.
This might change with https://peps.python.org/pep-0703/, when we will be able to run Python with the --no-gil option, but it will likely be many years until that becomes a usable option at all, and then we would probably have to adapt Airflow in various ways to make good use of it, so I would not complicate the code for that possibility just yet. If we do it, there will likely be a bigger effort involved, requiring re-architecting a number of things in Airflow, including revisiting how we run our tasks; for example, I think it is more likely that we change the way we fork/spawn to use sub-interpreters (https://realpython.com/python312-subinterpreters/) well before exploring no-GIL. But that requires Python 3.12 support at the very least.
And yeah. I think closing it makes sense for now :).