astronomer-cosmos
astronomer-cosmos copied to clipboard
[Bug] Cosmos stale temporary directories
Context
It seems sometimes Cosmos is creating and not deleting temporary directories.
An example of a report, from 30 April 2024 in the #airflow-dbt Slack channel: https://apache-airflow.slack.com/archives/C059CC42E9W/p1714484749579599
Robert Jones: Good day I'm running cosmos 1.3.2 on MWAA 2.7.2. I'm using the load_method=LoadMode.DBT_MANIFEST in the RenderConfig. But the tmp is filling up over time with dirs that are mostly symlinks but then there is also a target directories that has the compiled code and graph.gpickle. Does cosmos come with a cleanup mechanism or do I need to build a DAG that cleans the tmp directory?
Here are some examples:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - /tmp/tmp0x486tay/target:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - total 38M
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 4 airflow airflow 4.0K Apr 30 11:37 .
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 ..
[2024-04-30T12:45:15.479+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 compiled
[2024-04-30T12:45:15.481+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 6.4M Apr 30 11:36 graph.gpickle
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 843K Apr 30 11:36 graph_summary.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 17M Apr 30 11:37 manifest.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 14M Apr 30 11:36 partial_parse.msgpack
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 run
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 3.9K Apr 30 11:37 run_results.json
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 234 Apr 30 11:37 semantic_manifest.json
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp0m2gkxru
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp0x486tay
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:45 tmp0zf1jeip
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp12fmwv6j
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp1i3_4i36
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmp20rc3co2
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmp24zhr1yr
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp35wco4yc
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmp3_ugz4uy
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp3nk2g9_w
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp3pverouw
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp4md00dzk
[2024-04-30T12:45:15.312+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp59t5kcsa
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp5_yn3j90
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - -rw------- 1 airflow airflow 13K Apr 30 11:31 tmp5c_o7q6w
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp5di13bsx
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmp5p8h7bk2
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp5zdqwdez
[2024-04-30T12:45:15.315+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp62enlhb2
[2024-04-30T12:45:15.317+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp6gv98dj3
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp74qo5pm7
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp86bttjpg
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp8_601uyf
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp8p1if86h
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp99tdocou
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp9l6ljh4p
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp9tx575tl
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmp_0uymz3x
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp_ph3232z
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmpbtvqwlhb
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpd1eb6y2y
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpe_gz5es8
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 02:15 tmpeaqsvtm4
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmpem8_pa52
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpemz78wrw
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpf3v5iz3_
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpfb8rvq9v
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpho39phpd
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmphxru1hq8
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpiamgv8ox
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpj_k0_9h4
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpjg_xwyd9
[2024-04-30T12:45:15.328+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmpjpxu2ltp
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjr32_bt0
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:38 tmpjs2i5xmv
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjsg9rdn6
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 02:15 tmpkcw52xq4
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpksxjhhda
[2024-04-30T12:45:15.332+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmplkcqmlm8
[2024-04-30T12:45:15.333+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpn57z5j0_
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmpna7v18jf
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpne2u4xqd
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmpnpbketqi
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpnsrts_b8
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:43 tmpp6qk2k5w
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmppf0pprft
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmppjrk253b
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 02:15 tmpq5_8ma32
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpq6blaw7z
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpr5zqdd56
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpsrpybyij
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmptcyhp6is
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpu591ld0j
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpuop7l2k4
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpvox078h7
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpx6jq5_ct
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 02:15 tmpxi6odkmu
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:46 tmpxpsqdy3a
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpy3vhu055
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmpyi3sxcqm
[2024-04-30T12:45:15.362+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpzcq8r5_a
[2024-04-30T12:45:15.365+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmpzw4pw0ve
There have been previous discussions in the core Airflow https://github.com/apache/airflow/issues/22404 about tempfile.TemporaryDirectory doesn't necessarily behaving as expected.
I was not able to reproduce this issue yet, but one possibility is that there is some exception or error, and the context manager tempfile.TemporaryDirectory
is not being able to clean things after those scenarios.
Acceptance criteria
- [ ] Review source code to confirm we are always using the context manager.
- [ ] Improve how we generate temporary directories so we can know if they are being generated by DAG parsing or task execution (in this particular report, it was created by task execution)
- [ ] Reproduce the issue, if possible
- [ ] Propose solutions
We are experiencing a similar issue in our MWAA environment:
- astronomer-cosmos version: 1.4.1
- MWAA/Airflow version: 2.6.3
- MWAA configuration: mw1-large (4 vCPUs, 8 GB RAM, up to 1000 DAG capacity, 20 concurrent tasks)
- Number of dbt DAGs: 300+
- All MWAA workers are failing dbt DAGs/tasks due to
OSError: [Errno 28] No space left on device
.
I have done some investigation:
- Worker nodes have exhausted 30 GB of disk space.
- There are hundreds of files (
manifest.json
andpartial_parse.msgpack
) totalling ~26 MB in each/tmp/cosmos/<...>/target
directory:
df -h /tmp
Filesystem Size Used Avail Use% Mounted on
overlay 30G 28G 48M 100% /
sudo du -ah /tmp | sort -n -r | head -n 200
26M /tmp/cosmos/wf_12__DBTGrp_1/target
26M /tmp/cosmos/wf_13__DBTGrp_1/target
26M /tmp/cosmos/wf_14__DBTGrp_1/target
26M /tmp/cosmos/wf_15__DBTGrp_1/target
26M /tmp/cosmos/wf_15__DBTGrp_1/target
...
...
ls -la /tmp/cosmos/wf_12__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow 4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow 4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13667800 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
ls -la /tmp/cosmos/wf_13__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow 4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow 4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13659451 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
Rodrigo Rabioglio reported the same issue in the #airflow-dbt Slack channel:
Hello I'm using cosmos==1.3.0 with mwaa 2.7.2 Im getting a DAG import error on MWAA due to lack of disk space. It happens only with astronomer-cosmos dbt dags, as follows. Broken DAG: [/usr/local/airflow/dags/MY_DAG.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/tempfile.py", line 854, in init self.name = mkdtemp(suffix, prefix, dir) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/tempfile.py", line 368, in mkdtemp _os.mkdir(file, 0o700) OSError: [Errno 28] No space left on device: '/tmp/tmp5u4mvh_k' I'm wondering what file can be topping disk space on device :thinking_face: I can't access the underlying mwaa container disk to evaluate it. My dbtdag RenderConfig uses LOAD_METHOD = LoadMode.DBT_LS and the execution mode is set to execution_mode=ExecutionMode.VIRTUALENV,
I believe the issue Rodrigo is facing is likely due to https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L72
This will only be executed in case the operator execution succeeds: https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L106-L107
We should change this to use context manager.
From Cosmos 1.4 onwards, we also received reports that caching locally the partial parse file was leading to issues in MWAA: https://github.com/astronomer/astronomer-cosmos/pull/1025#issuecomment-2160827831
This will hopefully be solved once #927 is implemented.
We are encountering the same issue with temporary directories not being deleted.
- Cosmos version: 1.5.0
- Cloud Composer version: 2.4.6
- Airflow version: 2.6.3
It looks like none of the /tmp/cosmos-venv*
directories are being deleted.
To reproduce this issue locally, I have been using composer-local-dev.
I added a line to manually remove the directories using
shutil.rmtree(self.virtualenv_dir, ignore_errors=True)
after the self._release_venv_lock()
line,
https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L120
which resolved the issue of stale directories. However, I am unsure why this manual removal is necessary or why the directories are not automatically deleted.
Should I create a pull request with this fix, or should we investigate further to identify the root cause?
What we know so far:
- The issue still occurs with the latest commit on the main branch.
- I am debugging by adding logs to the main branch.
-
virtualenv_dir
is unexpectedly being set to None during execution, which seems to prevent the automatic cleanup. -
virtualenv_dir
has its expected value at lineself.log.info("Releasing virtualenv lock")
- https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L119
- By the time execution reached line
self.clean_dir_if_temporary()
, the value ofvirtualenv_dir
had already been set to None.- https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L136
- The
TemporaryDirectory
context manager is supposed to delete the directory automatically when exiting the with block.- However, this does not appear to be happening in this case.
https://github.com/astronomer/astronomer-cosmos/issues/958#issuecomment-2326243062
I noticed that the temporary directories /tmp/cosmos-venv*
are deleted when is_virtualenv_dir_temporary
is set to True
in [email protected]. However, the issue occurs when is_virtualenv_dir_temporary
is set to False
. In both cases, virtualenv_dir
is None
(default value).
When virtualenv_dir
is None
, I expect temporary directories should be cleaned up after task execution regardless of the is_virtualenv_dir_temporary
setting.
I've observed the following behavior related to this issue in version 1.6.0:
When invoke_dbt
is called twice in DbtLocalBaseOperator.run_command
, the temporary directories are not deleted when DbtVirtualenvBaseOperator.virtualenv_dir
is initially None. In the first execution, virtualenv_dir
is None
, so a temporary virtualenv is created and virtualenv_dir
is set to its path. In the second execution, virtualenv_dir
is not None
, so a new directory is created and it will not be deleted.
This behavior is particularly noticeable when DbtLocalBaseOperator.install_deps
is set to True
, which causes invoke_dbt
to be called twice. Conversely, if invoke_dbt
is called only once, the directory is properly deleted when virtualenv_dir
is initially None.
Is this the intended behavior?