astronomer-cosmos icon indicating copy to clipboard operation
astronomer-cosmos copied to clipboard

[Bug] Cosmos stale temporary directories

Open tatiana opened this issue 9 months ago • 6 comments

Context

It seems sometimes Cosmos is creating and not deleting temporary directories.

An example of a report, from 30 April 2024 in the #airflow-dbt Slack channel: https://apache-airflow.slack.com/archives/C059CC42E9W/p1714484749579599

Robert Jones: Good day I'm running cosmos 1.3.2 on MWAA 2.7.2. I'm using the load_method=LoadMode.DBT_MANIFEST in the RenderConfig. But the tmp is filling up over time with dirs that are mostly symlinks but then there is also a target directories that has the compiled code and graph.gpickle. Does cosmos come with a cleanup mechanism or do I need to build a DAG that cleans the tmp directory?

Here are some examples:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - /tmp/tmp0x486tay/target:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - total 38M
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 4 airflow airflow 4.0K Apr 30 11:37 .
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 ..
[2024-04-30T12:45:15.479+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 compiled
[2024-04-30T12:45:15.481+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 6.4M Apr 30 11:36 graph.gpickle
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 843K Apr 30 11:36 graph_summary.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  17M Apr 30 11:37 manifest.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  14M Apr 30 11:36 partial_parse.msgpack
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 run
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 3.9K Apr 30 11:37 run_results.json
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  234 Apr 30 11:37 semantic_manifest.json
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp0m2gkxru
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp0x486tay
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:45 tmp0zf1jeip
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp12fmwv6j
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp1i3_4i36
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmp20rc3co2
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmp24zhr1yr
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp35wco4yc
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmp3_ugz4uy
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp3nk2g9_w
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp3pverouw
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp4md00dzk
[2024-04-30T12:45:15.312+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp59t5kcsa
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp5_yn3j90
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - -rw------- 1 airflow airflow  13K Apr 30 11:31 tmp5c_o7q6w
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp5di13bsx
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmp5p8h7bk2
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp5zdqwdez
[2024-04-30T12:45:15.315+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp62enlhb2
[2024-04-30T12:45:15.317+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp6gv98dj3
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp74qo5pm7
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp86bttjpg
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp8_601uyf
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp8p1if86h
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp99tdocou
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp9l6ljh4p
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp9tx575tl
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmp_0uymz3x
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp_ph3232z
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmpbtvqwlhb
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpd1eb6y2y
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpe_gz5es8
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 02:15 tmpeaqsvtm4
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmpem8_pa52
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpemz78wrw
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpf3v5iz3_
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpfb8rvq9v
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpho39phpd
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmphxru1hq8
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpiamgv8ox
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpj_k0_9h4
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpjg_xwyd9
[2024-04-30T12:45:15.328+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmpjpxu2ltp
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjr32_bt0
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:38 tmpjs2i5xmv
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjsg9rdn6
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 02:15 tmpkcw52xq4
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpksxjhhda
[2024-04-30T12:45:15.332+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmplkcqmlm8
[2024-04-30T12:45:15.333+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpn57z5j0_
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmpna7v18jf
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpne2u4xqd
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmpnpbketqi
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpnsrts_b8
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:43 tmpp6qk2k5w
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmppf0pprft
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmppjrk253b
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 02:15 tmpq5_8ma32
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpq6blaw7z
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpr5zqdd56
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpsrpybyij
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmptcyhp6is
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpu591ld0j
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpuop7l2k4
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpvox078h7
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpx6jq5_ct
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 02:15 tmpxi6odkmu
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:46 tmpxpsqdy3a
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpy3vhu055
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmpyi3sxcqm
[2024-04-30T12:45:15.362+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpzcq8r5_a
[2024-04-30T12:45:15.365+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmpzw4pw0ve

There have been previous discussions in the core Airflow https://github.com/apache/airflow/issues/22404 about tempfile.TemporaryDirectory doesn't necessarily behaving as expected.

I was not able to reproduce this issue yet, but one possibility is that there is some exception or error, and the context manager tempfile.TemporaryDirectory is not being able to clean things after those scenarios.

Acceptance criteria

  • [ ] Review source code to confirm we are always using the context manager.
  • [ ] Improve how we generate temporary directories so we can know if they are being generated by DAG parsing or task execution (in this particular report, it was created by task execution)
  • [ ] Reproduce the issue, if possible
  • [ ] Propose solutions

tatiana avatar May 14 '24 08:05 tatiana

We are experiencing a similar issue in our MWAA environment:

  • astronomer-cosmos version: 1.4.1
  • MWAA/Airflow version: 2.6.3
  • MWAA configuration: mw1-large (4 vCPUs, 8 GB RAM, up to 1000 DAG capacity, 20 concurrent tasks)
  • Number of dbt DAGs: 300+
  • All MWAA workers are failing dbt DAGs/tasks due to OSError: [Errno 28] No space left on device.

I have done some investigation:

  • Worker nodes have exhausted 30 GB of disk space.
  • There are hundreds of files (manifest.json and partial_parse.msgpack) totalling ~26 MB in each /tmp/cosmos/<...>/target directory:
df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   28G   48M 100% /
sudo du -ah /tmp | sort -n -r | head -n 200
26M	/tmp/cosmos/wf_12__DBTGrp_1/target
26M	/tmp/cosmos/wf_13__DBTGrp_1/target
26M	/tmp/cosmos/wf_14__DBTGrp_1/target
26M	/tmp/cosmos/wf_15__DBTGrp_1/target
26M	/tmp/cosmos/wf_15__DBTGrp_1/target
...
...
ls -la /tmp/cosmos/wf_12__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow     4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow     4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13667800 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
ls -la /tmp/cosmos/wf_13__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow     4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow     4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13659451 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack

maslick avatar May 27 '24 07:05 maslick

Rodrigo Rabioglio reported the same issue in the #airflow-dbt Slack channel:

Hello I'm using cosmos==1.3.0 with mwaa 2.7.2 Im getting a DAG import error on MWAA due to lack of disk space. It happens only with astronomer-cosmos dbt dags, as follows. Broken DAG: [/usr/local/airflow/dags/MY_DAG.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/tempfile.py", line 854, in init self.name = mkdtemp(suffix, prefix, dir) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/tempfile.py", line 368, in mkdtemp _os.mkdir(file, 0o700) OSError: [Errno 28] No space left on device: '/tmp/tmp5u4mvh_k' I'm wondering what file can be topping disk space on device :thinking_face: I can't access the underlying mwaa container disk to evaluate it. My dbtdag RenderConfig uses LOAD_METHOD = LoadMode.DBT_LS and the execution mode is set to execution_mode=ExecutionMode.VIRTUALENV,

tatiana avatar Jun 10 '24 15:06 tatiana

I believe the issue Rodrigo is facing is likely due to https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L72

This will only be executed in case the operator execution succeeds: https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L106-L107

We should change this to use context manager.

From Cosmos 1.4 onwards, we also received reports that caching locally the partial parse file was leading to issues in MWAA: https://github.com/astronomer/astronomer-cosmos/pull/1025#issuecomment-2160827831

This will hopefully be solved once #927 is implemented.

tatiana avatar Jun 11 '24 22:06 tatiana

We are encountering the same issue with temporary directories not being deleted.

  • Cosmos version: 1.5.0
  • Cloud Composer version: 2.4.6
  • Airflow version: 2.6.3

It looks like none of the /tmp/cosmos-venv* directories are being deleted.

To reproduce this issue locally, I have been using composer-local-dev.

I added a line to manually remove the directories using

shutil.rmtree(self.virtualenv_dir, ignore_errors=True)

after the self._release_venv_lock() line,

https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L120

which resolved the issue of stale directories. However, I am unsure why this manual removal is necessary or why the directories are not automatically deleted.

Should I create a pull request with this fix, or should we investigate further to identify the root cause?

What we know so far:

  • The issue still occurs with the latest commit on the main branch.
    • I am debugging by adding logs to the main branch.
  • virtualenv_dir is unexpectedly being set to None during execution, which seems to prevent the automatic cleanup.
  • virtualenv_dir has its expected value at line self.log.info("Releasing virtualenv lock")
    • https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L119
  • By the time execution reached line self.clean_dir_if_temporary(), the value of virtualenv_dir had already been set to None.
    • https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L136
  • The TemporaryDirectory context manager is supposed to delete the directory automatically when exiting the with block.
    • However, this does not appear to be happening in this case.

kesompochy avatar Sep 03 '24 11:09 kesompochy

https://github.com/astronomer/astronomer-cosmos/issues/958#issuecomment-2326243062

I noticed that the temporary directories /tmp/cosmos-venv* are deleted when is_virtualenv_dir_temporary is set to True in [email protected]. However, the issue occurs when is_virtualenv_dir_temporary is set to False. In both cases, virtualenv_dir is None (default value). When virtualenv_dir is None, I expect temporary directories should be cleaned up after task execution regardless of the is_virtualenv_dir_temporary setting.

kesompochy avatar Sep 04 '24 04:09 kesompochy

I've observed the following behavior related to this issue in version 1.6.0: When invoke_dbt is called twice in DbtLocalBaseOperator.run_command, the temporary directories are not deleted when DbtVirtualenvBaseOperator.virtualenv_dir is initially None. In the first execution, virtualenv_dir is None, so a temporary virtualenv is created and virtualenv_dir is set to its path. In the second execution, virtualenv_dir is not None, so a new directory is created and it will not be deleted. This behavior is particularly noticeable when DbtLocalBaseOperator.install_deps is set to True, which causes invoke_dbt to be called twice. Conversely, if invoke_dbt is called only once, the directory is properly deleted when virtualenv_dir is initially None. Is this the intended behavior?

kesompochy avatar Sep 04 '24 13:09 kesompochy