PG_JOB_CACHE_DIR disallowed as mount (tmpfs backed)

Open · wdoekes opened this issue 4 weeks ago · 1 comment

Hi!

I'm investigating excessive load in a particular setup. We see a large number of small writes (38 bytes) on a ZFS-backed filesystem, and they appear to cause write amplification there. The writes appear to go to PG_JOB_CACHE_DIR.

I thought I'd mount tmpfs on that path, or at least something with much weaker recovery guarantees, but then I ran into this:

2025-11-26 16:39:32 UTC [45]: [1-1] 69272d44.2d 0     FATAL:  could not remove file "base/pgsql_job_cache": Device or resource busy
2025-11-26 16:39:32 UTC [45]: [2-1] 69272d44.2d 0     LOG:  database system is shut down

I traced that back to the Citus startup code, where rmdir/mkdir failures are fatal. This effectively makes putting the cache on a different filesystem impossible.

My initial minimal patch would look like this:

diff --git a/src/backend/distributed/utils/directory.c b/src/backend/distributed/utils/directory.c
index 6701bf8fb..fcf2745f2 100644
--- a/src/backend/distributed/utils/directory.c
+++ b/src/backend/distributed/utils/directory.c
@@ -32,7 +32,13 @@ CitusCreateDirectory(StringInfo directoryName)
 	int makeOK = MakePGDirectory(directoryName->data);
 	if (makeOK != 0)
 	{
-		ereport(ERROR, (errcode_for_file_access(),
+		/*
+		 * Don't raise an ERROR here. If we do, we cannot use a (bind)
+		 * mount to move the job path to another filesystem (type).
+		 * (Postgres treats ERRORs as fatal and aborts the current task.
+		 * That also applies to the initialize task.)
+		 */
+		ereport((errno == EEXIST ? WARNING : ERROR), (errcode_for_file_access(),
 						errmsg("could not create directory \"%s\": %m",
 							   directoryName->data)));
 	}
@@ -147,7 +153,13 @@ CitusRemoveDirectory(const char *filename)
 
 		if (removed != 0 && errno != ENOENT)
 		{
-			ereport(ERROR, (errcode_for_file_access(),
+			/*
+			 * Don't raise an ERROR here. If we do, we cannot use a (bind)
+			 * mount to move the job path to another filesystem (type).
+			 * (Postgres treats ERRORs as fatal and aborts the current task.
+			 * That also applies to the initialize task.)
+			 */
+			ereport(WARNING, (errcode_for_file_access(),
 							errmsg("could not remove file \"%s\": %m", filename)));
 		}
 

Thoughts:

  • this affects all calls to CitusCreateDirectory, but that function is only used for PG_JOB_CACHE_DIR, so that is not a problem;
  • the same goes for CitusRemoveDirectory;
  • I don't like that CitusRemoveDirectory and CitusCreateDirectory have different function signatures (const char * versus StringInfo); that's a bit smelly;
  • for a better fix, we could check whether the path is a mount point and then skip the remove/create entirely, and/or replace the CitusRemoveDirectory+CitusCreateDirectory pair with a CitusClearDirectory that removes the contents but leaves the parent directory in place;
  • I'd also consider turning PG_JOB_CACHE_DIR into a configuration option, but the fixes above would still be needed.
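
To make the "clear instead of remove+create" idea concrete, here is a minimal standalone sketch. It is not Citus code: clear_directory and path_is_mount_point are hypothetical names, and the sketch assumes plain POSIX calls where a real patch would presumably use the PostgreSQL wrappers and ereport.

```c
#define _XOPEN_SOURCE 700
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if path looks like a mount point, 0 if it
 * does not, -1 on error. It compares the device of path with the device of
 * its parent. Caveat: a bind mount from the same filesystem has the same
 * st_dev as its parent and is NOT detected by this check.
 */
static int
path_is_mount_point(const char *path)
{
    char parent[PATH_MAX];
    struct stat self_st, parent_st;

    if (snprintf(parent, sizeof(parent), "%s/..", path) >= (int) sizeof(parent))
    {
        errno = ENAMETOOLONG;
        return -1;
    }
    if (stat(path, &self_st) != 0 || stat(parent, &parent_st) != 0)
        return -1;

    /* Different device, or same inode as the parent (the root directory). */
    return self_st.st_dev != parent_st.st_dev ||
           self_st.st_ino == parent_st.st_ino;
}

/*
 * Hypothetical helper: recursively removes everything inside path but keeps
 * path itself, so rmdir() is never called on a (possibly mounted) directory.
 * Returns 0 on success, -1 on the first failure with errno set.
 * Note: entries are removed while iterating with readdir(); a more careful
 * version would collect the names first.
 */
static int
clear_directory(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *entry;
    int result = 0;
    int saved_errno = 0;

    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL)
    {
        char child[PATH_MAX];
        struct stat st;

        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        if (snprintf(child, sizeof(child), "%s/%s", path, entry->d_name) >=
            (int) sizeof(child))
        {
            errno = ENAMETOOLONG;
            result = -1;
        }
        else if (lstat(child, &st) != 0)
            result = -1;
        else if (S_ISDIR(st.st_mode))
        {
            /* Empty the subdirectory first, then remove it. */
            if (clear_directory(child) != 0 || rmdir(child) != 0)
                result = -1;
        }
        else if (unlink(child) != 0)
            result = -1;

        if (result != 0)
        {
            saved_errno = errno;
            break;
        }
    }

    closedir(dir);
    if (result != 0)
        errno = saved_errno;
    return result;
}
```

With something like clear_directory the startup path would never rmdir or mkdir the cache directory itself, so the mount point check would only be needed if mounts should get special treatment.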

Let me know what you think / how you would solve this.

Cheers,
Walter Doekes
OSSO B.V.

wdoekes · Nov 27 '25 10:11

> This effectively makes putting the cache on a different filesystem impossible.

Ok. Nothing is impossible. But it is kludgy: https://git.osso.nl/pub/docker/patroni-citus/-/commit/ed0eb2e2a3fae5cf0e99497ba7960b38de240960

wdoekes · Nov 27 '25 14:11