bazel-buildfarm icon indicating copy to clipboard operation
bazel-buildfarm copied to clipboard

Error creating exec dir

Open aghoussaini opened this issue 3 years ago • 2 comments

Worker side error:

SEVERE: error creating exec dir for memory/operations/684d7a47-2132-4f8e-9acd-8abc18b54453
build.buildfarm.cas.cfc.CASFileCache$PutDirectoryException: \tmp\worker\cache\4d69739083e5d79601c7c729bcb16bd620099e231e2c1f4824337788ebfd63db_dir: 1 exceptions
	at build.buildfarm.cas.cfc.CASFileCache.lambda$putDirectorySynchronized$20(CASFileCache.java:2138)
	at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
	at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:321)
	at com.google.common.util.concurrent.MoreExecutors$5.execute(MoreExecutors.java:1108)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:724)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:126)
	at com.google.common.util.concurrent.AbstractTransformFuture.create(AbstractTransformFuture.java:39)
	at com.google.common.util.concurrent.Futures.transformAsync(Futures.java:446)
	at build.buildfarm.cas.cfc.CASFileCache.lambda$putDirectorySynchronized$21(CASFileCache.java:2119)
	at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
	at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:321)
	at com.google.common.util.concurrent.MoreExecutors$5.execute(MoreExecutors.java:1108)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:724)
	at com.google.common.util.concurrent.FluentFuture$TrustedFuture.addListener(FluentFuture.java:110)
	at com.google.common.util.concurrent.AbstractTransformFuture.create(AbstractTransformFuture.java:39)
	at com.google.common.util.concurrent.Futures.transformAsync(Futures.java:446)
	at build.buildfarm.cas.cfc.CASFileCache.putDirectorySynchronized(CASFileCache.java:2107)
	at build.buildfarm.cas.cfc.CASFileCache.putDirectory(CASFileCache.java:1965)
	at build.buildfarm.worker.memory.OperationQueueWorkerContext.linkDirectory(OperationQueueWorkerContext.java:574)
	at build.buildfarm.worker.memory.OperationQueueWorkerContext.fetchInputs(OperationQueueWorkerContext.java:565)
	at build.buildfarm.worker.memory.OperationQueueWorkerContext.createExecDir(OperationQueueWorkerContext.java:310)
	at build.buildfarm.worker.InputFetcher.fetchPolled(InputFetcher.java:198)
	at build.buildfarm.worker.InputFetcher.runInterruptibly(InputFetcher.java:101)
	at build.buildfarm.worker.InputFetcher.run(InputFetcher.java:288)
	at java.base/java.lang.Thread.run(Thread.java:834)
	Suppressed: java.lang.InterruptedException
		at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
		at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88)
		at build.buildfarm.cas.cfc.CASFileCache.lambda$putDirectorySynchronized$20(CASFileCache.java:2129)
		... 30 more
	Suppressed: java.lang.IllegalStateException: inputDirectory 4d69739083e5d79601c7c729bcb16bd620099e231e2c1f4824337788ebfd63db/218 is not in directoryStorage
		at build.buildfarm.cas.cfc.CASFileCache.decrementReferencesSynchronized(CASFileCache.java:1583)
		at build.buildfarm.cas.cfc.CASFileCache.decrementReferences(CASFileCache.java:1544)
		at build.buildfarm.worker.memory.OperationQueueWorkerContext$ReferenceTransaction.close(OperationQueueWorkerContext.java:284)
		at build.buildfarm.worker.memory.OperationQueueWorkerContext.createExecDir(OperationQueueWorkerContext.java:309)
		... 4 more

From the server side:

Jul 18, 2022 4:48:33 PM build.buildfarm.server.BuildFarmServer <init>
INFO: buildfarm-server-159017b1-5b30-4869-b4e8-5ab3dfdec393 initialized
Jul 18, 2022 4:48:34 PM build.buildfarm.metrics.prometheus.PrometheusPublisher startHttpServer
INFO: Started Prometheus HTTP Server on port 9090
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: Operation memory/operations/3ec8078b-7cd6-440b-bf2c-5567cb996f39 was created
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: Operation memory/operations/684d7a47-2132-4f8e-9acd-8abc18b54453 was created
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: Operation memory/operations/22fe982b-9227-42e2-98b9-f1086deb0a0d was created
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: Operation memory/operations/1c5005f8-cba5-4800-9911-258cdcd5ac94 was created
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: memory::execute(7d1770716f03aff1e4290501a65fdb54595aad1af7804e5e111cbd965c616692/142): memory/operations/22fe982b-9227-42e2-98b9-f1086deb0a0d
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: memory::execute(073825d05cf580894d7f248db92e3c974917a348017f1f21d12c95b22b747323/142): memory/operations/684d7a47-2132-4f8e-9acd-8abc18b54453
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: memory::execute(e4f2feb63bf95eeed83f0cb1c5239217a47fc99e3875d38a7b27f8e9f2d6b124/142): memory/operations/3ec8078b-7cd6-440b-bf2c-5567cb996f39
Jul 18, 2022 4:56:08 PM build.buildfarm.instance.server.AbstractServerInstance execute
INFO: memory::execute(9864879d3a1e4a1c28f857f19ccb499e6b87205df1d5c1112b4d6bee05f15d9a/142): memory/operations/1c5005f8-cba5-4800-9911-258cdcd5ac94
Jul 18, 2022 4:56:38 PM build.buildfarm.instance.memory.MemoryInstance lambda$onDispatched$4
INFO: REQUEUEING memory/operations/684d7a47-2132-4f8e-9acd-8abc18b54453
Jul 18, 2022 4:56:38 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(073825d05cf580894d7f248db92e3c974917a348017f1f21d12c95b22b747323/142): memory/operations/684d7a47-2132-4f8e-9acd-8abc18b54453
Jul 18, 2022 4:57:38 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(073825d05cf580894d7f248db92e3c974917a348017f1f21d12c95b22b747323/142): memory/operations/684d7a47-2132-4f8e-9acd-8abc18b54453
Jul 18, 2022 4:58:08 PM build.buildfarm.instance.memory.MemoryInstance lambda$onDispatched$4
INFO: REQUEUEING memory/operations/3ec8078b-7cd6-440b-bf2c-5567cb996f39
Jul 18, 2022 4:58:08 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(e4f2feb63bf95eeed83f0cb1c5239217a47fc99e3875d38a7b27f8e9f2d6b124/142): memory/operations/3ec8078b-7cd6-440b-bf2c-5567cb996f39

Basically, the server is requeuing the operation indefinetely.

What I have tried:

  • Increasing input_fetch_deadline
  • setting link_input_directories to false

aghoussaini avatar Jul 18 '22 14:07 aghoussaini

Fascinating. The directory creation was interrupted midstream, but a reference to the directory was added to a registry of dirs that should be destroyed in case of a failure. This only affects memory (OperationQueue) workers, as the order of addition was already switched in shard. I'll put up a change shortly to address this.

For what it's worth, your directory creation is also stalled long enough to incur the interrupt in the first place. If this is the case, it may be that your CAS storage is not sufficiently large to maintain the working set of blob entries for a pathological set of actions: if all of the CAS entries have references, and all reference holders are not executing (i.e. have completed their exec input population), you can deadlock due to a lack of available storage space. Check out the cas_cache_max_size_bytes config in the worker and see if expanding it alleviates this issue.

werkt avatar Jul 24 '22 04:07 werkt

Fascinating. The directory creation was interrupted midstream, but a reference to the directory was added to a registry of dirs that should be destroyed in case of a failure. This only affects memory (OperationQueue) workers, as the order of addition was already switched in shard. I'll put up a change shortly to address this.

For what it's worth, your directory creation is also stalled long enough to incur the interrupt in the first place. If this is the case, it may be that your CAS storage is not sufficiently large to maintain the working set of blob entries for a pathological set of actions: if all of the CAS entries have references, and all reference holders are not executing (i.e. have completed their exec input population), you can deadlock due to a lack of available storage space. Check out the cas_cache_max_size_bytes config in the worker and see if expanding it alleviates this issue.

Thank you for your reply.

Unfortunately, expanding cas_cache_max_size_bytes did not fix the issue. I'm guessing perhaps it's a platform issue (I am using Windows).

Weirdly enough, after pulling the latest version of buildfarm, and setting link_input_directories to false, the issue was fixed. The same fix did not work on an older version of buildfarm.

aghoussaini avatar Jul 25 '22 07:07 aghoussaini