bazel-buildfarm
bazel-buildfarm copied to clipboard
createLink while filePath exists
Hello,
In file CASFileCache.java, method putDirectoryFiles calls Files.createLink(filePath, cacheFilePath) without any try-catch. We found that in rare cases, the filePath symlink already exists before createLink is called. Every time we run that case in our project, this bug always shows up. It leads to an endless operation requeue.
We tried to give a simple example to reproduce this bug but failed. We don't know why this particular symlink always exists before createLink is called. How about adding try-catch to catch FileAlreadyExistsException here?
Of note, createLink actually creates a hard link, not a symbolic one.
Just to be clear, you're ending up with an ExecDirException in the worker log, if I'm tracing through this correctly, right?
As far as the link already existing, I'm going to need a little more information: are you certain that your workers are not sharing a filesystem/cache root, or that you only have a single worker, and that each isolated worker has only a single CASFileCache instance that refers to a single filesystem? There are extraordinary steps taken to ensure that within a single worker, directories are populated by a single actor, so I want to ensure that we exist within that lockspace.
"Of note, createLink actually creates a hard link, not a symbolic one."
Yes, it's a hard link, my mistake.
"Just to be clear, you're ending up with an ExecDirException in the worker log, if I'm tracing through this correctly, right?"
No. We use memory server and memory worker, not shard. ExecDirException appears in shard worker. Do you suggest we should use shard worker instead?
"are you certain that your workers are not sharing a filesystem/cache root, or that you only have a single worker, and that each isolated worker has only a single CASFileCache instance that refers to a single filesystem? "
We have 3 machines, each of them runs 2 worker processes. Each worker has its own cache directory.
For anything but toy demonstrations or debugging, I recommend using shard server/worker. That said, I don't see this being necessarily caused by that choice - CASFileCache should almost completely be the same between the two except for dependency injections.
That setup is a bit odd - the workers are designed so that it is less effective to run multiple on a single host representing a compute resource. Does it recur if you run the three machines with a single worker only?
"Does it recur if you run the three machines with a single worker only?"
Yes. I use only one machine with a single worker this time, and it recurs. I even try to set execute_stage_width and input_fetch_stage_width to 1.
Fascinating. Can you provide a) the exact copy of the exception from the worker, which will hopefully indicate a file path value, and b) a bf-cat TreeLayout of the root directory of the action's inputs? I'm concerned that our validation is not preventing a particular pattern of directory layouts that permits multiple paths to exist within an input root. No input contents should be required to validate this, but debugging this without a hint of the layout will be difficult.
Are you overriding the validation taking place in build.buildfarm.instance.server.AbstractServerInstance in any way?
I'm leery of just catching FileAlreadyExistsException, because there has to be some resolution to the file's existence, and the current behavior which prevents the execution from happening is failing safely enough. With the memory server, you're missing an errored behavior for operations which prevents executions from being requeued too many times - this is present on shard. I would consider adding this to memory simply for consistency (and unbounded loop prevention)'s sake.