DIRAC

Remove the pilot executables in a deterministic way

Open aldbr opened this issue 4 years ago • 6 comments

The Site Director is responsible for generating pilot wrappers, submitting them to the various Computing Element (CE) communication interfaces, and deleting them afterwards. CE interfaces may modify pilot wrappers (e.g. to bundle a proxy) before submitting them to a Computing Element or an LRMS; in that case, they are responsible for removing the modified pilot wrappers afterwards.

Nevertheless, some of the CE interfaces that we use need to keep the pilot wrappers even after the submission step: it might take time before a wrapper is actually uploaded to the LRMS (HTCondor), or the wrapper might be used as input to another program and need to be present until the end of the execution (LocalCE + ParallelLibrary).

Currently, in these cases, the Site Director delegates the cleanup of the executables to the CE interfaces, as you can see here. However, these interfaces are simple and do not store any information about past submissions, so they have to run find calls to delete old files based on their date (e.g. HTCondorCE).
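To illustrate why this is non-deterministic, here is a minimal sketch of an age-based cleanup, roughly what a find call on modification time amounts to. It is not the actual HTCondorCE code: the function name and the `max_age_seconds` parameter are hypothetical, and the age threshold is exactly the kind of guess the proposal below tries to avoid.

```python
import os
import time


def remove_old_executables(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` older than `max_age_seconds`.

    Hypothetical helper: the decision is based only on the file's mtime,
    not on whether the corresponding pilot has actually finished.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so we do not delete while iterating.
    for entry in list(os.scandir(directory)):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            os.unlink(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

A wrapper that is still needed but older than the threshold would be removed, and a stale one younger than the threshold would stay: the file age says nothing about the pilot state.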

To avoid that, I propose a new approach:

  • The Site Director always keeps the responsibility of removing the executable files that it generated.
  • The CE interfaces that need to keep executable files longer return this information in a dictionary after submitting the job, such as:
def submitJob(...):
    ...
    # Existing code
    result = S_OK(jobIDs)
    result['PilotStampDict'] = stampDict

    # What we could add
    result['ExecutableToKeep'] = executablePath

    return result

They can also return an executable that they have modified, so they basically say to the caller: "I cannot manage the executable I just created, sorry, here it is if you want to do something with it".

  • The Site Director, after submitting a pilot wrapper, checks whether the executable can be removed immediately:
submitResult = ce.submitJob(executable, '', pilotSubmissionChunk)

if 'ExecutableToKeep' not in submitResult or submitResult['ExecutableToKeep'] != executable:
    os.unlink(executable)
...
# maybe after registering the pilots in the PilotAgentsDB
# if there is an executable to keep and that has not been changed by the CE interface, we store it
if 'ExecutableToKeep' in submitResult:
    self._storeExecutable(submitResult)
  • The SiteDirector._storeExecutable() would add a new entry in a new table named PilotExecutable or PilotWrapper defined in PilotAgents.sql such as:
CREATE TABLE `PilotWrapper` (
  `PilotID` INT(11) UNSIGNED NOT NULL,
  `ExecutablePath` VARCHAR(255) NOT NULL DEFAULT 'Unknown',
  `Deleted` ENUM('True','False') NOT NULL DEFAULT 'False',
  PRIMARY KEY (`PilotID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

  • When monitoring the pilots, the Site Director could then check this table, find all the executable paths that do not belong to pilots that are still running, and delete them, using a method in PilotAgentsDB.py that would perform:
SELECT ExecutablePath
FROM PilotWrapper
WHERE PilotWrapper.Deleted = 'False'
  AND CE = <CE>
  AND Queue = <Queue>
  AND ExecutablePath NOT IN (SELECT ExecutablePath
                             FROM PilotWrapper, PilotAgents
                             WHERE PilotWrapper.PilotID = PilotAgents.PilotID
                             AND PilotAgents.Status = <unfinished_status>)
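The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions, not DIRAC code: it uses an in-memory SQLite database instead of MySQL, a reduced `PilotAgents` table with only `PilotID` and `Status`, a guessed list of unfinished statuses, and hypothetical helper names (`make_db`, `handle_submission`, `clean_finished`). The CE/Queue filter is left out for brevity.

```python
import os
import sqlite3

# Assumption: statuses considered "unfinished" for the purpose of this sketch.
UNFINISHED = ("Submitted", "Waiting", "Running")


def make_db():
    """Create simplified stand-ins for the PilotAgents and PilotWrapper tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE PilotAgents (PilotID INTEGER PRIMARY KEY, Status TEXT)")
    db.execute(
        "CREATE TABLE PilotWrapper (PilotID INTEGER PRIMARY KEY, "
        "ExecutablePath TEXT NOT NULL, Deleted TEXT NOT NULL DEFAULT 'False')"
    )
    return db


def handle_submission(db, executable, submit_result, pilot_ids):
    """Remove the wrapper now unless the CE asked to keep that very file."""
    kept = submit_result.get("ExecutableToKeep")
    if kept != executable:
        os.unlink(executable)
    if kept is not None:
        # Remember which pilots depend on the kept file.
        db.executemany(
            "INSERT INTO PilotWrapper (PilotID, ExecutablePath) VALUES (?, ?)",
            [(pilot_id, kept) for pilot_id in pilot_ids],
        )


def clean_finished(db):
    """Delete kept wrappers that no unfinished pilot refers to any more."""
    placeholders = ",".join("?" * len(UNFINISHED))
    rows = db.execute(
        f"""SELECT DISTINCT ExecutablePath FROM PilotWrapper
            WHERE Deleted = 'False' AND ExecutablePath NOT IN (
                SELECT w.ExecutablePath FROM PilotWrapper w
                JOIN PilotAgents a ON a.PilotID = w.PilotID
                WHERE a.Status IN ({placeholders}))""",
        UNFINISHED,
    ).fetchall()
    removed = []
    for (path,) in rows:
        if os.path.exists(path):
            os.unlink(path)
        db.execute(
            "UPDATE PilotWrapper SET Deleted = 'True' WHERE ExecutablePath = ?",
            (path,),
        )
        removed.append(path)
    return removed
```

With this shape, the decision to delete a file depends only on the recorded pilot statuses, so no date-based guess is involved.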

What do you think of this solution? @andresailer

aldbr avatar May 06 '21 12:05 aldbr

Have you tried adding -spool to https://github.com/DIRACGrid/DIRAC/blob/9ac5c0f26e26b38527dec32a8e22382062f842ab/src/DIRAC/Resources/Computing/HTCondorCEComputingElement.py#L305 and seeing if that means you can directly remove the executable? cf. https://research.cs.wisc.edu/htcondor/manual/v8.7/Condorsubmit.html
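A minimal sketch of what adding the option could look like, assuming the submission command is built as a list of arguments. The helper name `build_condor_submit_command` is hypothetical and is not the code at the linked line; `-terse` and `-spool` are existing condor_submit options. Whether the local executable can really be removed right after such a submission is what would need to be checked.

```python
def build_condor_submit_command(submit_file, spool=True):
    """Build a condor_submit command line, optionally spooling input files.

    Hypothetical helper for illustration only.
    """
    cmd = ["condor_submit", "-terse"]
    if spool:
        # Ask condor_submit to spool the input files to the schedd at submission time.
        cmd.append("-spool")
    cmd.append(submit_file)
    return cmd
```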

andresailer avatar May 06 '21 13:05 andresailer

Well, that would be much simpler I have to admit. I will give it a try. If this works, we can go for that for HTCondorComputingElement and LocalCE + Condor.

There is still the case where the LocalCE uses a ParallelLibrary, which would need to keep the executable until the end of the execution, I guess. The LocalCE seems to be rarely used, so maybe it is not so important if it does not clean the executable files after their execution... Any thoughts about this case?

aldbr avatar May 06 '21 13:05 aldbr

The HTCondor option might be good for HTCondor, but of course we don't have only that use case. Storing the executable in a DB (not sure if MySQL is best...) is certainly more generic, even though it adds a layer of complication.

fstagni avatar May 06 '21 14:05 fstagni

Another solution for the LocalCE+ParallelLibrary could consist of embedding the content of the executable in the parallel library executable. For instance the srun (a ParallelLibrary class) wrapper could contain:

cat > executable.sh <<'EOF'
%(pilotWrapperContent)s
EOF
srun -l -k executable.sh

This solution would allow the pilot wrappers to be deleted right after their submission.

  • The combination of this solution + the spool parameter for HTCondor would resolve (correct me if I am wrong) all the current problems related to this issue: no more executables to delete in a non-deterministic way with find calls.
  • But this is not generic and we still might have issues in the future with new types of CEs...
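As a rough sketch of the embedding idea, assuming the srun wrapper is generated from a Python template: the function name `build_parallel_wrapper` and the template constant are hypothetical, not the existing ParallelLibrary code. The heredoc delimiter is quoted here so that the shell does not expand `$VARIABLES` found in the pilot wrapper when writing it out.

```python
SRUN_WRAPPER_TEMPLATE = """#!/bin/bash
cat > executable.sh <<'EOF'
%(pilotWrapperContent)s
EOF
srun -l -k executable.sh
"""


def build_parallel_wrapper(pilot_wrapper_path, output_path):
    """Write an srun wrapper that carries the pilot wrapper content inline.

    Hypothetical helper: once the output file exists, the original pilot
    wrapper is no longer needed by the generated script.
    """
    with open(pilot_wrapper_path) as wrapper_file:
        content = wrapper_file.read()
    # A line equal to the delimiter would end the heredoc too early.
    if any(line == "EOF" for line in content.splitlines()):
        raise ValueError("pilot wrapper contains the heredoc delimiter")
    with open(output_path, "w") as output_file:
        output_file.write(
            SRUN_WRAPPER_TEMPLATE % {"pilotWrapperContent": content.rstrip("\n")}
        )
    return output_path
```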

aldbr avatar May 10 '21 13:05 aldbr

  • The combination of this solution + the spool parameter for HTCondor would resolve (correct me if I am wrong) all the current problems related to this issue: no more executables to delete in a non-deterministic way with find calls.
  • But this is not generic and we still might have issues in the future with new types of CEs...

I think you already answered yourself here.

fstagni avatar May 10 '21 16:05 fstagni

Update: IIRC, the main "issue" is still the usage of the local scheduler in HTCondorCE here.

  • 3 years ago, I proposed to stop using it in https://github.com/DIRACGrid/DIRAC/pull/5137, but then we realized that submissions with a local scheduler were much faster than with a remote scheduler (likely because the submission task was offloaded from the Site Director to the local scheduler and done in parallel): https://github.com/DIRACGrid/DIRAC/pull/5137#issuecomment-900278485
  • 1 year ago, I proposed https://github.com/DIRACGrid/DIRAC/pull/7110 to speed up the Site Director from v9. The Site Director can now submit pilots to different CEs in parallel.

I think it's worth measuring the submission duration during the next hackathon to see if it now matches the performance of the local scheduler. Submitting pilots to a single CE will obviously still be faster with a local scheduler, but I expect the performance to be roughly similar when submitting pilots to many CEs.
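A toy sketch of how such a comparison could be timed, under loud assumptions: `submit_to_ce` is a stand-in that only sleeps to mimic submission latency, and the thread pool is just one possible way to submit to several CEs in parallel, not necessarily how the Site Director does it.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def submit_to_ce(ce_name, delay=0.05):
    """Stand-in for a pilot submission: sleeps instead of contacting a CE."""
    time.sleep(delay)
    return ce_name


def timed(func, *args):
    """Return the function result and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def submit_sequentially(ce_names):
    """Submit to each CE one after the other."""
    return [submit_to_ce(ce_name) for ce_name in ce_names]


def submit_in_parallel(ce_names, max_workers=8):
    """Submit to all CEs concurrently, keeping the input order of results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(submit_to_ce, ce_names))
```

With real CEs, the same `timed` wrapper could be put around the actual submission calls to compare the local-scheduler and remote-scheduler setups over the same set of CEs.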

aldbr avatar Nov 29 '24 09:11 aldbr