[hackathon/real life] non-cvmfs version of pilot does not run at RAL-LCG2
During the hackathon, pilot jobs at RAL-LCG2 kept failing. I was not able to retrieve the logs of the failed jobs, but I managed to retrieve the following excerpts from the running jobs:
pilot.log
Linking glibmm-2.4-2.66.3-h87e66e5_0
Linking gdk-pixbuf-2.42.12-hb9ae30d_0
error libmamba response code: -1 error message: Invalid argument
critical libmamba failed to execute pre/post link script for gdk-pixbuf
2024-06-06T12:52:35.119845Z DEBUG [InstallDIRAC] Return code of bash DIRACOS-Linux-x86_64.sh 2>&1: 1
2024-06-06T12:52:35.120660Z ERROR [InstallDIRAC] Could not install DIRACOS [ERROR 1]
2024-06-06T12:52:35.120768Z INFO [InstallDIRAC] Content of pilot.cfg
pilot.error
https://lbcertifdirac70.cern.ch unreacheable (this is normal!)
Traceback (most recent call last):
File "dirac-pilot.py", line 115, in <module>
command.execute()
File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotCommands.py", line 81, in wrapper
return func(self)
File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotCommands.py", line 427, in execute
self._localInstallDIRAC()
File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotCommands.py", line 330, in _localInstallDIRAC
self.exitWithError(retCode)
File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotTools.py", line 797, in exitWithError
with open("pilot.cfg") as f:
IOError: [Errno 2] No such file or directory: 'pilot.cfg'
Owner = "dteam077"
ActivationDuration = 187
Cmd = "/var/spool/arc/grid08/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/condorjob.sh"
User = "[email protected]"
LastMatchTime = 1717678179
StreamOut = false
JobPrio = 0
CumulativeRemoteUserCpu = 0.0
JobStartDate = 1717678179
MachineRalScaling = "$$([ifThenElse(isUndefined(RalScaling), ifThenElse(isUndefined(ScalingFactor), 1.00, ScalingFactor), RalScaling)])"
TargetType = "Machine"
LastPublicClaimId = "<130.246.219.48:9618?addrs=130.246.219.48-9618+[2001-630-54-10-82f6-db30--]-9618&alias=lcg2641.gridpp.rl.ac.uk&noUDP&sock=startd_4774_2538>#1715586785#9783#..."
TransferInputStats = [ CedarFilesCountTotal = 999; CedarFilesCountLastRun = 999 ]
OnExitRemove = true
RalAcctGroup = "group_DTEAM_OPS"
JobCurrentFinishTransferInputDate = 1717678179
OriginalTransferInput = "/var/spool/arc/grid08/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm"
scan-condor-job: ----- end condor history message -----
scan-condor-job: ----- Information extracted from condor_history -----
scan-condor-job: [email protected]
scan-condor-job: RemoteWallClockTime=187
scan-condor-job: RemoteUserCpu=0
scan-condor-job: RemoteSysCpu=0
scan-condor-job: ImageSize=40
scan-condor-job: ExitCode=1
scan-condor-job: ExitStatus=0
scan-condor-job: JobStatus=4
scan-condor-job: JobCurrentStartDate=1717678179
scan-condor-job: EnteredCurrentStatus=1717678366
scan-condor-job: RequestCpus=8
scan-condor-job: -----------------------------------------------------
scan-condor-job: LRMSStartTime=20240606124939Z
scan-condor-job: LRMSEndTime=20240606125246Z
scan-condor-job: Job failed with exit code 1
2024-06-06T12:53:18Z Job state change INLRMS -> FINISHING Reason: Job failure detected
2024-06-06T12:53:18Z Job state change FINISHING -> FINISHED Reason: Job failure detected
We've seen the same issue on our production instance, and we are working around it by getting the pilot from cvmfs. Simon thinks this might be related to: https://github.com/mamba-org/mamba/issues/2501 Note that this behaviour results in several hundred jobs per hour that then fail, and that this is how my DN got banned at RAL before. (Hence killing all user jobs targeting RAL before leaving the hackathon is a necessity.)
Isn't this solved by the last DIRACOS release?
Why did it show up in the hackathon then?
The last release was created only yesterday: https://github.com/DIRACGrid/DIRACOS2/releases/tag/2.42
Keep the ticket open until the workshop, when we do another hackathon?
We are facing a similar problem with the latest tag, 2.43, but only at some sites.
I don't know whether the error was also present before.
Below is the error we get:
Linking gdk-pixbuf-2.42.12-hb9ae30d_0
error libmamba response code: -1 error message: Permission denied
critical libmamba failed to execute pre/post link script for gdk-pixbuf
2024-11-07T12:15:41.289931Z WARN [InstallDIRAC] Could not install DIRACOS from CVMFS [ERROR 1]
2024-11-07T12:15:55.879473Z INFO [InstallDIRAC] Executing command bash DIRACOS-Linux-x86_64.sh 2>&1
Any suggestion?
Thank you.
Solution is somewhere in https://github.com/mamba-org/mamba/issues/2501
Run with ulimit -n 1048575
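For reference, here is a minimal Python sketch of how a wrapper could cap the open-file limit for itself and its child processes before launching the installer. It assumes the aim is to keep `RLIMIT_NOFILE` at or below 1048575, as suggested above. The function name `cap_nofile` is hypothetical, and this is not the code from any DIRAC or Pilot change. It only ever lowers the soft limit, which an unprivileged process is allowed to do, so no action by site admins is needed.

```python
import resource


def cap_nofile(limit=1048575):
    """Lower the soft RLIMIT_NOFILE to at most `limit` (hypothetical helper).

    The hard limit is left untouched and the soft limit is never raised,
    so the call works without extra privileges.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = limit
    # Never exceed the hard limit (unless it is unlimited).
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    # Never raise the current soft limit (unless it is unlimited).
    if soft != resource.RLIM_INFINITY:
        new_soft = min(new_soft, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

Child processes started afterwards (for example via `subprocess`) inherit the lowered limit, so calling `cap_nofile()` before running `bash DIRACOS-Linux-x86_64.sh` would have a similar effect to a soft `ulimit -n` in the shell.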
Yes thank you, but this means that ulimit must be changed by site admins, right?
We can set the limit in the pilot.
OK, but how can I do it? Thanks.
You do not have to do anything: #7891
OK thank you
Given that the above-mentioned PR is merged, I believe this issue can be closed.
I've updated the pilot version to DIRAC 8.0.60, but I still get the same error.
Any idea?
@fstagni Maybe open a new issue?
The change is on the server side. Did you also update the server?
Ah no, I hadn't noticed that... I thought it was only for the pilot. I will try to update the server as well and let you know.