aldbr
We've encountered an issue where some access tokens are occasionally rejected by the Computing Elements (CEs). The primary error observed on the CE side is shown below: ``` [2024-01-29 10:43:53]...
Last item of https://github.com/DIRACGrid/DIRAC/issues/7459 BEGINRELEASENOTES *WorkloadManagement CHANGE: new job management mechanism in the PushJobAgent ENDRELEASENOTES
The objective is to enhance the exploitation of HPCs with no external connectivity in DIRAC. The current workflow is limited: - The `PushJobAgent` only works if you use the `dirac-jobexec`...
Job Rescheduling: Set the job status and perform the rescheduling operation in a single command
Currently, there is a distinct separation in our workflow between marking a job as `RESCHEDULED` and performing the rescheduling action itself. This means if one process encounters an error, the...
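A minimal sketch of the single-command idea, assuming an in-memory job store: the status change and the rescheduling bookkeeping either both succeed or are both rolled back. `Job`, `JobStore`, `RescheduleError`, and `max_reschedules` are hypothetical names used for illustration only and are not part of the DIRAC API.

```python
from dataclasses import dataclass, field


class RescheduleError(Exception):
    """Raised when the (hypothetical) rescheduling step cannot complete."""


@dataclass
class Job:
    job_id: int
    status: str = "RUNNING"
    reschedule_count: int = 0


@dataclass
class JobStore:
    """Hypothetical in-memory stand-in for a job database."""

    jobs: dict = field(default_factory=dict)
    max_reschedules: int = 3  # assumed limit, for illustration only

    def reschedule(self, job_id: int) -> Job:
        """Mark the job as RESCHEDULED and update its counter as one operation."""
        job = self.jobs[job_id]
        # Remember the previous state so that a failure leaves nothing half-done.
        previous = (job.status, job.reschedule_count)
        try:
            job.status = "RESCHEDULED"
            if job.reschedule_count >= self.max_reschedules:
                raise RescheduleError(f"job {job_id} reached the reschedule limit")
            job.reschedule_count += 1
        except Exception:
            # Roll back both fields together, then let the caller see the error.
            job.status, job.reschedule_count = previous
            raise
        return job
```

With this shape, a caller never observes a job marked `RESCHEDULED` whose rescheduling step did not actually happen, because a failure restores the previous status and counter before the error propagates.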
The diversity and complexity of Computing Elements (CEs) in DIRAC have significantly increased, leading to a somewhat disorganized system. We have observed that the current structure, which broadly categorizes CEs...
As explained in https://github.com/DIRACGrid/DIRAC/pull/7025, the `_ssh_call()` method does not seem to work reliably. Several popular Python libraries can perform SSH operations, such as `Fabric` (built on Paramiko) (https://docs.fabfile.org/en/stable/). We...
The Site Director is responsible for generating and submitting pilot wrappers to various Computing Element communication interfaces, and deleting them afterwards. CE interfaces may modify pilot wrappers - e.g. to...
Currently, the [Watchdog](https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py) seems to compute the "[time left](https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L792)" based on the CPU work, which is the product of the CPUtime that we get from the underlying batch system, which...
Here is a potential issue I discovered while running pilots on a SLURM batch system, which could affect batch systems based on wallclock time (when CPU time left depends on real time...
Replace the DIRAC-specific `SSH` class with `fabric`. BEGINRELEASENOTES *Resources CHANGE: Replace SSH by fabric in SSHComputingElement ENDRELEASENOTES