
Socket timeout error during error handling

Open xiaoruiDong opened this issue 5 years ago • 0 comments

Recently, the C3DDB server has been suffering from multiple issues, both with the file system and with the queue software, which helps us identify more cases that ARC needs to consider. For example, today neither I (manually) nor ARC could check the job status by running squeue -u xiaorui for a while. The command returned an error like:

squeue: error: slurm_receive_msg: Socket timed out on send/recv operation, 
slurm_load_jobs error: Socket timed out on send/recv operation

I have found a case where this can be problematic. When ARC submits a new job and cannot read the queue info, it regards the job as finished. However, when it reads the outputs, it finds that the job never started. ARC then tries another launch, which, of course, fails again. The job is then terminated and marked as having an error. I guess there are other potential risks as well. It would be worthwhile to have ARC handle this kind of situation.
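One possible direction, as a rough sketch only: treat a failed squeue call as "queue state unknown" rather than "no jobs running", and retry before deciding anything about the job. This is not ARC's actual code. The names get_running_job_ids, run_squeue, and QueueUnavailableError are hypothetical, the retry count and wait time are arbitrary, and the parsing assumes the default squeue output with a header line starting with JOBID.

```python
import subprocess
import time
from typing import Callable, List, Tuple

# Substring taken from the error message quoted above.
TIMEOUT_MARKER = "Socket timed out on send/recv operation"


class QueueUnavailableError(Exception):
    """Hypothetical exception: the queue state could not be determined."""


def run_squeue(user: str) -> Tuple[str, str]:
    """Run `squeue -u <user>` and return (stdout, stderr)."""
    proc = subprocess.run(
        ["squeue", "-u", user], capture_output=True, text=True
    )
    return proc.stdout, proc.stderr


def get_running_job_ids(
    user: str,
    run: Callable[[str], Tuple[str, str]] = run_squeue,
    max_attempts: int = 5,          # arbitrary choice for this sketch
    wait_seconds: float = 30.0,     # arbitrary choice for this sketch
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the job IDs listed by squeue, retrying on socket timeouts.

    Raises QueueUnavailableError instead of returning an empty list when
    every attempt times out, so the caller cannot mistake "unknown" for
    "all jobs finished".
    """
    for attempt in range(1, max_attempts + 1):
        stdout, stderr = run(user)
        if TIMEOUT_MARKER not in stdout + stderr:
            job_ids = []
            for line in stdout.splitlines():
                fields = line.split()
                # Skip blank lines and the assumed default header line.
                if fields and fields[0] != "JOBID":
                    job_ids.append(fields[0])
            return job_ids
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise QueueUnavailableError(
        f"squeue timed out {max_attempts} times; job status is unknown"
    )
```

With something like this, the caller would only mark a job as finished when get_running_job_ids returns normally and the job ID is absent. On QueueUnavailableError it would postpone the check instead of reading outputs or relaunching.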

xiaoruiDong · May 08 '20 04:05