Random Cluster Bugs
Hi @candidechamp, the new copying back from the cluster scratch failed for me! The run did not copy all files back to the work folder (the cnf files are missing). It looks like the copying after the run fails, or the scratch folder is somehow not found anymore?
Does that also happen for you?
I've attached the output file here: CHK1_nd5_enr3_complex_prod_1SS_21r_3_sopt4_rb3_max8_md.txt
Did you perhaps not specify that you want the calculation to run in the scratch directory? What does your input job_name.sh look like?
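For reference, when the run is set up to use the local scratch, the generated job script should roughly follow this pattern (just a sketch with placeholder paths, placeholder resource requests, and a placeholder MD call, not the exact script pygromos writes):

```bash
#!/bin/bash
#BSUB -n 8                          # placeholder core count
#BSUB -W 04:00                      # placeholder wall-time
#BSUB -o job_system_name.out
#BSUB -R "rusage[scratch=2000]"     # local scratch request per core (MB), if your cluster requires it

# placeholder directories - the generated script uses the real simulation paths
WORK_DIR=/cluster/work/project/system_name
SCRATCH_DIR="${TMPDIR}/system_name"

# stage the input files onto the node-local scratch
mkdir -p "${SCRATCH_DIR}"
cp -r "${WORK_DIR}"/input/. "${SCRATCH_DIR}/"

# run the simulation inside the scratch directory
cd "${SCRATCH_DIR}"
# ... MD engine call goes here (placeholder) ...

# copy the results (cnf, trc, tre, ...) back to the work folder
cp -r "${SCRATCH_DIR}"/. "${WORK_DIR}/"
```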
Another idea could be that you are not on the correct pygromos commit (though I checked the version I have and it doesn't have any major difference from pygromos v1).
I don't pass the work_dir flag, so I should be using the default of the new branch.
The pygromos version is definitely correct, as it is the standard one for reeds.
I wonder if there was a cluster anomaly, or if the approach with the ssh script is not robust?
I'll also check out the branch and see if it works for me
We now have the impression that this might be related to a temporary communication problem between the nodes. So for now, let's collect all the awkward pipeline bugs here and maybe we can make some sense of them. The problems occur only rarely for me.
For me, the same thing happened: after checking out the newest version of the eoff rebalancing branch (which includes the minor rework of the submission pipeline), only the files from one node are copied back correctly; the rest are missing...
Does it (usually) work for you @candidechamp @SchroederB even when the job is distributed among different nodes?
@SalomeRonja I haven't had a single issue so far. I just diffed my local branch against origin/main and I don't see anything wrong.
Are you 100% sure the job_system_name.sh files which submit the calculation were generated by the new code?
Ah, after a closer look, the problem was that the job timed out - I didn't think to increase duration_per_job now that the simulation and cleanup are done in one job. After I increased it, it worked fine :)
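For reference: since the simulation and the cleanup now run in a single job, the requested wall-time has to cover both the MD run and the copy-back from scratch. Assuming the submission header uses the usual LSF wall-time option, that corresponds to something like this (the value is just a placeholder, set via duration_per_job):

```bash
# wall-time must now cover simulation + copy-back from local scratch
# (placeholder value; controlled by the duration_per_job setting)
#BSUB -W 26:00   # e.g. ~24 h of MD plus margin for copying results back
```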
@SalomeRonja Thanks for looking into it. That's unfortunately a drawback we can't really do anything about when running multi-node jobs. If the wall-time is reached, we have no way of getting the data; the only people who could fix this are the developers of the LSF queuing system.
@candidechamp but we can still fall back to the old work_dir flag if desired, right?
@SchroederB You can, but this makes the cluster slow for everyone.
Ugh, was that stated by the cluster support? I thought you told me it was not such a big deal?
Oh no, sorry, actually the cluster people said something slightly different:
""" If a program does "nice" and large writes/reads, then you won't notice the latency difference. When a program does "bad" I/O (a lot of small random reads/writes, only a few bytes per read/write), then the latency will kick in and make everything slow.
An advantage of the local scratch is that it is independent. If people do stupid stuff on /cluster/work, it will slow down the entire file system, i.e., your job could be negatively affected by the actions of other users, while you don't have this problem on the local scratch.
Copying the data from/to local scratch can even be optimized and parallelized (using gnu parallel to untar several tar archives in parallel to $TMPDIR, using multiple cores). In some test a user could copy 360 GB of data from /cluster/work to $TMPDIR within like 3 or 4 minutes. When a job runs for several hours, then a few minutes will not cause a lot of overhead compared to the total runtime. """
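For what it's worth, the parallel staging they describe would look roughly like this (a sketch; the archive path is a placeholder):

```bash
# untar several input archives in parallel into the node-local scratch,
# using GNU parallel with one job per available core (as suggested by cluster support)
ls /cluster/work/project/input_archives/*.tar | parallel -j "$(nproc)" tar -xf {} -C "$TMPDIR"
```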
Ah ok, yes, I think the default should be the scratch solution on the node. Just in case we want to test or debug something, it's still nice to keep the option of opting out.
@SalomeRonja @epbarros I think this issue may be closed now?