dockstore-cgpwgs

Frequent crashes at various steps of the workflow

Open suhrig opened this issue 5 years ago • 1 comments

Dear Keiran,

The containers available at Dockstore crash frequently in our environment. I tried versions 1.0.8, 1.1.2, 1.1.3, and 1.1.4. The crashes occur at random steps of the workflow, even for the same dataset, which led me to believe that this is a technical issue unrelated to the data. With few exceptions, I could not find any error messages in the log files. The *.wrapper.log files and the files inside the timings folder contained an exit code of 255, but other than that there was no hint about the source of the error in any of the log files.

After extensive debugging I managed to track down the crashes to two issues:

  • In order to launch a job, a command is written to a shell script file, for example WGS_tumor_vs_control/caveman/tmpCaveman/logs/Sanger_CGP_Caveman_Implement_caveman_estep.94.sh. This script is then made executable and called right after. Apparently, some versions/storage drivers of Docker have an issue with this. When there is no delay between making the script executable and running it, occasionally the change in permissions has not yet taken effect before the script is run, resulting in the error "Text file busy" and the termination of the workflow. Others have reported this issue, too: https://github.com/moby/moby/issues/9547. Supposedly, it helps to insert a sync or sleep 1 between making the script executable and running it. I cannot confirm this, because switching to Singularity fixed the issue for me, so I did not find out which scripts would need to be modified or try the change myself. Even though this is a bug in Docker rather than in the workflow itself, you might want to consider inserting a sync, because other users might run into the same error.

  • After solving the above issue, only about half of the runs crashed (rather than 9/10). The remaining crashes were caused by the need_backoff function in /opt/wtsi-cgp/lib/perl5/PCAP/Threaded.pm. The following line occasionally threw the error "Use of uninitialized value $one_min": $ret = 1 if($one_min > $self->{'system_cpus'}); I was unable to find out why $one_min is sometimes undefined. I tried writing the value of $uptime to STDERR to check whether the regex fails to match, but for reasons I do not understand the values were not written to the log files of the workflow. I tried replacing the uptime tool with something that is guaranteed to produce an output string matching the regex, but the error still occurred. At this point, I suspect that the call to the external command uptime from within Perl fails from time to time. I eventually gave up, since it takes days to reproduce the issue and I was able to avoid the crashes altogether by simply wrapping the offending line like this:

if (defined $one_min) {
  $ret = 1 if($one_min > $self->{'system_cpus'});
}
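
For the first issue, the suggested sync between chmod and execution could look roughly like the sketch below. This is only an illustration: run_job_script is a hypothetical helper, not a function from the workflow, and the script content written here is a placeholder for whatever command the workflow would generate. Whether sync actually prevents "Text file busy" in Docker is unverified, as noted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the workflow): write a command to a
# script file, make it executable, flush pending writes, then run it.
run_job_script() {
    script_path=$1
    shift
    # Write the job command into the script file (placeholder content).
    printf '#!/bin/sh\n%s\n' "$*" > "$script_path"
    chmod +x "$script_path"
    # Assumed workaround: flush filesystem buffers before executing,
    # so that the write and the permission change have been committed.
    sync
    "$script_path"
}
```

For example, run_job_script /tmp/job.sh echo hello writes a one-line script to /tmp/job.sh and runs it. Replacing sync with sleep 1 would be the other variant mentioned in the Docker issue.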

I assume you do not bump into these issues as often as I do, because you certainly would have noticed an error that affects a major fraction of the runs. I have no explanation as to why these two errors happen so frequently in our environment. Still, I was able to reproduce the issues on various systems (openSuSE/CentOS) with various kernel/Docker versions and various storage drivers, so other users might be affected, too. I therefore figured that it is reasonable to take precautions to circumvent the errors and wanted to give you this feedback.

Regards, Sebastian

suhrig • Sep 03 '18 10:09