nanopore-basecalling-scripts

Why not add "--ignore-existing" to the rsync command? Also, is there any solution to "sequencing_summary.txt" getting overwritten?

Open danarte opened this issue 7 years ago • 4 comments

When I run rsync with the command you specified (without deleting the original files), I see two "issues" that my suggestion might solve:

  1. In each loop all the .fast5 files are printed to the output.
  2. Also, the fast5 files on the server are getting "modified" on each loop, so I can't tell when a file was last actually copied and updated. So is there a reason not to add --ignore-existing to rsync and thereby avoid those issues? (I guess those issues arise only when I don't delete the original files with rsync.)
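A minimal sketch of the suggested change, assuming a POSIX-style shell and hypothetical source and destination directories (the script's actual rsync command is not quoted in this thread). `--ignore-existing` tells rsync to skip any file that already exists at the destination, so those files are neither re-sent nor touched on later loops. One caveat: a file that was copied while it was still being written at the source will not be refreshed afterwards.

```shell
# Hypothetical helper: copy only files that are not yet present at the destination.
# Both arguments are placeholder directories, not paths taken from the original script.
sync_new_reads() {
    local src="$1"
    local dest="$2"
    # -a preserves timestamps and permissions;
    # --ignore-existing skips every file that already exists on the receiving side.
    rsync -a --ignore-existing "$src"/ "$dest"/
}

# Example call (placeholder paths):
#   sync_new_reads /data/reads user@server:/data/reads
```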

Another question I had: the "sequencing_summary.txt" and "pipeline.log" files are overwritten on every loop because each loop runs albacore. Is there any way to avoid losing information from those files? Maybe append the new lines to the existing files, or create new files each time?

Thanks in advance, hope you could find the time to answer my questions.

danarte avatar Jul 23 '17 09:07 danarte

All good suggestions. Personally I don't mind that all the files are output to the screen. Regarding point 2, I probably don't see this because my file systems are mounted with 'noatime', which I recommend when dealing with millions of files, but your suggestion is a good one for those who don't do that.

The overwriting of the Albacore output is a problem that I have been meaning to fix. I suggest that a step should be added to copy out the log files and add a unique postfix to the filename, perhaps the current timestamp.
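One way this step could look, as a sketch only: a hypothetical helper that copies a file to a sibling whose name carries a timestamp postfix. The function name and the timestamp format are assumptions; the format avoids colons so the names stay safe on most file systems.

```shell
# Hypothetical helper: keep a timestamped copy of a file before it is overwritten.
archive_with_timestamp() {
    local file="$1"
    local stamp
    # Colon-free timestamp, e.g. 20170723_094500 (format chosen for illustration).
    stamp=$(date +"%Y%m%d_%H%M%S")
    # Nothing to archive if the file has not been produced yet.
    [ -f "$file" ] || return 0
    cp -- "$file" "${file}.${stamp}"
}

# Example calls, assuming FLOWCELL is set by the calling script:
#   archive_with_timestamp "basecalls/$FLOWCELL/pipeline.log"
#   archive_with_timestamp "basecalls/$FLOWCELL/sequencing_summary.txt"
```

Two calls within the same second would produce the same name, so the later copy would replace the earlier one.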

Any pull requests appreciated if you wanted to tackle any of these issues.

nickloman avatar Jul 23 '17 09:07 nickloman

I can't decide if it's better to cat the sequencing_summary.txt files rather than keeping multiple versions. It's useful to parse info from there whilst basecalling is happening. Cumulative stats can be got very quickly that way.

mattloose avatar Jul 23 '17 10:07 mattloose

For now, I copy each file to a new file named with the current time on every loop:

    # Plain assignment (assuming bash, where `set CURTIME=...` would set $1 instead)
    CURTIME=$(date +"%m_%d_%T")
    cp "basecalls/$FLOWCELL/pipeline.log" "basecalls/$FLOWCELL/$CURTIME.pipeline.log"
    cp "basecalls/$FLOWCELL/sequencing_summary.txt" "basecalls/$FLOWCELL/$CURTIME.sequencing_summary.txt"

But I think Matt's suggestion is better. I believe we can simply append the content of "pipeline.log" to a "master-pipeline" file; since the file contains timestamps, it would be easy to follow what is going on. Then we can append all the lines from sequencing_summary.txt, except the first line (the header), to a "master-sequencing_summary" file.
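The append approach could be sketched roughly as follows. The function name and the exact master file names are assumptions for illustration. The header line is kept only the first time, when the master summary does not exist yet or is empty.

```shell
# Hypothetical helper: fold the latest albacore outputs into cumulative "master" files.
# $1 is the directory holding pipeline.log and sequencing_summary.txt.
append_run_outputs() {
    local dir="$1"
    local summary="$dir/sequencing_summary.txt"
    local log="$dir/pipeline.log"
    local master_summary="$dir/master-sequencing_summary.txt"
    local master_log="$dir/master-pipeline.log"

    # The log already carries timestamps, so a plain append keeps it readable.
    if [ -f "$log" ]; then
        cat "$log" >> "$master_log"
    fi

    if [ -f "$summary" ]; then
        if [ -s "$master_summary" ]; then
            # Master already has a header: append everything after line 1.
            tail -n +2 "$summary" >> "$master_summary"
        else
            # First run: copy the whole file, header included.
            cat "$summary" > "$master_summary"
        fi
    fi
}

# Example call, assuming FLOWCELL is set by the calling script:
#   append_run_outputs "basecalls/$FLOWCELL"
```

This should run once after each albacore run; calling it twice on the same outputs would duplicate rows in the master files.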

danarte avatar Jul 23 '17 11:07 danarte

To be honest - I think this stuff is best handled in the loop used to call these scripts rather than in the script itself?

mattloose avatar Jul 23 '17 16:07 mattloose