pypiper
pypiper copied to clipboard
Calculation of feature coverage does not take into account bed columns
The calculate_frip command in ngstk.py calculates FrIP as the sum of the 5th column of the sambamba detph command. However the sambamba command outputs the original bed file adding its own columns. The content of column 5 is then dependent on how many columns there are in the bed file, leading to a wrong reads count.
Potential fix, although it's performing the sambamba step twice, so to speak; but should be flexible to differing column numbers from the input bed.
def calculate_frip(self, input_bam, input_bed, output, cpus=4):
cmd1 = self.tools.sambamba + " depth region -t {0}".format(cpus)
cmd1 += " -L {0}".format(input_bed)
cmd1 += " {0}".format(input_bam)
cmd2 += cmd1 + " | cut -f $(" + cmd1
cmd2 += " | head -1" + " | sed 's/\t/\n/g'" + " | grep -n 'readCount'" + " | cut -f1 -d:)"
cmd2 += " | awk '{{sum+=$1}} END {{print sum}}' > {0}".format(output)
return cmd2
~what do you mean it's performing the sambamba step twice? it only returns cmd2...~ I see now that cmd2 runs cmd1 two times...
But why not just calculate the number of columns in the original bed file and use that to select?
Yeah, that works. I honestly didn't even think of that. Only issue there is if sambamba ever changes location of its readCount column? Im assuming you're saying take column 5 + number of columns in input bed?
yeah.
Or why not just run the command once, and then run head on the outfile to get the readCount column and use that?
I had thought of that, but was avoiding trying to have another temp file to delete, if that is a concern...
Yeah I guess that's a concern...but I feel more concerned about running the thing twice. It can potentially be a long step, right?
Yes. you're likely right. Also, could use cut -f 1-3 {}.format(input_bed) and therefore the readCount column is always column 4. But that still requires a temporary file. So far, ngstk.py does contain an unused remove_file() function but doesn't utilize manager.py clean_add() and I wasn't sure if introducing a temp file was something intended for ngstk.py.
@afrendeiro I think you wrote the original calculate_frip function -- did you have any comment on this?
Actually I currently don't know why the command was written in such a complicated way initially (perhaps derived from another use case) but I think this should do it:
def calculate_frip(self, input_bam, input_bed, output, cpus=4):
return (self.tools.sambamba + " view -t {0} -c -L {1} {2} > {3}"
.format(cpus, input_bed, input_bam, output))