pypiper icon indicating copy to clipboard operation
pypiper copied to clipboard

Calculation of feature coverage does not take into account bed columns

Open ghost opened this issue 7 years ago • 9 comments

The calculate_frip command in ngstk.py calculates FrIP as the sum of the 5th column of the sambamba detph command. However the sambamba command outputs the original bed file adding its own columns. The content of column 5 is then dependent on how many columns there are in the bed file, leading to a wrong reads count.

ghost avatar Jul 31 '18 08:07 ghost

Potential fix, although it's performing the sambamba step twice, so to speak; but should be flexible to differing column numbers from the input bed.

def calculate_frip(self, input_bam, input_bed, output, cpus=4): cmd1 = self.tools.sambamba + " depth region -t {0}".format(cpus) cmd1 += " -L {0}".format(input_bed) cmd1 += " {0}".format(input_bam) cmd2 += cmd1 + " | cut -f $(" + cmd1 cmd2 += " | head -1" + " | sed 's/\t/\n/g'" + " | grep -n 'readCount'" + " | cut -f1 -d:)" cmd2 += " | awk '{{sum+=$1}} END {{print sum}}' > {0}".format(output) return cmd2

jpsmith5 avatar Aug 01 '18 15:08 jpsmith5

~what do you mean it's performing the sambamba step twice? it only returns cmd2...~ I see now that cmd2 runs cmd1 two times...

But why not just calculate the number of columns in the original bed file and use that to select?

nsheff avatar Aug 01 '18 15:08 nsheff

Yeah, that works. I honestly didn't even think of that. Only issue there is if sambamba ever changes location of its readCount column? Im assuming you're saying take column 5 + number of columns in input bed?

jpsmith5 avatar Aug 01 '18 15:08 jpsmith5

yeah.

Or why not just run the command once, and then run head on the outfile to get the readCount column and use that?

nsheff avatar Aug 01 '18 17:08 nsheff

I had thought of that, but was avoiding trying to have another temp file to delete, if that is a concern...

jpsmith5 avatar Aug 01 '18 18:08 jpsmith5

Yeah I guess that's a concern...but I feel more concerned about running the thing twice. It can potentially be a long step, right?

nsheff avatar Aug 01 '18 18:08 nsheff

Yes. you're likely right. Also, could use cut -f 1-3 {}.format(input_bed) and therefore the readCount column is always column 4. But that still requires a temporary file. So far, ngstk.py does contain an unused remove_file() function but doesn't utilize manager.py clean_add() and I wasn't sure if introducing a temp file was something intended for ngstk.py.

jpsmith5 avatar Aug 04 '18 16:08 jpsmith5

@afrendeiro I think you wrote the original calculate_frip function -- did you have any comment on this?

nsheff avatar Oct 22 '18 12:10 nsheff

Actually I currently don't know why the command was written in such a complicated way initially (perhaps derived from another use case) but I think this should do it:

def calculate_frip(self, input_bam, input_bed, output, cpus=4):
	return (self.tools.sambamba + " view -t {0} -c  -L {1}  {2} > {3}"
               .format(cpus, input_bed, input_bam, output))

afrendeiro avatar Oct 23 '18 09:10 afrendeiro