pypiper
pypiper copied to clipboard
time accumulation from profile is inaccurate
in a recent run I saw this:
> `Time` 4:14:23 PEPPRO _RES_
> `Success` 08-09-14:18:00 PEPPRO _RES_
##### [Epilogue:]
* Total elapsed time: 11:18:17.100000
This was a single run. so why the huge discrepancy between 'time' and 'total elapsed time', since it's not coming from a previous run?
I think it's because we're double-counting piped processes. So the bowtie2 command is piped to samtools. these are different hashes, but they haev the same time:
413534 13e1541610 24 2:16:26.790000 0.8886 samtools lock.aligned_hg38__GSM3618131_sort.bam
413533 ea03362230 23 2:16:17.120000 0.0046 samtools lock.aligned_hg38__GSM3618131_sort.bam
413532 603f76945d 22 2:15:56.990000 3.5303 bowtie2 lock.aligned_hg38__GSM3618131_sort.bam
440203 cfc45bc51c 28 1:15:05.910000 0.888 samtools lock.aligned_hg38__GSM3618131_sort_dups.bam
440201 c1225bc513 26 1:14:14.460000 3.5274 bowtie2 lock.aligned_hg38__GSM3618131_sort_dups.bam
440202 ea03362230 27 1:14:14.380000 0.0046 samtools lock.aligned_hg38__GSM3618131_sort_dups.bam
411725 159cf5da58 18 0:07:52.500000 1.6758 fastq_pair lock.fastq__GSM3618131_R2_trimmed_dups.fastq.paired.fq
409959 a492c7578f 14 0:06:47.670000 1.6388 fastq_pair lock.fastq__GSM3618131_R2_trimmed.fastq.paired.fq
448047 04cc4259c8 55 0:01:45.390000 0.0143 preseq lock.QC_hg38__GSM3618131_preseq_yield.txt
it must be duplicating that.
perhaps hashes should be done on commands rather than on individual steps in a pipeline?
in this example, processes 22, 23, and 24 were run simultaneously.
26,27,28 likewise.
they were piped streams. these two commands are getting triple counted, hence the almost tripled report of time spent.
good catch. Yeah, it looks like the subcommands in the piped command are timed with an offset equal to the current runtime for the entire pipe, which is wrong.
perhaps hashes should be done on commands rather than on individual steps in a pipeline?
good idea. Should be really easy, as well
what about hashing the actually pipeline run as well? you could has the time value in seconds since unix epoch...
that way, you'd know if two commands were run in the same run or in a different run.
good idea. Should be really easy, as well
then how do you calculate the time, by taking the maximum value for the hash? assuming that the maximum runtime across any piped commands is the time from all those piped commands which were running simultaneously? is that true though?
the reported time is not the max but the last one. However, the actual pipe order is not preserved in the profile.. so in this case:
413534 aaaaaaaa 24 2:16:26.790000 0.8886 samtools lock.aligned_hg38__GSM3618131_sort.bam
413533 aaaaaaaa 23 2:16:17.120000 0.0046 samtools lock.aligned_hg38__GSM3618131_sort.bam
413532 aaaaaaaa 22 2:15:56.990000 3.5303 bowtie2 lock.aligned_hg38__GSM3618131_sort.bam
the returned time for the pipe is 2:15:56.990000. Whereas the correct answer is 2:16:26.790000.
So what the current approach is missing (I just realized) is ordering the rows that are assigned the same hash by the command ID before the last one is returned.
what about hashing the actually pipeline run as well? you could has the time value in seconds since unix epoch...
that a good idea
another problem with accumulating time in this way is that it doesn't account for any time that is not spent in the 'run' command, right? I think we discussed this...
but what about time spent in, for example, result reporting functions or follow functions? These can actually be substantial.