pypiper time accumulation from profile is inaccurate

in a recent run I saw this:

> `Time`	4:14:23	PEPPRO	_RES_
> `Success`	08-09-14:18:00	PEPPRO	_RES_

##### [Epilogue:]
*   Total elapsed time:  11:18:17.100000

This was a single run. so why the huge discrepancy between 'time' and 'total elapsed time', since it's not coming from a previous run?

I think it's because we're double-counting piped processes. So the bowtie2 command is piped to samtools. these are different hashes, but they haev the same time:

413534	13e1541610	24	2:16:26.790000	0.8886	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413533	ea03362230	23	2:16:17.120000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413532	603f76945d	22	2:15:56.990000	3.5303	bowtie2	lock.aligned_hg38__GSM3618131_sort.bam
440203	cfc45bc51c	28	1:15:05.910000	0.888	samtools	lock.aligned_hg38__GSM3618131_sort_dups.bam
440201	c1225bc513	26	1:14:14.460000	3.5274	bowtie2	lock.aligned_hg38__GSM3618131_sort_dups.bam
440202	ea03362230	27	1:14:14.380000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort_dups.bam
411725	159cf5da58	18	0:07:52.500000	1.6758	fastq_pair	lock.fastq__GSM3618131_R2_trimmed_dups.fastq.paired.fq
409959	a492c7578f	14	0:06:47.670000	1.6388	fastq_pair	lock.fastq__GSM3618131_R2_trimmed.fastq.paired.fq
448047	04cc4259c8	55	0:01:45.390000	0.0143	preseq	lock.QC_hg38__GSM3618131_preseq_yield.txt

it must be duplicating that.

perhaps hashes should be done on commands rather than on individual steps in a pipeline?

Aug 12 '19 18:08 nsheff

in this example, processes 22, 23, and 24 were run simultaneously.

26,27,28 likewise.

they were piped streams. these two commands are getting triple counted, hence the almost tripled report of time spent.

Aug 12 '19 18:08 nsheff

good catch. Yeah, it looks like the subcommands in the piped command are timed with an offset equal to the current runtime for the entire pipe, which is wrong.

perhaps hashes should be done on commands rather than on individual steps in a pipeline?

good idea. Should be really easy, as well

Aug 12 '19 19:08 stolarczyk

what about hashing the actually pipeline run as well? you could has the time value in seconds since unix epoch...

that way, you'd know if two commands were run in the same run or in a different run.

Aug 13 '19 14:08 nsheff

good idea. Should be really easy, as well

then how do you calculate the time, by taking the maximum value for the hash? assuming that the maximum runtime across any piped commands is the time from all those piped commands which were running simultaneously? is that true though?

Aug 13 '19 14:08 nsheff

the reported time is not the max but the last one. However, the actual pipe order is not preserved in the profile.. so in this case:

413534	aaaaaaaa	24	2:16:26.790000	0.8886	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413533	aaaaaaaa	23	2:16:17.120000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413532	aaaaaaaa	22	2:15:56.990000	3.5303	bowtie2	lock.aligned_hg38__GSM3618131_sort.bam

the returned time for the pipe is 2:15:56.990000. Whereas the correct answer is 2:16:26.790000.

So what the current approach is missing (I just realized) is ordering the rows that are assigned the same hash by the command ID before the last one is returned.

what about hashing the actually pipeline run as well? you could has the time value in seconds since unix epoch...

that a good idea

Aug 13 '19 15:08 stolarczyk

another problem with accumulating time in this way is that it doesn't account for any time that is not spent in the 'run' command, right? I think we discussed this...

but what about time spent in, for example, result reporting functions or follow functions? These can actually be substantial.

Aug 13 '19 21:08 nsheff

pypiper pypiper copied to clipboard

time accumulation from profile is inaccurate

pypiper
pypiper copied to clipboard