GenomicFeatures pmapFromTranscripts strange behaviour

Open Roleren opened this issue 6 years ago • 12 comments

There is some strange behaviour in how pmapFromTranscripts handles names on the transcripts. Sorry for the bad test data, I made this quickly:

tx is a GRangesList of 100,000 transcripts; ranges holds 600,000 ORFs on those transcripts as an IRanges; orfs$index gives, for each ORF, the index of the transcript it came from.

See how the timings differ:

1. Without names:

grl <- tx 
names(grl) <- NULL
system.time(pmapFromTranscripts(x = ranges, transcripts = grl[orfs$index]))
   user  system elapsed 
 19.661   1.701  21.355

2. With names:

grl <- tx
system.time(pmapFromTranscripts(x = ranges, transcripts = grl[orfs$index]))
   user  system elapsed 
 74.474   3.616  78.071

3. Without names, setting them afterwards, so the result is the same as in 2:

names(grl) <- NULL
system.time({genomic <- pmapFromTranscripts(x = ranges, transcripts = grl[orfs$index]);
                     names(genomic) <- names(tx)[orfs$index] })
   user  system elapsed 
 19.963   1.634  21.591

So case 2 is almost 4 times slower (about 78 s versus 21 s elapsed), while we could have done case 3, which gives the same result and is as fast as case 1.
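For anyone who wants to try the workaround from case 3 on something small, here is a minimal sketch. The toy transcripts, the `orfs_index` vector and the helper name `map_without_names` are made up for illustration and are not the data timed above; the helper only wraps the "drop names, map, put names back" steps.

```r
library(GenomicFeatures)  # attaches GenomicRanges and IRanges as dependencies

# Hypothetical toy data: two transcripts with two exons each
tx <- GRangesList(
  tx1 = GRanges("chr1", IRanges(c(1, 101), c(50, 150)), strand = "+"),
  tx2 = GRanges("chr1", IRanges(c(201, 301), c(250, 350)), strand = "+")
)

# Three ORFs in transcript coordinates, plus the transcript each one belongs to
ranges <- IRanges(start = c(1, 40, 10), end = c(30, 60, 20))
orfs_index <- c(1L, 1L, 2L)

# Hypothetical helper: map with unnamed transcripts, then restore the names
map_without_names <- function(x, transcripts, index) {
  nms <- names(transcripts)[index]   # remember the names we want on the result
  names(transcripts) <- NULL         # strip names before the mapping call
  out <- pmapFromTranscripts(x = x, transcripts = transcripts[index])
  names(out) <- nms                  # reattach names afterwards
  out
}

genomic <- map_without_names(ranges, tx, orfs_index)
```

On data this small there is no measurable time difference; the sketch only shows the shape of the workaround.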

Is this intentional?

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicFeatures_1.33.2 GenomicRanges_1.33.13 IRanges_2.15.17

...

Sep 19 '18 10:09 Roleren
