ChIPseeker icon indicating copy to clipboard operation
ChIPseeker copied to clipboard

options(ChIPseeker.downstreamDistance = 500) seems not work

Open shangguandong1996 opened this issue 4 years ago • 1 comments

Prerequisites

  • [x] Have you read Feedback and follow the guide?
    • [x] make sure your are using the latest release version
    • [x] read the documents
    • [x] google your quesion/issue

Describe you issue

  • [x] Make a reproducible example (e.g. 1)
  • [x] your code should contain comments to describe the problem (e.g. what expected and actually happened?)

Ask in right place

  • [x] for bugs or feature requests, post here (github issue)
  • [ ] for questions, please post to Bioconductor or Biostars with tag ChIPseeker

Hi, Dear developer Considering the issue #112 I mentioned before have not been fixed in 1.24.0, and I also find some related issue #121, I decide to find the reason by myself. And Please forgive me If I misunderstand something.

According to the code below https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/getGenomicAnnotation.R#L183-L187

function get the downstreamDistance defined by user, but I find it seems to be wrong. Because you use the precede to get the nearest downstream gene and get distance accoring to the code below. So it seems all the distance will be negative ? https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/getGenomicAnnotation.R#L148-L168

And is it inappropriate to use downstreamIndex <- dd2 > 0 & dd2 < dsd to get the downstream idx accoring to user's cutoff?

And I also find you actually do not change the content of annotation, which means whatever downstreamDistance is, the final result will not change because annotation is the content used instead of detailGenomicAnnotation. https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/getGenomicAnnotation.R#L190

The reason the Downstream(<=N) in the plot will change accoring to user is you did change variable in the plot. https://github.com/YuLab-SMU/ChIPseeker/blob/10216af5d314641f0a3aad294a3c1163e7ad328e/R/plotAnno.R#L185-L193

But all the content used is from annotation instead of detailGenomicAnnotation. And it seems annotation will not be influenceed by downstreamDistance https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/annotatePeak.R#L162-L164 https://github.com/YuLab-SMU/ChIPseeker/blob/10216af5d314641f0a3aad294a3c1163e7ad328e/R/plotAnno.R#L154-L157

And I also find some issue about it

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb <- TxDb.Hsapiens.UCSC.hg19.knownGene

files <- getSampleFiles()
peakAnno <- annotatePeak(files[[2]], tssRegion=c(-500, 0),
                         TxDb=TxDb, 
                         level = "gene")
# According to the peakAnno result, there will be about 20 peaks in the downstream. 
> peakAnno
Annotated peaks generated by ChIPseeker
2296/2296  peaks were annotated
Genomic Annotation Summary:
             Feature  Frequency
9           Promoter  0.7404181
4             5' UTR  0.6533101
3             3' UTR  1.5243902
1           1st Exon  0.4355401
7         Other Exon  2.6567944
2         1st Intron 14.5905923
8       Other Intron 31.6637631
6 Downstream (<=300)  0.8710801
5  Distal Intergenic 46.8641115

# But according to the result below, it seems no peak locating downstream
> sum(peakAnno@detailGenomicAnnotation$downstream)
[1] 0

Best wishes Guandong Shang

shangguandong1996 avatar Dec 17 '20 08:12 shangguandong1996

  1. The first question of misuse of precede(). I agree with it! An example is showed below.
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## get the sample files from chipseeker
files <- getSampleFiles()
peak <- readPeakFile(files[[4]])

## import getGene() from utilities.R
## import .ChIPseekerEnv() from utilities.R
## get the "transcript" from reference genome
features <- getGene(txdb, by="transcript")

> features
GRanges object with 73432 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id
          <Rle>         <IRanges>  <Rle> | <integer>
     1    chr19 58858172-58864865      - |     70455
     1    chr19 58859832-58874214      - |     70456
    10     chr8 18248755-18258723      + |     31944
   100    chr20 43248163-43280376      - |     72132
  1000    chr18 25530930-25616539      - |     65378
   ...      ...               ...    ... .       ...
  9994     chr6 90539619-90584155      + |     25089
  9997    chr22 50961997-50964033      - |     75292
  9997    chr22 50961997-50964034      - |     75293
  9997    chr22 50961997-50964570      - |     75294
  9997    chr22 50961997-50964905      - |     75295
           tx_name
       <character>
     1  uc002qsd.4
     1  uc002qsf.2
    10  uc003wyw.1
   100  uc002xmj.3
  1000  uc010xbn.1
   ...         ...
  9994  uc011dzz.2
  9997  uc003bma.3
  9997  uc003blz.4
  9997  uc021wrz.1
  9997  uc021wsa.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

## use precede() to get the nearest upstream gene index
if (sameStrand) {
  idx <- precede(peak, features)
} else {
  idx <- precede(peak, unstrand(features))
}

dd <- ifelse(strand(peF) == "+", 
             start(peaks) - end(peF), 
             end(peaks) - start(peF)) 

head(dd)
> head(dd)
[1]  -35070   -3769 -375208  -86583 -196097   -1702

## it is clear that the distance is a negative number

## try an gene out to see the position of peak and gene
head(idx)
> head(idx)
[1]   520  9934 52073 56971 57670 32704


## As we can see the index of 
## the neareast upstream gene near the second peak of sample file is 9934.
## so we get a look get the the peak and "9934-transcript"
features[9934]
peak[2]

> features[9934]
GRanges object with 1 range and 2 metadata columns:
         seqnames          ranges strand |     tx_id
            <Rle>       <IRanges>  <Rle> | <integer>
  126789     chr1 1244466-1247057      + |        82
             tx_name
         <character>
  126789  uc009vjx.3
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> peak[2]
GRanges object with 1 range and 2 metadata columns:
      seqnames          ranges strand |          V4
         <Rle>       <IRanges>  <Rle> |    <factor>
  [1]     chr1 1243288-1244338      * | MACS_peak_2
             V5
      <numeric>
  [1]     63.19
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

#####################################################################
## it is clear that peak[2]->(1243288,1244338)
## features[9934]->(1244466,1247057)
## peak[2] is ahead of features[9934]

The reason is because precede() is to get the genes that are at the back of peaks.

precede: For each range in x, precede returns the index of the range in subject that is directly preceded by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.(https://www.rdocumentation.org/packages/GenomicRanges/versions/1.24.1/topics/nearest-methods)

So the negetive distance is not the fault, while getting the wrong genes is the point. The negetive distance only reflects that precede() gets the gene is at the back of peaks.

follow() can be used to get the genes in front of peaks.

follow: The opposite of precede, follow returns the index of the range in subject that is directly followed by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.(https://www.rdocumentation.org/packages/GenomicRanges/versions/1.24.1/topics/nearest-methods)

  1. The second question of not using annotation. I do not agree with it! It is right that getGenomicAnnotation() do not change the annotation. https://github.com/YuLab-SMU/ChIPseeker/blob/d1ca24742edabf0e9b08711244bcb6ec01dd11f5/R/getGenomicAnnotation.R#L169-L189 as it show below it only change the detailGenomicAnnotation but when using annotatePeak() , it use getGenomicAnnoStat(). getGenomicAnnoStat() change the annotation instead. https://github.com/YuLab-SMU/ChIPseeker/blob/d1ca24742edabf0e9b08711244bcb6ec01dd11f5/R/plotAnno.R#L191-L208 That is not the reason why options(ChIPseeker.downstreamDistance = 500) do not work, the reason is because https://github.com/YuLab-SMU/ChIPseeker/issues/112#issuecomment-818660296.

MingLi-929 avatar Apr 13 '21 11:04 MingLi-929