ChIPseeker
ChIPseeker copied to clipboard
options(ChIPseeker.downstreamDistance = 500) seems not work
Prerequisites
- [x] Have you read Feedback and follow the guide?
- [x] make sure your are using the latest release version
- [x] read the documents
- [x] google your quesion/issue
Describe you issue
- [x] Make a reproducible example (e.g. 1)
- [x] your code should contain comments to describe the problem (e.g. what expected and actually happened?)
Ask in right place
- [x] for bugs or feature requests, post here (github issue)
- [ ] for questions, please post to Bioconductor or Biostars with tag
ChIPseeker
Hi, Dear developer Considering the issue #112 I mentioned before have not been fixed in 1.24.0, and I also find some related issue #121, I decide to find the reason by myself. And Please forgive me If I misunderstand something.
According to the code below https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/getGenomicAnnotation.R#L183-L187
function get the downstreamDistance defined by user, but I find it seems to be wrong. Because you use the precede
to get the nearest downstream gene and get distance accoring to the code below. So it seems all the distance will be negative ?
https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/getGenomicAnnotation.R#L148-L168
And is it inappropriate to use downstreamIndex <- dd2 > 0 & dd2 < dsd
to get the downstream idx accoring to user's cutoff?
And I also find you actually do not change the content of annotation
, which means whatever downstreamDistance is, the final result will not change because annotation
is the content used instead of detailGenomicAnnotation
.
https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/getGenomicAnnotation.R#L190
The reason the Downstream(<=N)
in the plot will change accoring to user is you did change variable in the plot.
https://github.com/YuLab-SMU/ChIPseeker/blob/10216af5d314641f0a3aad294a3c1163e7ad328e/R/plotAnno.R#L185-L193
But all the content used is from annotation
instead of detailGenomicAnnotation
. And it seems annotation
will not be influenceed by downstreamDistance
https://github.com/YuLab-SMU/ChIPseeker/blob/12dcf7aec24e6e32d1fc1aba622376f2c8f1ea36/R/annotatePeak.R#L162-L164
https://github.com/YuLab-SMU/ChIPseeker/blob/10216af5d314641f0a3aad294a3c1163e7ad328e/R/plotAnno.R#L154-L157
And I also find some issue about it
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb <- TxDb.Hsapiens.UCSC.hg19.knownGene
files <- getSampleFiles()
peakAnno <- annotatePeak(files[[2]], tssRegion=c(-500, 0),
TxDb=TxDb,
level = "gene")
# According to the peakAnno result, there will be about 20 peaks in the downstream.
> peakAnno
Annotated peaks generated by ChIPseeker
2296/2296 peaks were annotated
Genomic Annotation Summary:
Feature Frequency
9 Promoter 0.7404181
4 5' UTR 0.6533101
3 3' UTR 1.5243902
1 1st Exon 0.4355401
7 Other Exon 2.6567944
2 1st Intron 14.5905923
8 Other Intron 31.6637631
6 Downstream (<=300) 0.8710801
5 Distal Intergenic 46.8641115
# But according to the result below, it seems no peak locating downstream
> sum(peakAnno@detailGenomicAnnotation$downstream)
[1] 0
Best wishes Guandong Shang
- The first question of misuse of
precede()
. I agree with it! An example is showed below.
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## get the sample files from chipseeker
files <- getSampleFiles()
peak <- readPeakFile(files[[4]])
## import getGene() from utilities.R
## import .ChIPseekerEnv() from utilities.R
## get the "transcript" from reference genome
features <- getGene(txdb, by="transcript")
> features
GRanges object with 73432 ranges and 2 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
1 chr19 58858172-58864865 - | 70455
1 chr19 58859832-58874214 - | 70456
10 chr8 18248755-18258723 + | 31944
100 chr20 43248163-43280376 - | 72132
1000 chr18 25530930-25616539 - | 65378
... ... ... ... . ...
9994 chr6 90539619-90584155 + | 25089
9997 chr22 50961997-50964033 - | 75292
9997 chr22 50961997-50964034 - | 75293
9997 chr22 50961997-50964570 - | 75294
9997 chr22 50961997-50964905 - | 75295
tx_name
<character>
1 uc002qsd.4
1 uc002qsf.2
10 uc003wyw.1
100 uc002xmj.3
1000 uc010xbn.1
... ...
9994 uc011dzz.2
9997 uc003bma.3
9997 uc003blz.4
9997 uc021wrz.1
9997 uc021wsa.1
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
## use precede() to get the nearest upstream gene index
if (sameStrand) {
idx <- precede(peak, features)
} else {
idx <- precede(peak, unstrand(features))
}
dd <- ifelse(strand(peF) == "+",
start(peaks) - end(peF),
end(peaks) - start(peF))
head(dd)
> head(dd)
[1] -35070 -3769 -375208 -86583 -196097 -1702
## it is clear that the distance is a negative number
## try an gene out to see the position of peak and gene
head(idx)
> head(idx)
[1] 520 9934 52073 56971 57670 32704
## As we can see the index of
## the neareast upstream gene near the second peak of sample file is 9934.
## so we get a look get the the peak and "9934-transcript"
features[9934]
peak[2]
> features[9934]
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
126789 chr1 1244466-1247057 + | 82
tx_name
<character>
126789 uc009vjx.3
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> peak[2]
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | V4
<Rle> <IRanges> <Rle> | <factor>
[1] chr1 1243288-1244338 * | MACS_peak_2
V5
<numeric>
[1] 63.19
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
#####################################################################
## it is clear that peak[2]->(1243288,1244338)
## features[9934]->(1244466,1247057)
## peak[2] is ahead of features[9934]
The reason is because precede()
is to get the genes that are at the back of peaks.
precede: For each range in x, precede returns the index of the range in subject that is directly preceded by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.(https://www.rdocumentation.org/packages/GenomicRanges/versions/1.24.1/topics/nearest-methods)
So the negetive distance is not the fault, while getting the wrong genes is the point. The negetive distance only reflects that precede()
gets the gene is at the back of peaks.
follow()
can be used to get the genes in front of peaks.
follow: The opposite of precede, follow returns the index of the range in subject that is directly followed by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.(https://www.rdocumentation.org/packages/GenomicRanges/versions/1.24.1/topics/nearest-methods)
- The second question of not using
annotation
. I do not agree with it! It is right thatgetGenomicAnnotation()
do not change theannotation
. https://github.com/YuLab-SMU/ChIPseeker/blob/d1ca24742edabf0e9b08711244bcb6ec01dd11f5/R/getGenomicAnnotation.R#L169-L189 as it show below it only change thedetailGenomicAnnotation
but when usingannotatePeak()
, it usegetGenomicAnnoStat()
.getGenomicAnnoStat()
change theannotation
instead. https://github.com/YuLab-SMU/ChIPseeker/blob/d1ca24742edabf0e9b08711244bcb6ec01dd11f5/R/plotAnno.R#L191-L208 That is not the reason why options(ChIPseeker.downstreamDistance = 500) do not work, the reason is because https://github.com/YuLab-SMU/ChIPseeker/issues/112#issuecomment-818660296.