SnapATAC icon indicating copy to clipboard operation
SnapATAC copied to clipboard

gmat

Open hjack123 opened this issue 4 years ago • 4 comments

Hi, When creating gmat, is it possible to just make counts under a promoter region of a gene, instead of everywhere on the gene?

Thanks!!!

hjack123 avatar Jan 10 '20 00:01 hjack123

@hjack123 This solution is not exactly what you are looking for, but it comes pretty close, I think. You can use coordinates of promoter regions (e.g. 1kb upstream and 0.5kb downstream of TSS, or your favorite definition of the promoter) instead of the coordinates of genes in the following code: genes = read.table("gencode.vM16.gene.bed"). Such that when you run createGmatFromMat, it will include counts from only those bins/peaks (depending on if you utilize "bmat" or "pmat") that overlap with the promoter regions.

ravipatel4 avatar Jan 13 '20 22:01 ravipatel4

@ravipatel4 thanks I did the same at the end

hjack123 avatar Jan 15 '20 23:01 hjack123

Hi @ravipatel4 ! Could you elaborate a bit more? I would like to count how many reads I get into repetitive regions upstream certains genes. I have imported my own bed file and created the IRange object.

x.sp = createGmatFromMat(
    obj=x.sp,
    input.mat="bmat",
    genes=DAR.select,
    do.par=TRUE,
    num.cores=2
  );

I get this error when I am trying to create the matrix for these intervals:

Error in createGmatFromMat.default(obj = x.sp, input.mat = "bmat", genes = DAR.gr, : genes does not contain gene names

My DAR.select Grange object looks like this :

GRanges object with 4 ranges and 1 metadata column:
        seqnames              ranges strand |       names
           <Rle>           <IRanges>  <Rle> | <character>
  peak2     chr1 143849553-143849709      * |       peak2
  peak3     chr1 105026965-105027121      * |       peak3
  peak4     chr1     4611985-4612141      * |       peak4
  peak5     chr1   51385207-51385363      * |       peak5
  -------
  seqinfo: 20 sequences from an unspecified genome; no seqlengths

while the one created form the read.table("gencode.vM16.gene.bed") looks like this:

GRanges object with 4 ranges and 1 metadata column:
      seqnames              ranges strand |     name
         <Rle>           <IRanges>  <Rle> | <factor>
  [1]     chr3   34650405-34652461      * |     Sox2
  [2]     chr4   55527143-55532466      * |     Klf4
  [3]     chr6 122707489-122714633      * |    Nanog
  [4]    chr17   35506018-35510772      * |   Pou5f1
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

So I basically have no gene names to create the matrix but I have the range and a column called name.

If you could help me figure out what is wrong, this would be great!

Thanks in advance!

JihedC avatar Mar 25 '20 12:03 JihedC

@JihedC, Can you tell me how you generated your GRanges object?

It seems to work for me. I first read my bed intervals into peaksToGenes data frame. And then used the following code to generate a GRanges object.

# Using only first 6 bed intervals for this example.
genes.gr = GRanges(peaksToGenes[1:6,1], IRanges(peaksToGenes[1:6,2], peaksToGenes[1:6,3]), name= peaksToGenes[1:6,4] )

Here is how my GRanges look.

GRanges object with 6 ranges and 1 metadata column:
      seqnames          ranges strand |        name
         <Rle>       <IRanges>  <Rle> | <character>
  [1]     chr1 4492499-4492660      * |       peak1
  [2]     chr1 4492666-4492816      * |       peak2
  [3]     chr1 4496326-4497088      * |       peak3
  [4]     chr1 4525578-4525728      * |       peak4
  [5]     chr1 4571893-4572129      * |       peak5
  [6]     chr1 4600873-4601130      * |       peak6
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

It works for me when I use the createGmatFromMat function using the GRanges created above.

More importantly though, another problem your approach has, apart from the error, is that you are using "bmat" matrix in createGmatFromMat, which, based on my understanding of SnapATAC, means that you won't get reads mapping to your bed intervals, but instead you will get reads mapping to the "bins" overlapping with your bed intervals. So, be careful how you use the "bmat".

If you are specifically interested in the number of peaks mapping to your bed intervals, you would first need to add a cell x repetitive regions count matrix to the snap object. For example,

snaptools snap-add-pmat \
        --snap-file my.snap \
        --peak-file repetitive.bed

where, repetitive.bed will have all regions that you are interested in and my.snap is the snap file you have been using so far. And then load the my.snap file into R and use "pmat" instead of "bmat" in createGmatFromMat.

DISCLAIMER: I haven't tested this approach; I cooked it up based on my understanding of how SnapATAC works. You might need to tweak it depending on the issues you may face. Perhaps that authors might be able to shed more light on this.

Good luck.

ravipatel4 avatar Mar 27 '20 04:03 ravipatel4