BPCells icon indicating copy to clipboard operation
BPCells copied to clipboard

Incoherence with extendUpstream/extendDownstream parameters

Open Baboon61 opened this issue 11 months ago • 4 comments

Hello,

Thank you for this proxy of ArchR, it eases the understanding of the bioinformatics analysis !

In ArchR extendUpstream/extendDownstream are :

The minimum and maximum number of basepairs upstream of the transcription start site to consider for gene activity score calculation

In ArchR code one comment state (when extended the gene activity lengths) :

https://github.com/GreenleafLab/ArchR/blob/d9e741c980c7c64e5348c97a74d146cc95f8ba76/R/MatrixGeneScores.R#L341

We will test when genes pass by another gene promoter

In your version you extended the "gene activity" till the nereast gene BODY not promoter.

https://github.com/bnprks/BPCells/blob/f94f3f4ebcca5dea48e13756284950f37482342f/r/R/geneScores.R#L133

Extend up 1kb - 100kb upstream + downstream, but stop if we hit a neighboring gene

Which creates a discrepency between ArchR Gene Score Matrix results and yours.

As ArchR maintenance is quite slow atm, what do you think would be the best approach ?

I think your version makes more sense as an antisense gene located right after a sense gene will suck up genes from the latter, especially if that sense gene is massive.

Baboon61 avatar Jan 20 '25 11:01 Baboon61

Hi @Baboon61, thanks for your question and other issue reports -- it's very helpful to get user feedback.

In general, BPCells and ArchR should produce the same results for gene activity scores -- I've tested this on some real-world datasets to confirm that BPCells matches the ArchR values.

As to your specific question: looking just before the BPCells line you've found, you'll see that the gene variable is modified to be extended 5kb upstream. Therefore the distance calculations are in fact distance to nearest promoter (or gene end, depending on gene orientation). The code comment might be slightly misleading, but I believe the actual results match how ArchR does the calculation.

https://github.com/bnprks/BPCells/blob/f94f3f4ebcca5dea48e13756284950f37482342f/r/R/geneScores.R#L130-L134

To my knowledge, the only differences between BPCells and ArchR for gene scores should be:

  • ArchR's first tile of a chromosome runs only 499 bases (1-499), with all the rest running the correct 500bp. This creates a 1bp offset and so the tile matrix ArchR uses can differ from the tile matrix BPCells uses unless fragment coordinates are offset manually by 1bp. Since gene scores are calculated from the tile matrix, this can spill over into slightly different gene score matrices for fragments on the edge of a 500bp tile boundary.
  • ArchR rounds gene scores to 3 decimal places, while BPCells leaves them to full precision
  • BPCells handles nested genes differently (unless addArchRBug=TRUE is set)
  • If reading from a 10x-generated file, ArchR's end coordinates are off by 1bp due to some incorrect documentation on the 10x side.

The scripts I used to test BPCells/ArchR equivalence are in the tests/real_data folder, e.g. bpcells_archr_compatible.R has an example of the code I used to test that the results are identical when accounting for these ArchR quirks.

-Ben

bnprks avatar Jan 23 '25 07:01 bnprks

Hello Ben, thanks a lot for taking the time and for such detailed answer.

With my dataset for the same parameters and bin size (500), if I set in ArchR addGeneScoreMatrix :

extendUpstream = c(0, 0)
extendDownstream = c(0, 0)

And in BPCells gene_score_tiles_archr :

extension_lengths$upstream <- pmin(pmax(extension_lengths$upstream - tile_width, 0), 0)
extension_lengths$downstream <- pmin(pmax(extension_lengths$downstream - tile_width, 0), 0)

When I plot the rowSums I get an almost perfect diagonal line of gene counts (same number and order of genes and cells)

Image

But the more I increase the parameters, the more inflated counts I get from ArchR.

Here, an example with

extendUpstream = c(1000, 10000)
extendDownstream = c(1000, 10000)
extension_lengths$upstream <- pmin(pmax(extension_lengths$upstream - tile_width, 1000), 10000)
extension_lengths$downstream <- pmin(pmax(extension_lengths$downstream - tile_width, 1000), 10000)

Image

It got solved adding what you mention as addArchRBug=TRUE. Could you describe a bit more what is going on with this feature ? Thanks !

About the other differences you mentioned between BPCells and ArchR, I think they are subtle enough to be ignored.

Baboon61 avatar Jan 23 '25 15:01 Baboon61

Hi @Baboon61, apologies for not noticing your reply promptly.

When ArchR encounters nested genes, it computes the extensions very strangely. If we look at the nested genes SYN2 and TIMP4 which look like this (not to scale):

       |-------------------------|                         SYN2
                |------|                                   TIMP4

Then if ArchR is trying to extend at least 1k in each direction and up to 100k, then it extends like this (not to scale):

+++++++|-------------------------|+                        SYN2
               +|------|+++++++++++++++++++++++            TIMP4

Notice that SYN2 gets only the minimal extension on the right because of the presence of TIMP4, while TIMP4 is freely allowed to extend through the body of SYN2 on the right. This is not dependent on which strand SYN2 and TIMP4 are on, it will always extend the outer gene to the left (smaller coordinates) and the inner gene to the right (larger coordinates).

I think when ArchR was programmed, there was an implicit assumption that if a gene 1 start comes before gene 2 start, then gene 1 end must also come before gene 2 end. But this is false for nested genes.

When addArchRBug=FALSE, BPCells would instead extend like this:

+++++++|-------------------------|+++++++                  SYN2
               +|------|+                                  TIMP4

In this case, BPCells counts the nested gene as having distance 0 to its nearest genes, and ignores the inside gene when calculating distances to neighboring genes for the outside gene. I think this is a more logical approach, but it does noticeably break ArchR compatibility as you've observed so I have the addArchRBug option available if compatibility is most important.

bnprks avatar Jan 31 '25 18:01 bnprks

Hi @bnprks, thank you very much for the clear explanation ! While your method does not gives much freedom to inside genes I think it is the most logical approach for these specific situations. Thanks again.

Baboon61 avatar Feb 05 '25 09:02 Baboon61