gggenes icon indicating copy to clipboard operation
gggenes copied to clipboard

Gene cluster orientation reverse complements

Open mweberr opened this issue 2 years ago • 3 comments

Dear David, I really appreciate the work you invested into this package, extremely useful!

I have one issue and maybe you know a smart solution, since you better know the possibilities of the package: In the exampe_genes, the genes always show the similar order. In real NCBI Datasets this is however rarely the case. Usually you end up with something like this:

sub_genomeA <- example_genes %>% filter(molecule == "Genome2")
sub_genomeB <- example_genes %>% filter(molecule == "Genome3")

sub_genomeA$start <- -sub_genomeA$start
sub_genomeA$end <- -sub_genomeA$end
baseval <- min(c(sub_genomeA$start,sub_genomeA$end))
sub_genomeA$start <- sub_genomeA$start - baseval
sub_genomeA$end<- sub_genomeA$end - baseval

sub_genome <- rbind(sub_genomeA,sub_genomeB)

ggplot(sub_genome, aes(xmin = start, xmax = end, y = molecule, fill = gene,forward = orientation)) +
  geom_gene_arrow() +
  facet_wrap(~ molecule, scales = "free", ncol = 1) +
  scale_fill_brewer(palette = "Set3") +
  theme_genes()

Can you imagine a function to identify such reverse complements and reverse them with a specific function ? Such that the target gene , like genE always points into same direction.

Best , Michael

mweberr avatar Mar 14 '22 12:03 mweberr

Hi Michael,

I'm not sure if I'm understanding the problem correctly, so apologies if this isn't what you mean. In the example you've given, the 'NCBI-ified' version of Genome2 differs from the true map in only two ways: the direction of the entire molecule is reversed, and the gene locations are numbered from a different starting point:

library(tidyverse)
library(gggenes)

true_sub_genome_A <- sub_genomeA <- example_genes %>% filter(molecule == "Genome2")

sub_genomeA$start <- -sub_genomeA$start
sub_genomeA$end <- -sub_genomeA$end
baseval <- min(c(sub_genomeA$start,sub_genomeA$end))
sub_genomeA$start <- sub_genomeA$start - baseval
sub_genomeA$end<- sub_genomeA$end - baseval

true_sub_genome_A$molecule <- "TrueGenome2"

sub_genome <- rbind(sub_genomeA,true_sub_genome_A)

ggplot(sub_genome, aes(xmin = start, xmax = end, y = molecule, fill = gene,forward = orientation)) +
  geom_gene_arrow() +
  facet_wrap(~ molecule, scales = "free", ncol = 1) +
  scale_fill_brewer(palette = "Set3") +
  theme_genes()

Created on 2022-03-24 by the reprex package (v2.0.1)

If it's unclear whether or not this 'NCBI-ification has been applied', or perhaps has been applied to some molecules but not others, you can't automatically detect and correct for this without some assumption about the true biology – for example, 'every molecule has a genA oriented in a positive direction'. In that situation would would need some code to go though each molecule, detect which ones have a genA in the 'wrong' direction, and reverse the orientation of those molecules. Similarly, to correct for the change in how the gene locations are numbered, you would need to have an assumption based on information from outside the dataset e.g. 'the true location of the 5′ end of protB in each molecule is 14,000'.

If I am understanding correctly, you are looking for some code to perform the first task only (i.e. reverse the orientation of molecules that have a certain known gene oriented the 'wrong' way)?

wilkox avatar Mar 24 '22 10:03 wilkox

Thanks for looking at the example. In my examples, mainly from bacterial genomes, it is actually the first point you recognized : the entire molecule with all genes are reversed, the cluster starts from a different starting point. Therefore, only the positions have to be recalculated based on the new starting point.

For most of my research partners, the orientation and exact position (14,000 etc.) of the individual genes dont even matter. They want to show the co-linearity of the gene cluster und this means some of the input genomes need to be reversed, that is the start and end points have to be recalculated.

My question is : (1) What would be a robust measure to detect if the genome cluster is reversed (gene positions) ? (2) What is the best way to transform start and end coordinates ? One way I used in the example above, maybe there are others.

I probably can also find an example from real-world E.coli, to show an example.

mweberr avatar Mar 24 '22 12:03 mweberr

Here are a couple of options. First, I'll set up a data frame containing a 'true' and an 'NCBI-ified' version of Genome1.

library(tidyverse)
library(gggenes)

genome1 <- example_genes %>%
  filter(molecule == "Genome1") %>%
  mutate(molecule = "TrueGenome1") %>%
  select(-strand, -orientation)

ncbi_genome1 <- genome1 %>%
  mutate_at(c("start", "end"), ~ -.x) %>%
  mutate_at(c("start", "end"), ~ .x - min(c(start, end))) %>%
  mutate(molecule = "NCBIifiedGenome1")

genomes <- bind_rows(genome1, ncbi_genome1)

ggplot(genomes, aes(xmin = start, xmax = end, y = molecule, fill = gene)) +
  geom_gene_arrow() +
  facet_wrap(~ molecule, scales = "free", ncol = 1) +
  scale_fill_brewer(palette = "Set3") +
  theme_genes()

The first method looks at each molecule and checks whether a selected gene (in this example, genA) is oriented in the correct direction. If not, it marks the entire molecule as 'NCBI-ified'.

# Method 1: detect if a certain gene (genA) is oriented in the correct direction
genomes %>%
  nest(gene, start, end) %>%
  mutate(genAstart = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(start))) %>%
  mutate(genAend = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(end))) %>%
  mutate(isNCBIified = genAstart > genAend) %>%
  select(-genAstart, genAend) %>%
  unnest(data)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(gene, start, end)`?
#> # A tibble: 20 × 6
#>    molecule         gene  start   end genAend isNCBIified
#>    <chr>            <chr> <dbl> <dbl>   <dbl> <lgl>      
#>  1 TrueGenome1      genA  15389 17299   17299 FALSE      
#>  2 TrueGenome1      genB  17301 18161   17299 FALSE      
#>  3 TrueGenome1      genC  18176 18640   17299 FALSE      
#>  4 TrueGenome1      genD  18641 18985   17299 FALSE      
#>  5 TrueGenome1      genE  18999 20078   17299 FALSE      
#>  6 TrueGenome1      genF  20086 20451   17299 FALSE      
#>  7 TrueGenome1      protC 22777 22989   17299 FALSE      
#>  8 TrueGenome1      protD 22986 24023   17299 FALSE      
#>  9 TrueGenome1      protE 24024 25010   17299 FALSE      
#> 10 TrueGenome1      protF 20474 22720   17299 FALSE      
#> 11 NCBIifiedGenome1 genA   9621  7711    7711 TRUE       
#> 12 NCBIifiedGenome1 genB   7709  6849    7711 TRUE       
#> 13 NCBIifiedGenome1 genC   6834  6370    7711 TRUE       
#> 14 NCBIifiedGenome1 genD   6369  6025    7711 TRUE       
#> 15 NCBIifiedGenome1 genE   6011  4932    7711 TRUE       
#> 16 NCBIifiedGenome1 genF   4924  4559    7711 TRUE       
#> 17 NCBIifiedGenome1 protC  2233  2021    7711 TRUE       
#> 18 NCBIifiedGenome1 protD  2024   987    7711 TRUE       
#> 19 NCBIifiedGenome1 protE   986     0    7711 TRUE       
#> 20 NCBIifiedGenome1 protF  4536  2290    7711 TRUE

The second method looks at two different genes and checks if they appear the correct order. In this example, protF should start after genC.

# Method 2: detect if a certain gene (protF) starts after another gene (genC)
genomes %>%
  nest(gene, start, end) %>%
  mutate(protFstart = map_dbl(data, ~ filter(.x, gene == "protF") %>% pull(start))) %>%
  mutate(genCend = map_dbl(data, ~ filter(.x, gene == "genC") %>% pull(end))) %>%
  mutate(isNCBIified = protFstart < genCend) %>%
  select(-protFstart, genCend) %>%
  unnest(data)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(gene, start, end)`?
#> # A tibble: 20 × 6
#>    molecule         gene  start   end genCend isNCBIified
#>    <chr>            <chr> <dbl> <dbl>   <dbl> <lgl>      
#>  1 TrueGenome1      genA  15389 17299   18640 FALSE      
#>  2 TrueGenome1      genB  17301 18161   18640 FALSE      
#>  3 TrueGenome1      genC  18176 18640   18640 FALSE      
#>  4 TrueGenome1      genD  18641 18985   18640 FALSE      
#>  5 TrueGenome1      genE  18999 20078   18640 FALSE      
#>  6 TrueGenome1      genF  20086 20451   18640 FALSE      
#>  7 TrueGenome1      protC 22777 22989   18640 FALSE      
#>  8 TrueGenome1      protD 22986 24023   18640 FALSE      
#>  9 TrueGenome1      protE 24024 25010   18640 FALSE      
#> 10 TrueGenome1      protF 20474 22720   18640 FALSE      
#> 11 NCBIifiedGenome1 genA   9621  7711    6370 TRUE       
#> 12 NCBIifiedGenome1 genB   7709  6849    6370 TRUE       
#> 13 NCBIifiedGenome1 genC   6834  6370    6370 TRUE       
#> 14 NCBIifiedGenome1 genD   6369  6025    6370 TRUE       
#> 15 NCBIifiedGenome1 genE   6011  4932    6370 TRUE       
#> 16 NCBIifiedGenome1 genF   4924  4559    6370 TRUE       
#> 17 NCBIifiedGenome1 protC  2233  2021    6370 TRUE       
#> 18 NCBIifiedGenome1 protD  2024   987    6370 TRUE       
#> 19 NCBIifiedGenome1 protE   986     0    6370 TRUE       
#> 20 NCBIifiedGenome1 protF  4536  2290    6370 TRUE

In terms of transforming the start and end coordinates, the simplest way would be to just flip the sign of all the start and end coordinates for molecules that have been identified as 'NCBI-ified'.

# Identify NCBIified molecules using Method 1, then flip the sign of all start
# and end coordinates for those molecules only
genomes %>%
  nest(gene, start, end) %>%
  mutate(genAstart = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(start))) %>%
  mutate(genAend = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(end))) %>%
  mutate(isNCBIified = genAstart > genAend) %>%
  select(-genAstart, genAend) %>%
  unnest(data) %>%
  mutate(start = if_else(isNCBIified, -start, start)) %>%
  mutate(end = if_else(isNCBIified, -end, end)) %>%
  ggplot(aes(xmin = start, xmax = end, y = molecule, fill = gene)) +
    geom_gene_arrow() +
    facet_wrap(~ molecule, scales = "free", ncol = 1) +
    scale_fill_brewer(palette = "Set3") +
    theme_genes()
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(gene, start, end)`?

Created on 2022-03-26 by the reprex package (v2.0.1)

You would probably also want to transpose the coordinates by adding or subtracting a constant, using a similar method.

wilkox avatar Mar 26 '22 04:03 wilkox