gggenes
gggenes copied to clipboard
Gene cluster orientation reverse complements
Dear David, I really appreciate the work you invested into this package, extremely useful!
I have one issue and maybe you know a smart solution, since you better know the possibilities of the package: In the exampe_genes, the genes always show the similar order. In real NCBI Datasets this is however rarely the case. Usually you end up with something like this:
sub_genomeA <- example_genes %>% filter(molecule == "Genome2")
sub_genomeB <- example_genes %>% filter(molecule == "Genome3")
sub_genomeA$start <- -sub_genomeA$start
sub_genomeA$end <- -sub_genomeA$end
baseval <- min(c(sub_genomeA$start,sub_genomeA$end))
sub_genomeA$start <- sub_genomeA$start - baseval
sub_genomeA$end<- sub_genomeA$end - baseval
sub_genome <- rbind(sub_genomeA,sub_genomeB)
ggplot(sub_genome, aes(xmin = start, xmax = end, y = molecule, fill = gene,forward = orientation)) +
geom_gene_arrow() +
facet_wrap(~ molecule, scales = "free", ncol = 1) +
scale_fill_brewer(palette = "Set3") +
theme_genes()
Can you imagine a function to identify such reverse complements and reverse them with a specific function ? Such that the target gene , like genE always points into same direction.
Best , Michael
Hi Michael,
I'm not sure if I'm understanding the problem correctly, so apologies if this isn't what you mean. In the example you've given, the 'NCBI-ified' version of Genome2 differs from the true map in only two ways: the direction of the entire molecule is reversed, and the gene locations are numbered from a different starting point:
library(tidyverse)
library(gggenes)
true_sub_genome_A <- sub_genomeA <- example_genes %>% filter(molecule == "Genome2")
sub_genomeA$start <- -sub_genomeA$start
sub_genomeA$end <- -sub_genomeA$end
baseval <- min(c(sub_genomeA$start,sub_genomeA$end))
sub_genomeA$start <- sub_genomeA$start - baseval
sub_genomeA$end<- sub_genomeA$end - baseval
true_sub_genome_A$molecule <- "TrueGenome2"
sub_genome <- rbind(sub_genomeA,true_sub_genome_A)
ggplot(sub_genome, aes(xmin = start, xmax = end, y = molecule, fill = gene,forward = orientation)) +
geom_gene_arrow() +
facet_wrap(~ molecule, scales = "free", ncol = 1) +
scale_fill_brewer(palette = "Set3") +
theme_genes()
Created on 2022-03-24 by the reprex package (v2.0.1)
If it's unclear whether or not this 'NCBI-ification has been applied', or perhaps has been applied to some molecules but not others, you can't automatically detect and correct for this without some assumption about the true biology – for example, 'every molecule has a genA oriented in a positive direction'. In that situation would would need some code to go though each molecule, detect which ones have a genA in the 'wrong' direction, and reverse the orientation of those molecules. Similarly, to correct for the change in how the gene locations are numbered, you would need to have an assumption based on information from outside the dataset e.g. 'the true location of the 5′ end of protB in each molecule is 14,000'.
If I am understanding correctly, you are looking for some code to perform the first task only (i.e. reverse the orientation of molecules that have a certain known gene oriented the 'wrong' way)?
Thanks for looking at the example. In my examples, mainly from bacterial genomes, it is actually the first point you recognized : the entire molecule with all genes are reversed, the cluster starts from a different starting point. Therefore, only the positions have to be recalculated based on the new starting point.
For most of my research partners, the orientation and exact position (14,000 etc.) of the individual genes dont even matter. They want to show the co-linearity of the gene cluster und this means some of the input genomes need to be reversed, that is the start and end points have to be recalculated.
My question is : (1) What would be a robust measure to detect if the genome cluster is reversed (gene positions) ? (2) What is the best way to transform start and end coordinates ? One way I used in the example above, maybe there are others.
I probably can also find an example from real-world E.coli, to show an example.
Here are a couple of options. First, I'll set up a data frame containing a 'true' and an 'NCBI-ified' version of Genome1.
library(tidyverse)
library(gggenes)
genome1 <- example_genes %>%
filter(molecule == "Genome1") %>%
mutate(molecule = "TrueGenome1") %>%
select(-strand, -orientation)
ncbi_genome1 <- genome1 %>%
mutate_at(c("start", "end"), ~ -.x) %>%
mutate_at(c("start", "end"), ~ .x - min(c(start, end))) %>%
mutate(molecule = "NCBIifiedGenome1")
genomes <- bind_rows(genome1, ncbi_genome1)
ggplot(genomes, aes(xmin = start, xmax = end, y = molecule, fill = gene)) +
geom_gene_arrow() +
facet_wrap(~ molecule, scales = "free", ncol = 1) +
scale_fill_brewer(palette = "Set3") +
theme_genes()
The first method looks at each molecule and checks whether a selected gene (in this example, genA) is oriented in the correct direction. If not, it marks the entire molecule as 'NCBI-ified'.
# Method 1: detect if a certain gene (genA) is oriented in the correct direction
genomes %>%
nest(gene, start, end) %>%
mutate(genAstart = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(start))) %>%
mutate(genAend = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(end))) %>%
mutate(isNCBIified = genAstart > genAend) %>%
select(-genAstart, genAend) %>%
unnest(data)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(gene, start, end)`?
#> # A tibble: 20 × 6
#> molecule gene start end genAend isNCBIified
#> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 TrueGenome1 genA 15389 17299 17299 FALSE
#> 2 TrueGenome1 genB 17301 18161 17299 FALSE
#> 3 TrueGenome1 genC 18176 18640 17299 FALSE
#> 4 TrueGenome1 genD 18641 18985 17299 FALSE
#> 5 TrueGenome1 genE 18999 20078 17299 FALSE
#> 6 TrueGenome1 genF 20086 20451 17299 FALSE
#> 7 TrueGenome1 protC 22777 22989 17299 FALSE
#> 8 TrueGenome1 protD 22986 24023 17299 FALSE
#> 9 TrueGenome1 protE 24024 25010 17299 FALSE
#> 10 TrueGenome1 protF 20474 22720 17299 FALSE
#> 11 NCBIifiedGenome1 genA 9621 7711 7711 TRUE
#> 12 NCBIifiedGenome1 genB 7709 6849 7711 TRUE
#> 13 NCBIifiedGenome1 genC 6834 6370 7711 TRUE
#> 14 NCBIifiedGenome1 genD 6369 6025 7711 TRUE
#> 15 NCBIifiedGenome1 genE 6011 4932 7711 TRUE
#> 16 NCBIifiedGenome1 genF 4924 4559 7711 TRUE
#> 17 NCBIifiedGenome1 protC 2233 2021 7711 TRUE
#> 18 NCBIifiedGenome1 protD 2024 987 7711 TRUE
#> 19 NCBIifiedGenome1 protE 986 0 7711 TRUE
#> 20 NCBIifiedGenome1 protF 4536 2290 7711 TRUE
The second method looks at two different genes and checks if they appear the correct order. In this example, protF should start after genC.
# Method 2: detect if a certain gene (protF) starts after another gene (genC)
genomes %>%
nest(gene, start, end) %>%
mutate(protFstart = map_dbl(data, ~ filter(.x, gene == "protF") %>% pull(start))) %>%
mutate(genCend = map_dbl(data, ~ filter(.x, gene == "genC") %>% pull(end))) %>%
mutate(isNCBIified = protFstart < genCend) %>%
select(-protFstart, genCend) %>%
unnest(data)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(gene, start, end)`?
#> # A tibble: 20 × 6
#> molecule gene start end genCend isNCBIified
#> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 TrueGenome1 genA 15389 17299 18640 FALSE
#> 2 TrueGenome1 genB 17301 18161 18640 FALSE
#> 3 TrueGenome1 genC 18176 18640 18640 FALSE
#> 4 TrueGenome1 genD 18641 18985 18640 FALSE
#> 5 TrueGenome1 genE 18999 20078 18640 FALSE
#> 6 TrueGenome1 genF 20086 20451 18640 FALSE
#> 7 TrueGenome1 protC 22777 22989 18640 FALSE
#> 8 TrueGenome1 protD 22986 24023 18640 FALSE
#> 9 TrueGenome1 protE 24024 25010 18640 FALSE
#> 10 TrueGenome1 protF 20474 22720 18640 FALSE
#> 11 NCBIifiedGenome1 genA 9621 7711 6370 TRUE
#> 12 NCBIifiedGenome1 genB 7709 6849 6370 TRUE
#> 13 NCBIifiedGenome1 genC 6834 6370 6370 TRUE
#> 14 NCBIifiedGenome1 genD 6369 6025 6370 TRUE
#> 15 NCBIifiedGenome1 genE 6011 4932 6370 TRUE
#> 16 NCBIifiedGenome1 genF 4924 4559 6370 TRUE
#> 17 NCBIifiedGenome1 protC 2233 2021 6370 TRUE
#> 18 NCBIifiedGenome1 protD 2024 987 6370 TRUE
#> 19 NCBIifiedGenome1 protE 986 0 6370 TRUE
#> 20 NCBIifiedGenome1 protF 4536 2290 6370 TRUE
In terms of transforming the start and end coordinates, the simplest way would be to just flip the sign of all the start and end coordinates for molecules that have been identified as 'NCBI-ified'.
# Identify NCBIified molecules using Method 1, then flip the sign of all start
# and end coordinates for those molecules only
genomes %>%
nest(gene, start, end) %>%
mutate(genAstart = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(start))) %>%
mutate(genAend = map_dbl(data, ~ filter(.x, gene == "genA") %>% pull(end))) %>%
mutate(isNCBIified = genAstart > genAend) %>%
select(-genAstart, genAend) %>%
unnest(data) %>%
mutate(start = if_else(isNCBIified, -start, start)) %>%
mutate(end = if_else(isNCBIified, -end, end)) %>%
ggplot(aes(xmin = start, xmax = end, y = molecule, fill = gene)) +
geom_gene_arrow() +
facet_wrap(~ molecule, scales = "free", ncol = 1) +
scale_fill_brewer(palette = "Set3") +
theme_genes()
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(gene, start, end)`?
Created on 2022-03-26 by the reprex package (v2.0.1)
You would probably also want to transpose the coordinates by adding or subtracting a constant, using a similar method.