GenomicFeatures
GenomicFeatures copied to clipboard
How to import pseudogenes from GFF3 file as GRanges object?
Thank you for developing such a powerful package for working with genomes in R. My questions is how do I import pseudogenes, e.g. MTCO2P2, from a NCBI RefSeq human assembly GFF3 file TxDb
as GRanges
object using GenomicFeatures
functions? This entry was not found in the output GRanges
objects of functions transcripts
, exons
, cds
, or genes
.
Pseudogene MTCO2P2 is on line 3862455 of GFF3 file GCF_000001405.40_GRCh38.p14_genomic.gff:
# NC_000018.10 Curated Genomic pseudogene 47853230 47853472 . - . ID=gene-MTCO2P2;Dbxref=GeneID:100873202,HGNC:HGNC:25354;Name=MTCO2P2;description=MT-CO2 pseudogene 2;gbkey=Gene;gene=MTCO2P2;gene_biotype=pseudogene;gene_synonym=HsT4010;pseudo=true
The GFF3 file (Last modified: 2023-10-11 12:39) was downloaded from here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
library(GenomicFeatures)
db <- makeTxDbFromGFF("GCF_000001405.40_GRCh38.p14_genomic.gff")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
# some transcripts have no "transcript_id" attribute ==> their
# name ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
# the transcript names ("tx_name" column in the TxDb object)
# imported from the "transcript_id" attribute are not unique
# 3: In .find_exon_cds(exons, cds) :
# The following transcripts have exons that contain more than
# one CDS (only the first CDS was kept for each exon):
# rna-NM_001134939.1, rna-NM_001172437.2, rna-NM_001184961.1,
# rna-NM_001301020.1, rna-NM_001301302.1, rna-NM_001301371.1,
# rna-NM_002537.3, rna-NM_004152.3, rna-NM_015068.3,
# rna-NM_016178.2
tx <- transcripts(db, columns = c("tx_id", "tx_name", "gene_id"))
ex <- exons(db, columns = c("exon_id", "gene_id"))
cds <- cds(db, columns = c("cds_id", "gene_id"))
ge <- genes(db)
txdf <- as.data.frame(tx)
exdf <- as.data.frame(ex)
cdsdf <- as.data.frame(cds)
gedf <- as.data.frame(ge)
"MTCO2P2" %in% txdf$gene_id
# [1] FALSE
47853230 %in% txdf$start
# [1] FALSE
47853472 %in% txdf$end
# [1] FALSE
"MTCO2P2" %in% exdf$gene_id
# [1] FALSE
47853230 %in% exdf$start
# [1] FALSE
47853472 %in% exdf$end
# [1] FALSE
"MTCO2P2" %in% cdsdf$gene_id
# [1] FALSE
47853230 %in% cdsdf$start
# [1] FALSE
47853472 %in% cdsdf$end
# [1] FALSE
"MTCO2P2" %in% gedf$gene_id
# [1] FALSE
47853230 %in% gedf$start
# [1] FALSE
47853472 %in% gedf$end
# [1] FALSE
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] GenomicFeatures_1.54.1 AnnotationDbi_1.64.1
[3] Biobase_2.62.0 GenomicRanges_1.54.1
[5] GenomeInfoDb_1.38.2 IRanges_2.36.0
[7] S4Vectors_0.40.2 BiocGenerics_0.48.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.42.0 SummarizedExperiment_1.32.0
[3] xfun_0.41 rjson_0.2.21
[5] lattice_0.22-5 vctrs_0.6.5
[7] tools_4.3.2 bitops_1.0-7
[9] generics_0.1.3 curl_5.2.0
[11] parallel_4.3.2 tibble_3.2.1
[13] fansi_1.0.6 RSQLite_2.3.4
[15] blob_1.2.4 pkgconfig_2.0.3
[17] Matrix_1.6-4 dbplyr_2.4.0
[19] lifecycle_1.0.4 GenomeInfoDbData_1.2.11
[21] compiler_4.3.2 stringr_1.5.1
[23] Rsamtools_2.18.0 Biostrings_2.70.1
[25] progress_1.2.3 codetools_0.2-19
[27] htmltools_0.5.7 RCurl_1.98-1.13
[29] yaml_2.3.8 pillar_1.9.0
[31] crayon_1.5.2 BiocParallel_1.36.0
[33] DelayedArray_0.28.0 cachem_1.0.8
[35] abind_1.4-5 tidyselect_1.2.0
[37] digest_0.6.33 stringi_1.8.3
[39] dplyr_1.1.4 restfulr_0.0.15
[41] grid_4.3.2 biomaRt_2.58.0
[43] fastmap_1.1.1 SparseArray_1.2.2
[45] cli_3.6.2 magrittr_2.0.3
[47] S4Arrays_1.2.0 XML_3.99-0.16
[49] utf8_1.2.4 prettyunits_1.2.0
[51] filelock_1.0.3 rappdirs_0.3.3
[53] bit64_4.0.5 rmarkdown_2.25
[55] XVector_0.42.0 httr_1.4.7
[57] matrixStats_1.2.0 bit_4.0.5
[59] png_0.1-8 hms_1.1.3
[61] evaluate_0.23 memoise_2.0.1
[63] knitr_1.45 BiocIO_1.12.0
[65] BiocFileCache_2.10.1 rtracklayer_1.62.0
[67] rlang_1.1.2 glue_1.6.2
[69] DBI_1.1.3 xml2_1.3.6
[71] rstudioapi_0.15.0 R6_2.5.1
[73] MatrixGenerics_1.14.0 GenomicAlignments_1.38.0
[75] zlibbioc_1.48.0
Thank you