GenomicFeatures icon indicating copy to clipboard operation
GenomicFeatures copied to clipboard

How to import pseudogenes from GFF3 file as GRanges object?

Open zhewa opened this issue 2 months ago • 0 comments

Thank you for developing such a powerful package for working with genomes in R. My questions is how do I import pseudogenes, e.g. MTCO2P2, from a NCBI RefSeq human assembly GFF3 file TxDb as GRanges object using GenomicFeatures functions? This entry was not found in the output GRanges objects of functions transcripts, exons, cds, or genes.

Pseudogene MTCO2P2 is on line 3862455 of GFF3 file GCF_000001405.40_GRCh38.p14_genomic.gff: # NC_000018.10 Curated Genomic pseudogene 47853230 47853472 . - . ID=gene-MTCO2P2;Dbxref=GeneID:100873202,HGNC:HGNC:25354;Name=MTCO2P2;description=MT-CO2 pseudogene 2;gbkey=Gene;gene=MTCO2P2;gene_biotype=pseudogene;gene_synonym=HsT4010;pseudo=true

The GFF3 file (Last modified: 2023-10-11 12:39) was downloaded from here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz

library(GenomicFeatures)
db <- makeTxDbFromGFF("GCF_000001405.40_GRCh38.p14_genomic.gff")

# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#     some transcripts have no "transcript_id" attribute ==> their
#     name ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#     the transcript names ("tx_name" column in the TxDb object)
#     imported from the "transcript_id" attribute are not unique
# 3: In .find_exon_cds(exons, cds) :
#     The following transcripts have exons that contain more than
#     one CDS (only the first CDS was kept for each exon):
#     rna-NM_001134939.1, rna-NM_001172437.2, rna-NM_001184961.1,
#     rna-NM_001301020.1, rna-NM_001301302.1, rna-NM_001301371.1,
#     rna-NM_002537.3, rna-NM_004152.3, rna-NM_015068.3,
#     rna-NM_016178.2

tx <- transcripts(db, columns = c("tx_id", "tx_name", "gene_id"))
ex <- exons(db, columns = c("exon_id", "gene_id"))
cds <- cds(db, columns = c("cds_id", "gene_id"))
ge <- genes(db)

txdf <- as.data.frame(tx)
exdf <- as.data.frame(ex)
cdsdf <- as.data.frame(cds)
gedf <- as.data.frame(ge)

"MTCO2P2" %in% txdf$gene_id
# [1] FALSE
47853230 %in% txdf$start
# [1] FALSE
47853472 %in% txdf$end
# [1] FALSE

"MTCO2P2" %in% exdf$gene_id
# [1] FALSE
47853230 %in% exdf$start
# [1] FALSE
47853472 %in% exdf$end
# [1] FALSE

"MTCO2P2" %in% cdsdf$gene_id
# [1] FALSE
47853230 %in% cdsdf$start
# [1] FALSE
47853472 %in% cdsdf$end
# [1] FALSE

"MTCO2P2" %in% gedf$gene_id
# [1] FALSE
47853230 %in% gedf$start
# [1] FALSE
47853472 %in% gedf$end
# [1] FALSE
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] GenomicFeatures_1.54.1 AnnotationDbi_1.64.1  
[3] Biobase_2.62.0         GenomicRanges_1.54.1  
[5] GenomeInfoDb_1.38.2    IRanges_2.36.0        
[7] S4Vectors_0.40.2       BiocGenerics_0.48.1   

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0             SummarizedExperiment_1.32.0
 [3] xfun_0.41                   rjson_0.2.21               
 [5] lattice_0.22-5              vctrs_0.6.5                
 [7] tools_4.3.2                 bitops_1.0-7               
 [9] generics_0.1.3              curl_5.2.0                 
[11] parallel_4.3.2              tibble_3.2.1               
[13] fansi_1.0.6                 RSQLite_2.3.4              
[15] blob_1.2.4                  pkgconfig_2.0.3            
[17] Matrix_1.6-4                dbplyr_2.4.0               
[19] lifecycle_1.0.4             GenomeInfoDbData_1.2.11    
[21] compiler_4.3.2              stringr_1.5.1              
[23] Rsamtools_2.18.0            Biostrings_2.70.1          
[25] progress_1.2.3              codetools_0.2-19           
[27] htmltools_0.5.7             RCurl_1.98-1.13            
[29] yaml_2.3.8                  pillar_1.9.0               
[31] crayon_1.5.2                BiocParallel_1.36.0        
[33] DelayedArray_0.28.0         cachem_1.0.8               
[35] abind_1.4-5                 tidyselect_1.2.0           
[37] digest_0.6.33               stringi_1.8.3              
[39] dplyr_1.1.4                 restfulr_0.0.15            
[41] grid_4.3.2                  biomaRt_2.58.0             
[43] fastmap_1.1.1               SparseArray_1.2.2          
[45] cli_3.6.2                   magrittr_2.0.3             
[47] S4Arrays_1.2.0              XML_3.99-0.16              
[49] utf8_1.2.4                  prettyunits_1.2.0          
[51] filelock_1.0.3              rappdirs_0.3.3             
[53] bit64_4.0.5                 rmarkdown_2.25             
[55] XVector_0.42.0              httr_1.4.7                 
[57] matrixStats_1.2.0           bit_4.0.5                  
[59] png_0.1-8                   hms_1.1.3                  
[61] evaluate_0.23               memoise_2.0.1              
[63] knitr_1.45                  BiocIO_1.12.0              
[65] BiocFileCache_2.10.1        rtracklayer_1.62.0         
[67] rlang_1.1.2                 glue_1.6.2                 
[69] DBI_1.1.3                   xml2_1.3.6                 
[71] rstudioapi_0.15.0           R6_2.5.1                   
[73] MatrixGenerics_1.14.0       GenomicAlignments_1.38.0   
[75] zlibbioc_1.48.0    

Thank you

zhewa avatar Apr 10 '24 20:04 zhewa