Rsamtools icon indicating copy to clipboard operation
Rsamtools copied to clipboard

Error with getSeq from opened fasta file (Rsamtools 2.0.2)

Open farshadf opened this issue 5 years ago • 7 comments

Hi, I am using Rsamtools 2.0.2, and I have faced an issue on reading sequences from fasta files.

My issue is similar to https://github.com/Bioconductor/Rsamtools/issues/5 , but not exactly. When I try to "getSeq" from fasta file I get an error message. This is my script:

GTFfile = "Homo_sapiens.GRCh37.87.gtf"
FASTAfile = "Homo_sapiens.GRCh37.dna.toplevel.fa"

FASTA <- FaFile(FASTAfile)
open(FASTA)

getSeq(FASTA, GRanges("chr1", IRanges(1,5)))

I get this: Error in value[3L] : record 1 (chr1:1-5) failed file: Homo_sapiens.GRCh37.dna.toplevel.fa

If I run it in Windows, I get this: [E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?) Error in value[3L] : record 1 (chr1:1-5) failed file: Homo_sapiens.GRCh37.dna.toplevel.fa

The reference files are standard ensembl reference files and are downloaded from ftp://ftp.ensembl.org/pub/grch37/release-87/fasta/homo_sapiens/dna .

Thank you for your help. Farshad

farshadf avatar Oct 04 '19 01:10 farshadf

Can you please provide your sessionInfo()? Thanks.

hpages avatar Oct 04 '19 03:10 hpages

Sorry, I should have added that already:


R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)

Matrix products: default
BLAS/LAPACK: /data/miniconda3/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] xtable_1.8-3         rtracklayer_1.42.2   Rsamtools_1.34.0    
 [4] Biostrings_2.50.2    XVector_0.22.0       GenomicRanges_1.34.0
 [7] GenomeInfoDb_1.18.1  IRanges_2.16.0       S4Vectors_0.20.1    
[10] BiocGenerics_0.28.0 

loaded via a namespace (and not attached):
 [1] zlibbioc_1.28.0             GenomicAlignments_1.18.1   
 [3] BiocParallel_1.14.2         BSgenome_1.50.0            
 [5] lattice_0.20-38             tools_3.5.1                
 [7] SummarizedExperiment_1.12.0 grid_3.5.1                 
 [9] Biobase_2.42.0              matrixStats_0.54.0         
[11] yaml_2.2.0                  Matrix_1.2-15              
[13] GenomeInfoDbData_1.2.1      BiocManager_1.30.4         
[15] bitops_1.0-6                RCurl_1.95-4.11            
[17] DelayedArray_0.8.0          compiler_3.5.1             
[19] XML_3.98-1.16              

farshadf avatar Oct 04 '19 16:10 farshadf

Thanks but in your original post you said you were using Rsamtools 2.0.2 but now your sessionInfo() reports that you were using Rsamtools 1.34.0. We need to see the full transcript of the R session that is producing the error. The transcript should contain the call to sessionInfo() at the end of the session. So we know exactly what command you used, what error you got, on which OS you are, and what version of Bioconductor/R you use. Thanks!

hpages avatar Oct 04 '19 19:10 hpages

I am having a similar issue. The error happens from chromosome 13 onwards. This is in windows

library(Rsamtools)
> fastaFile<-"F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta"
> FASTA <- Rsamtools::FaFile(fastaFile)
> FASTA
class: FaFile 
path: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
index: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta.fai
gzindex: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta.gzi
isOpen: FALSE 
yieldSize: NA 
> for (chrom in as.character(c(1:22,"X","Y"))){
+   try(
+     print(getSeq(FASTA, GRanges(chrom, IRanges(41194312,41194322))))
+   )
+ }
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 TTTTGTTTTGT                                                                                                                               1
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 GCCATAAAAAA                                                                                                                               2
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 GTATTTACAAA                                                                                                                               3
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 CCCAAAGGAAA                                                                                                                               4
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 TGGATAAGAAT                                                                                                                               5
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 GTAATGCCATT                                                                                                                               6
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 TTTGATTTAAG                                                                                                                               7
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 ATTCCAGCACC                                                                                                                               8
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 TAAGTTGGGGA                                                                                                                               9
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 NNNNNNNNNNN                                                                                                                               10
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 ATTTTTAAAAC                                                                                                                               11
  A DNAStringSet instance of length 1
    width seq                                                                                                                                       names               
[1]    11 TTAAGCAATAT                                                                                                                               12
Error in value[[3L]](cond) :  record 1 (13:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (14:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (15:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (16:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (17:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (18:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (19:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (20:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (21:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (22:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (X:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) :  record 1 (Y:41194312-41194322) failed
  file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta

Here's my sessionInfo():

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsamtools_2.2.3      Biostrings_2.54.0    XVector_0.26.0       GenomicRanges_1.38.0 GenomeInfoDb_1.22.1  IRanges_2.20.2       S4Vectors_0.24.4    
[8] BiocGenerics_0.32.0 

loaded via a namespace (and not attached):
[1] bitops_1.0-6           zlibbioc_1.32.0        data.table_1.12.8      BiocParallel_1.20.1    tools_3.6.3            RCurl_1.98-1.2         compiler_3.6.3        
[8] GenomeInfoDbData_1.2.2

ahwanpandey avatar Jul 20 '20 04:07 ahwanpandey

I installed Rsamtools_2.4.0 and the same error persists.

ahwanpandey avatar Jul 21 '20 00:07 ahwanpandey

Another thing I tried was to delete the fasta index and let Rsamtools create the index while running. The index it creates is weird from chr14 onwards.

It appends "1844674407" i.e. 2^64 to the third column:

1       249250621       52      60      61
2       243199373       253404903       60      61
3       198022430       500657651       60      61
4       191154276       701980507       60      61
5       180915260       896320740       60      61
6       171115067       1080251307      60      61
7       159138663       1254218344      60      61
8       146364022       1416009371      60      61
9       141213431       1564812846      60      61
10      135534747       1708379889      60      61
11      135006516       1846173603      60      61
12      133851895       1983430282      60      61
13      115169878       2119513096      60      61
14      107349540       18446744071651186846    60      61
15      102531392       18446744071760325599    60      61
16      90354753        18446744071864565901    60      61
17      81195210        18446744071956426620    60      61
18      78077248        18446744072038975137    60      61
19      59128983        18446744072118353726    60      61
20      63025520        18446744072178468246    60      61
21      48129895        18446744072242544245    60      61
22      51304566        18446744072291476358    60      61
X       155270560       18446744072343636053    60      61
Y       59373566        18446744072501494513    60      61
MT      16569   18446744072561857717    70      71
GL000207.1      4262    18446744072561874585    60      61
GL000226.1      15008   18446744072561878981    60      61
GL000229.1      19913   18446744072561894302    60      61
GL000231.1      27386   18446744072561914609    60      61
GL000210.1      27682   18446744072561942514    60      61
GL000239.1      33824   18446744072561970720    60      61
GL000235.1      34474   18446744072562005170    60      61
GL000201.1      36148   18446744072562040281    60      61
GL000247.1      36422   18446744072562077094    60      61
GL000245.1      36651   18446744072562114186    60      61
GL000197.1      37175   18446744072562151510    60      61
GL000203.1      37498   18446744072562189367    60      61
GL000246.1      38154   18446744072562227552    60      61
GL000249.1      38502   18446744072562266404    60      61
GL000196.1      38914   18446744072562305610    60      61
GL000248.1      39786   18446744072562345235    60      61
GL000244.1      39929   18446744072562385747    60      61
GL000238.1      39939   18446744072562426404    60      61
GL000202.1      40103   18446744072562467071    60      61
GL000234.1      40531   18446744072562507905    60      61
GL000232.1      40652   18446744072562549174    60      61
GL000206.1      41001   18446744072562590566    60      61
GL000240.1      41933   18446744072562632313    60      61
GL000236.1      41934   18446744072562675007    60      61
GL000241.1      42152   18446744072562717702    60      61
GL000243.1      43341   18446744072562760619    60      61
GL000242.1      43523   18446744072562804745    60      61
GL000230.1      43691   18446744072562849056    60      61
GL000237.1      45867   18446744072562893538    60      61
GL000233.1      45941   18446744072562940232    60      61
GL000204.1      81310   18446744072562987001    60      61
GL000198.1      90085   18446744072563069729    60      61
GL000208.1      92689   18446744072563161378    60      61
GL000191.1      106433  18446744072563255675    60      61
GL000227.1      128374  18446744072563363945    60      61
GL000228.1      129120  18446744072563494522    60      61
GL000214.1      137718  18446744072563625857    60      61
GL000221.1      155397  18446744072563765934    60      61
GL000209.1      159169  18446744072563923984    60      61
GL000218.1      161147  18446744072564085869    60      61
GL000220.1      161802  18446744072564249765    60      61
GL000213.1      164239  18446744072564414327    60      61
GL000211.1      166566  18446744072564581367    60      61
GL000199.1      169874  18446744072564750773    60      61
GL000217.1      172149  18446744072564923542    60      61
GL000216.1      172294  18446744072565098624    60      61
GL000215.1      172545  18446744072565273853    60      61
GL000205.1      174588  18446744072565449337    60      61
GL000219.1      179198  18446744072565626898    60      61
GL000224.1      179693  18446744072565809146    60      61
GL000223.1      180455  18446744072565991897    60      61
GL000195.1      182896  18446744072566175423    60      61
GL000212.1      186858  18446744072566361431    60      61
GL000222.1      186861  18446744072566551467    60      61
GL000200.1      187035  18446744072566741506    60      61
GL000193.1      189789  18446744072566931722    60      61
GL000194.1      191469  18446744072567124738    60      61
GL000225.1      211173  18446744072567319462    60      61
GL000192.1      547496  18446744072567534218    60      61

ahwanpandey avatar Jul 21 '20 03:07 ahwanpandey

I have the same issue on Windows; I'll possibly try to dig around with a debugger in the C part of the library at some point later this week, and also try running on different operating systems.

My sessionInfo():

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] Rsamtools_2.8.0             Biostrings_2.60.1           XVector_0.32.0              motifmatchr_1.14.0          chromVAR_1.14.0            
 [6] SummarizedExperiment_1.22.0 Biobase_2.52.0              GenomicRanges_1.44.0        GenomeInfoDb_1.28.0         modules_0.10.0             
[11] ggplot2_3.3.3               tibble_3.1.2                irlba_2.3.3                 uwot_0.1.10                 HDF5Array_1.20.0           
[16] DelayedArray_0.18.0         IRanges_2.26.0              S4Vectors_0.30.0            MatrixGenerics_1.4.0        matrixStats_0.59.0         
[21] BiocGenerics_0.38.0         Matrix_1.3-4                rhdf5_2.36.0               

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                DirichletMultinomial_1.34.0 TFBSTools_1.30.0            bit64_4.0.5                 httr_1.4.2                 
 [6] tools_4.1.0                 utf8_1.2.1                  R6_2.5.0                    DT_0.18                     lazyeval_0.2.2             
[11] seqLogo_1.58.0              DBI_1.1.1                   colorspace_2.0-1            rhdf5filters_1.4.0          withr_2.4.2                
[16] tidyselect_1.1.1            bit_4.0.4                   compiler_4.1.0              plotly_4.9.4                rtracklayer_1.52.0         
[21] caTools_1.18.2              scales_1.1.1                readr_1.4.0                 stringr_1.4.0               digest_0.6.27              
[26] R.utils_2.10.1              pkgconfig_2.0.3             htmltools_0.5.1.1           fastmap_1.1.0               BSgenome_1.60.0            
[31] htmlwidgets_1.5.3           rlang_0.4.11                rstudioapi_0.13             RSQLite_2.2.7               shiny_1.6.0                
[36] BiocIO_1.2.0                generics_0.1.0              jsonlite_1.7.2              BiocParallel_1.26.0         gtools_3.9.2               
[41] R.oo_1.24.0                 dplyr_1.0.6                 RCurl_1.98-1.3              magrittr_2.0.1              GO.db_3.13.0               
[46] GenomeInfoDbData_1.2.6      Rcpp_1.0.6                  munsell_0.5.0               Rhdf5lib_1.14.1             fansi_0.5.0                
[51] R.methodsS3_1.8.1           lifecycle_1.0.0             stringi_1.6.2               yaml_2.2.1                  zlibbioc_1.38.0            
[56] plyr_1.8.6                  grid_4.1.0                  blob_1.2.1                  promises_1.2.0.1            crayon_1.4.1               
[61] miniUI_0.1.1.1              CNEr_1.28.0                 lattice_0.20-44             annotate_1.70.0             KEGGREST_1.32.0            
[66] hms_1.1.0                   pillar_1.6.1                rjson_0.2.20                JASPAR2016_1.20.0           reshape2_1.4.4             
[71] TFMPvalue_0.0.8             XML_3.99-0.6                glue_1.4.2                  data.table_1.14.0           renv_0.13.2                
[76] BiocManager_1.30.15         png_0.1-7                   vctrs_0.3.8                 httpuv_1.6.1                tidyr_1.1.3                
[81] gtable_0.3.0                poweRlaw_0.70.6             purrr_0.3.4                 cachem_1.0.5                mime_0.10                  
[86] xtable_1.8-4                restfulr_0.0.13             pracma_2.3.3                later_1.2.0                 viridisLite_0.4.0          
[91] GenomicAlignments_1.28.0    AnnotationDbi_1.54.1        memoise_2.0.0               ellipsis_0.3.2

I get the error:

> scanFa(grcm39_genome, peak)
Error in value[[3L]](cond) :  record 1 (AY172335.1:71-372) failed
  file: 2021-02-22_sciatac/GRCm39.fna

when using a pre-built index (which looks valid). If I move the pre-built index and let RSamtools build it, the index is similarly invalid-looking (very high constant offset):

CM000994.3	195154279	82	80	81
GL456210.1	169725	197593918	80	81
GL456211.1	241735	197765893	80	81
GL456212.1	153618	198010778	80	81
GL456221.1	206961	198166445	80	81
MU069434.1	8412	198376122	80	81
GL456239.1	40056	198384768	80	81
CM000995.3	181755017	198425407	80	81
CM000996.3	159745316	382452444	80	81
CM000997.3	156860686	544194659	80	81
JH584295.1	1976	703016227	80	81
CM000998.3	151758149	703018310	80	81
JH584296.1	199368	856673564	80	81
JH584297.1	205776	856875553	80	81
JH584298.1	184189	857084030	80	81
GL456354.1	195993	857270650	80	81
JH584299.1	953012	857469221	80	81
CM000999.3	149588044	858434228	80	81
CM001000.3	144995196	1009892205	80	81
GL456219.1	175968	1156699969	80	81
CM001001.3	130127694	1156878219	80	81
CM001002.3	124359700	1288632592	80	81
CM001003.3	130530862	1414546872	80	81
CM001004.3	121973369	1546709453	80	81
CM001005.3	120092757	1670207573	80	81
CM001006.3	120883175	1791801573	80	81
CM001007.3	125139656	1914195871	80	81
CM001008.3	104073951	2040899856	80	81
CM001009.3	98008968	2146274815	80	81
CM001010.3	95294699	18446744071660093299	80	81
CM001011.3	90720763	18446744071756579265	80	81
CM001012.3	61420004	18446744071848434121	80	81
CM001013.3	169476592	18446744071910621958	80	81
GL456233.2	559103	18446744072082217136	80	81
CM001014.3	91455967	18446744072082783310	80	81
JH584300.1	182347	18446744072175382599	80	81
JH584301.1	259875	18446744072175567348	80	81
JH584302.1	155838	18446744072175830594	80	81
JH584303.1	158099	18446744072175988502	80	81
GL456367.1	42057	18446744072176148684	80	81
GL456378.1	31602	18446744072176191373	80	81
GL456381.1	25871	18446744072176223477	80	81
GL456382.1	23158	18446744072176249778	80	81
GL456383.1	38659	18446744072176273332	80	81
GL456385.1	35240	18446744072176312581	80	81
GL456390.1	24668	18446744072176348368	80	81
GL456392.1	23629	18446744072176373452	80	81
GL456394.1	24323	18446744072176397484	80	81
GL456359.1	22974	18446744072176422219	80	81
GL456360.1	31704	18446744072176445588	80	81
GL456396.1	21240	18446744072176477796	80	81
GL456372.1	28664	18446744072176499409	80	81
GL456387.1	24685	18446744072176528539	80	81
GL456389.1	28772	18446744072176553640	80	81
GL456370.1	26764	18446744072176582879	80	81
GL456379.1	72385	18446744072176610085	80	81
GL456366.1	47073	18446744072176683482	80	81
GL456368.1	20208	18446744072176731251	80	81
JH584304.1	114452	18446744072176751819	80	81
MU069435.1	31129	18446744072176867809	80	81
AY172335.1	16299	18446744072176899400	80	81

This is for the mouse GRCm39 genome.

If I had to guess, I think we are overflowing a variable / incorrectly sign-extending a smaller integer into a 64-bit integer. The given is 0b1111111111111111111111111111111110100100101001011001010101001000 in binary, e.g. the 32 bit number 0b10100100101001011001010101001000 sign-extended into a 64 bit integer.

This is especially true because the last offset that worked is 2,146,274,815, suspiciously less than the maximum of an int32 ( 2,147,483,647)

Specifically in some C++ I've done in the past, Windows has this weird thing about treating long's as 32 bit numbers instead of 64 bit numbers. There might be a long hanging out somewhere instead of the bit-labelled types"uint32_t" and so on.

meson800 avatar Jun 14 '21 21:06 meson800