
Trouble downloading specimen + sequence data from BOLD with bold_seqspec() in case of either improper quoting or missing fields

Open mdogniez opened this issue 9 months ago • 6 comments

Hi,

I'm starting work on a DNA metabarcoding project, and I have been following Devon O'Rourke's excellent tutorial (https://forum.qiime2.org/t/building-a-coi-database-from-bold-references/16129) to build my COI reference library from BOLD for the QIIME2 pipeline.

When downloading data with the bold_seqspec() function, my progress was halted several times by the two following messages (shown in my French locale, where "Avis" means "Warning"):

> other_acti_list <- lapply(other_acti_names, bold_seqspec)
Avis : Found and resolved improper quoting out-of-sample. First healed line 202543: <<HEEN006-18	AI-1812.1	8741765		AI-1812	University of Colorado, Boulder		BOLD:ADM2338	18	Chordata	77	Actinopterygii	243	Cypriniformes	775028	Botiidae	86731	Botiinae	106096	Botia					Jake Lowenstein							Richard and Jake				"Salween River, Thai/Myanmar Border" Aqua Imports, Boulder, CO [email protected]			Adult																Thailand	Mae Hong Son	Thai/Myanmar Border		Salween River	3312717|3312718|3312716	http://www.boldsystems.org/pics/HEEN/EBIO_4460_2018_IMG_24362018-05-07+1526073900.J>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
> other_acti_list <- lapply(other_acti_names, bold_seqspec)
Avis : Stopped early on line 213843. Expected 80 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<<style type="text/css">>>

My initial (and rudimentary) solution was simply to exclude the problematic sequence records by hand, as most of them are not relevant to my study anyway. However, as I worked through the different phyla in the BOLD database, I realised that these errors were far too frequent to keep handling manually.

Is there a way to get past these errors so that I can download all my sequences automatically?

Thanks in advance!

PS: Sorry if this is a naive question. I'm very new to this topic, and to bioinformatics in general.
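
For illustration, here is a minimal sketch of one possible workaround, following the hints in the messages above (quote="" and dropping lines that do not have the expected number of fields). It is not the package's own fix. The helper names parse_bold_tsv() and fetch_bold_safe() are hypothetical, and the sketch assumes that bold_seqspec(response = TRUE) returns the raw crul HTTP response, whose text can be read with its parse() method, and that the response is UTF-8.

```r
# Hypothetical helper: parse raw BOLD TSV text while tolerating stray quotes
# and non-tabular lines (for example an HTML fragment inside the response).
parse_bold_tsv <- function(txt) {
  lines <- strsplit(txt, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]

  # Count tabs per line; the header line defines the expected field count.
  n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
  keep <- n_tabs == n_tabs[1]
  if (any(!keep)) {
    message("Dropping ", sum(!keep), " line(s) with an unexpected field count")
  }

  # quote = "" disables quote handling, so a stray '"' inside a field is
  # kept as literal text instead of swallowing the following fields.
  read.delim(
    text = lines[keep],
    quote = "",
    comment.char = "",
    colClasses = "character",
    check.names = FALSE
  )
}

# Hypothetical wrapper: download one taxon and return NULL on failure, so
# that a single bad request does not abort the whole lapply() run.
# Assumption: response = TRUE returns the crul response object.
fetch_bold_safe <- function(taxon) {
  tryCatch({
    res <- bold::bold_seqspec(taxon = taxon, response = TRUE)
    parse_bold_tsv(res$parse("UTF-8"))
  }, error = function(e) {
    message("Skipping ", taxon, ": ", conditionMessage(e))
    NULL
  })
}

# Usage (requires network access and the bold package):
# other_acti_list <- lapply(other_acti_names, fetch_bold_safe)
```

Dropped lines are reported but not recovered, so records that appear after an embedded HTML fragment are kept, while any record split across lines is lost. Whether that trade-off is acceptable depends on the study.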

Session Info
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=French_Belgium.utf8  LC_CTYPE=French_Belgium.utf8    LC_MONETARY=French_Belgium.utf8
[4] LC_NUMERIC=C                    LC_TIME=French_Belgium.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_1.4.3   tibble_3.2.1   tidyr_1.3.0    refdb_0.1.1    dplyr_1.1.3    taxize_0.9.100 bold_1.3.0    

loaded via a namespace (and not attached):
 [1] pkgload_1.3.3      jsonlite_1.8.7     foreach_1.5.2      shiny_1.7.5        triebeard_0.4.1    urltools_1.7.3    
 [7] cellranger_1.1.0   remotes_2.4.2.1    yaml_2.3.7         ggrepel_0.9.5      sessioninfo_1.2.2  pillar_1.9.0      
[13] lattice_0.20-45    glue_1.6.2         uuid_1.1-1         digest_0.6.31      promises_1.2.0.1   colorspace_2.1-0  
[19] cowplot_1.1.3      htmltools_0.5.4    httpuv_1.6.11      pkgconfig_2.0.3    devtools_2.4.5     ggspatial_1.1.9   
[25] httpcode_0.3.0     purrr_1.0.1        xtable_1.8-4       scales_1.2.1       processx_3.8.2     later_1.3.1       
[31] proxy_0.4-27       generics_0.1.3     ggplot2_3.4.3      usethis_2.2.2      ellipsis_0.3.2     cachem_1.0.6      
[37] withr_2.5.0        cli_3.6.0          magrittr_2.0.3     crayon_1.5.2       mime_0.12          ps_1.7.2          
[43] memoise_2.0.1      evaluate_0.21      fs_1.5.2           fansi_1.0.4        nlme_3.1-160       xml2_1.3.5        
[49] class_7.3-20       pkgbuild_1.4.2     profvis_0.3.8      prettyunits_1.2.0  tools_4.2.2        data.table_1.14.8 
[55] lifecycle_1.0.3    stringr_1.5.0      munsell_0.5.0      callr_3.7.3        compiler_4.2.2     e1071_1.7-13      
[61] rlang_1.1.1        classInt_0.4-10    units_0.8-4        grid_4.2.2         conditionz_0.1.0   iterators_1.0.14  
[67] rstudioapi_0.15.0  htmlwidgets_1.6.2  miniUI_0.1.1.1     rmarkdown_2.25     gtable_0.3.4       codetools_0.2-18  
[73] DBI_1.1.3          curl_4.3.3         R6_2.5.1           zoo_1.8-12         knitr_1.44         fastmap_1.1.1     
[79] utf8_1.2.3         KernSmooth_2.23-20 ape_5.7-1          stringi_1.7.8      parallel_4.2.2     crul_1.4.0        
[85] Rcpp_1.0.11        vctrs_0.6.2        sf_1.0-16          urlchecker_1.0.1   tidyselect_1.2.0   xfun_0.40         
[91] coda_0.19-4

mdogniez avatar May 08 '24 14:05 mdogniez