GenomeInfoDb error: seqlevelsStyle(annotations) <- 'UCSC'

Hi, I got this error message:

annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
Error in find_NCBI_assembly_ftp_dir(assembly_accession, assembly_name = assembly_name) :
  unable to find FTP dir for assembly GCF_000001405.40 in https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/

Thanks for your help!

Mar 14 '24 16:03 Kateahn

What's GetGRangesFromEnsDb? Where is this function defined? How am I supposed to reproduce this?

Please provide a reproducible example and show your sessionInfo().

Best

Mar 18 '24 18:03 hpages

Hi,

I'm encountering a similar error on a server that's behind an HTTP proxy (the R proxy settings are configured accordingly). In our case, the problem is that in list_ftp_dir, the proxy server converts the FTP directory listing into HTML, which the function doesn't parse correctly.

For example, without the proxy, the following would give this output:

list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.20_GRCm38")
[1] "assembly_status.txt"                         
[2] "GCF_000001635.20_GRCm38_assembly_regions.txt"
[3] "GCF_000001635.20_GRCm38_assembly_report.txt" 
[4] "GCF_000001635.20_GRCm38_assembly_stats.txt"  
[5] "md5checksums.txt"                            
[6] "README.txt"

Behind an http proxy, however, it gives the following output (I'm skipping a lot of irrelevant lines here):

...
[137] "href=\"assembly_status.txt\">assembly_status.txt</a></td><td"                                                  
[138] "href=\"GCF_000001635.20_GRCm38_assembly_regions.txt\">GCF_000001635.20_GRCm38_assembly_regions.txt</a></td><td"
[139] "href=\"GCF_000001635.20_GRCm38_assembly_report.txt\">GCF_000001635.20_GRCm38_assembly_report.txt</a></td><td"  
[140] "href=\"GCF_000001635.20_GRCm38_assembly_stats.txt\">GCF_000001635.20_GRCm38_assembly_stats.txt</a></td><td"    
[141] "href=\"md5checksums.txt\">md5checksums.txt</a></td><td"                                                        
[142] "href=\"README.txt\">README.txt</a></td><td"                                                                    
[143] ""                                                                                                              
[144] ""                                                                                                              
...

It's not a very elegant solution, but changing the return line of list_ftp_dir to this fixes the problem:

gsub(".+\">|</a.+","",ans)

(this is from bioc release 3.18, but I'm pretty sure the same would be true of the latest version)
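
To illustrate what that substitution does, here is a minimal sketch applied to a few of the contaminated elements shown above. The input vector is copied by hand from the proxy output, and `ans` is only assumed to stand in for the value returned inside list_ftp_dir; dropping the empty strings afterwards is an extra step added here for readability, not part of the one-line change.

```r
# A few elements as they appear behind the http-only proxy (copied from above).
ans <- c(
  "href=\"assembly_status.txt\">assembly_status.txt</a></td><td",
  "href=\"md5checksums.txt\">md5checksums.txt</a></td><td",
  "href=\"README.txt\">README.txt</a></td><td",
  ""
)

# Strip everything up to and including the closing '">' of the href attribute,
# and everything from '</a' to the end of the element.
cleaned <- gsub(".+\">|</a.+", "", ans)

# Extra step (assumption): drop the empty elements left over from the HTML.
cleaned <- cleaned[nzchar(cleaned)]
print(cleaned)
```

Elements that carry no link at all (the "irrelevant lines" skipped above) are left unchanged by the substitution, so this only works as long as the downstream code ignores them.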

May 16 '24 14:05 plger

Hi Pierre-Luc, @plger

It seems like it is a consequence of the proxy? The code in GenomeInfoDb would have to be conditional on the proxy settings, though the behavior may not be reproducible across proxies (?). A workaround is to read the HTML and parse it with something like:

suppressPackageStartupMessages(library(rvest))
## insert your output in read_html
read_html("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.20_GRCm38/") |>
html_nodes("a") |>
html_text()
#> [1] "Parent Directory"                            
#> [2] "GCF_000001635.20_GRCm38_assembly_regions.txt"
#> [3] "GCF_000001635.20_GRCm38_assembly_report.txt" 
#> [4] "GCF_000001635.20_GRCm38_assembly_stats.txt"  
#> [5] "README.txt"                                  
#> [6] "assembly_status.txt"                         
#> [7] "md5checksums.txt"                            
#> [8] "HHS Vulnerability Disclosure"

Created on 2024-05-16 with reprex v2.1.0

May 16 '24 14:05 LiNk-NY

Yes, it's a consequence of an http-only proxy (which is a pain, but not so uncommon either). I haven't tried, but I think a proxy that channels ftp as such (i.e. not passing it via http) would not get the output converted to html. So I think it'd be preferable to check if the output of getURL2 is an html or text, and parse accordingly, rather than conditioning on the proxy settings.

The exact html will differ a little according to the proxy (e.g. ours adds some text at the bottom), but I would think that the parsing you propose would work on all of them.
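
A rough sketch of that "detect, then parse accordingly" idea, in base R only. The helper name `parse_dir_listing` is hypothetical and not part of GenomeInfoDb, and the assumption that the raw listing arrives as a character vector of whitespace-split elements is based only on the proxy output shown earlier in this thread.

```r
# Hypothetical helper (not part of GenomeInfoDb): return file names from a
# directory listing that may or may not have been converted to HTML.
parse_dir_listing <- function(ans) {
  # Assumption: an HTML listing always contains at least one closing anchor tag.
  looks_like_html <- any(grepl("</a>", ans, fixed = TRUE))
  if (!looks_like_html) {
    # Plain-text listing: keep the non-empty entries as they are.
    return(ans[nzchar(ans)])
  }
  # HTML listing: keep only elements that carry a link, then extract the
  # target of the href attribute.
  links <- grep("href=\"[^\"]+\"", ans, value = TRUE)
  sub(".*href=\"([^\"]+)\".*", "\\1", links)
}

plain <- c("assembly_status.txt", "README.txt", "")
html  <- c("<tr><td><a",
           "href=\"assembly_status.txt\">assembly_status.txt</a></td><td",
           "href=\"README.txt\">README.txt</a></td><td",
           "")
print(parse_dir_listing(plain))
print(parse_dir_listing(html))
```

Any extra navigation links that a particular server or proxy adds to the page would still come through as entries here, so some site-specific filtering would remain necessary.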

May 16 '24 15:05 plger

Already reported and discussed in issue #99 with no satisfying solution so far.

"So I think it'd be preferable to check if the output of getURL2 is an html or text, and parse accordingly, rather than conditioning on the proxy settings."

Not really robust. This assumes that the organization running the FTP server runs an HTTP server on top of it, which is not always the case. Even when they do, there are many ways to configure this. For example in the case of NCBI they add the "HHS Vulnerability Disclosure" link that messes up the parsing. In the case of Ensembl (https://ftp.ensembl.org/pub/), they add the "Name", "Last modified", and "Size Description" links that also mess up the parsing. Etc...

The best course of action is to complain to your organization, @plger, that the http-only proxy breaks some basic bioinformatics tools, and to kindly ask them to reconsider.

H.

May 16 '24 19:05 hpages

I agree, as an example, parsing the HTML page from ensembl would not be very robust:

suppressPackageStartupMessages(library(rvest))
html_table(read_html("https://ftp.ensembl.org/pub/"))[[1L]][-1L, c("Name", "Last modified", "Size")]
#> # A tibble: 136 × 3
#>    Name                    `Last modified`    Size 
#>    <chr>                   <chr>              <chr>
#>  1 Parent Directory        ""                 -    
#>  2 IPI/                    "2016-12-05 11:53" -    
#>  3 PRIVACY-NOTICE.pdf      "2018-07-26 14:04" 64K  
#>  4 VERSION                 "2024-05-13 15:49" 4    
#>  5 assembly/               "2016-12-05 11:53" -    
#>  6 assembly_mapping/       "2016-12-05 11:53" -    
#>  7 bamExample.bam          "2011-02-07 17:09" 2.2M 
#>  8 current/                "2024-05-14 11:33" -    
#>  9 current_README          "2024-05-14 16:37" 1.3K 
#> 10 current_assembly_chain/ "2024-04-22 15:34" -    
#> # ℹ 126 more rows

Created on 2024-05-16 with reprex v2.1.0

May 16 '24 20:05 LiNk-NY

It's not like I haven't tried complaining to the university IT already :)

Personally I find a not-quite-robust solution that tends to work and sometimes fails is still better than it not working at all. But I won't complain further about really great packages, so as far as I'm concerned feel free to close the issue (though it wasn't mine in the first place).

May 17 '24 07:05 plger

"Personally I find a not-quite-robust solution that tends to work and sometimes fails is still better than it not working at all."

Not from a package maintainer perspective. But you're welcome to submit a PR that works with NCBI, Ensembl, UCSC, and any other bioinformatics resource that you can think of. With unit tests of course.

May 22 '24 04:05 hpages

I am getting the following error:

seqlevelsStyle(annotation) <- "UCSC"
Error in find_NCBI_assembly_ftp_dir(assembly_accession, assembly_name = assembly_name) :
  unable to find FTP dir for assembly GCF_000001405.40 in https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/

How do I resolve this issue?

May 28 '24 15:05 Urja25

@Urja25 Are you also behind an http-only proxy? Please show the output of list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"). FWIW here is what I get:

> library(GenomeInfoDb)
> list_ftp_dir("ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/")
 [1] "GCF_000001405.10_NCBI34"     "GCF_000001405.11_NCBI35"    
 [3] "GCF_000001405.12_NCBI36"     "GCF_000001405.13_GRCh37"    
 [5] "GCF_000001405.14_GRCh37.p2"  "GCF_000001405.17_GRCh37.p5" 
 [7] "GCF_000001405.21_GRCh37.p9"  "GCF_000001405.22_GRCh37.p10"
 [9] "GCF_000001405.23_GRCh37.p11" "GCF_000001405.24_GRCh37.p12"
[11] "GCF_000001405.25_GRCh37.p13" "GCF_000001405.26_GRCh38"    
[13] "GCF_000001405.27_GRCh38.p1"  "GCF_000001405.28_GRCh38.p2" 
[15] "GCF_000001405.29_GRCh38.p3"  "GCF_000001405.30_GRCh38.p4" 
[17] "GCF_000001405.31_GRCh38.p5"  "GCF_000001405.32_GRCh38.p6" 
[19] "GCF_000001405.33_GRCh38.p7"  "GCF_000001405.34_GRCh38.p8" 
[21] "GCF_000001405.35_GRCh38.p9"  "GCF_000001405.36_GRCh38.p10"
[23] "GCF_000001405.37_GRCh38.p11" "GCF_000001405.38_GRCh38.p12"
[25] "GCF_000001405.39_GRCh38.p13" "GCF_000001405.40_GRCh38.p14"
[27] "GCF_000001405.8_NCBI33"      "GCF_000001405.9_NCBI34"

Unfortunately, if you are behind an http-only proxy, the output of list_ftp_dir() will be contaminated with HTML, which breaks higher-level functionality that depends on it.
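
One quick way to tell the two situations apart is to scan the listing for HTML residue. The sketch below is only an illustration: `has_html_residue` is a hypothetical helper, not a GenomeInfoDb function, and the `dirty` vector is a made-up example modeled on the proxy output posted earlier in this thread.

```r
# Hypothetical check (not part of GenomeInfoDb): does a directory listing
# contain obvious HTML fragments?
has_html_residue <- function(listing) {
  any(grepl("href=|</a>|<td|<html", listing, ignore.case = TRUE))
}

# What a clean listing looks like (two entries taken from the output above).
clean <- c("GCF_000001405.39_GRCh38.p13", "GCF_000001405.40_GRCh38.p14")

# Made-up example of a contaminated listing, in the style reported earlier.
dirty <- c(
  "href=\"GCF_000001405.40_GRCh38.p14/\">GCF_000001405.40_GRCh38.p14/</a></td><td",
  ""
)

print(has_html_residue(clean))
print(has_html_residue(dirty))

# On a real system one would pass the actual listing instead, e.g.:
# has_html_residue(list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"))
```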

Have you reported this to your organization? Tell them that the http-only proxy breaks some basic bioinformatics tools and kindly ask them to reconsider. Alternatively you're welcome to submit a PR that makes list_ftp_dir() work over http-only proxies (see my previous post above).

Best, H.

May 31 '24 05:05 hpages
