Web history object - No esummary records found in file

Open d-caraballo opened this issue 2 years ago • 6 comments

Hi, I am trying to download all available bat coronaviruses. I used the following query: bat_cov_ids<-entrez_search(db="nuccore", term="Bat coronavirus", retmax = 10000) This returned 4214 hits, which I could access using entrez_summary coupled with the get-metadata function.

Now, I am trying to compare these results with a different search strategy. I want to search for all coronavirus sequences in the "nuccore" database and then filter for bat hosts using the standardised taxonomy, as in the tutorial.

I use the following code:

covs<-entrez_search(db="nuccore", term="txid11118[Organism]")

Which yields: Entrez search result with 4715446 hits (object contains 20 IDs and a web_history object) Search term (as translated): txid11118[Organism]

Then I use entrez_summary: entrez_summary(db="nuccore", web_history=covs$web_history)

And I get the message: Error during wrapup: No esummary records found in file

What is going wrong??

d-caraballo avatar Apr 27 '22 16:04 d-caraballo

You're trying to retrieve too many records, and the only response from the server is "Too many UIDs in request. Maximum number of UIDs is 500 for JSON format output."

allenbaron avatar Apr 27 '22 16:04 allenbaron

Thanks, Allen. But wasn't the point of using web_history precisely to avoid the "large request" problem? How can I get the complete record (4.7E6 hits!) and then filter by host species?

d-caraballo avatar Apr 27 '22 16:04 d-caraballo

I'm sorry to disappoint you, but you're going to have to do some extra work here if you want this to work. rentrez cannot handle this use case without extra coding.

Before you do anything else, I recommend you review the E-Utilities documentation, particularly where it discusses large requests in Usage Guidelines and Requirements.

rentrez does instantiate an Entrez History object when use_history = TRUE in entrez_search. An Entrez History object is basically required for large requests (> 200 records, I think), but the Entrez Utilities still limit how many records you can retrieve in a single request. For ESummary the limit depends on the record format requested: 500 for JSON and 10,000 for XML (for more details about each utility, see The E-utilities In-Depth: Parameters, Syntax and More). Obtaining more than that from a History object is possible but requires paging (see "Minimizing the Number of Requests" in the E-Utilities documentation; the Application 3 link provides an example of paging).
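To make the paging idea concrete, here is a minimal sketch in R of the arithmetic involved: splitting a result set of known size into pages no larger than the per-request limit. The helpers page_starts() and fetch_all_pages() are hypothetical names, not part of rentrez, and the offsets are assumed to be 0-based as in the Application 3 example; check the E-utilities documentation for the utility you call. The commented-out call at the end additionally assumes that entrez_summary() forwards retstart and retmax to the ESummary request, which I have not verified.

```r
# Hypothetical helper (not part of rentrez): offset of the first record on each page.
# Assumes 0-based offsets, as in the E-utilities Application 3 paging example.
page_starts <- function(total, page_size) {
  stopifnot(total >= 0, page_size > 0)
  if (total == 0) return(numeric(0))
  seq(from = 0, to = total - 1, by = page_size)
}

# Hypothetical driver: call fetch_page(retstart, retmax) once per page and
# collect the results in a list. fetch_page is supplied by the caller.
fetch_all_pages <- function(total, page_size, fetch_page) {
  lapply(page_starts(total, page_size), function(start) {
    fetch_page(retstart = start, retmax = page_size)
  })
}

# Not run. Untested assumption: entrez_summary() passes retstart/retmax
# through to the ESummary request.
# pages <- fetch_all_pages(covs$count, 500, function(retstart, retmax) {
#   rentrez::entrez_summary(db = "nuccore", web_history = covs$web_history,
#                           retstart = retstart, retmax = retmax)
# })
```

With the 500-record JSON limit, the 4,715,446 hits from the original search would need 9,431 separate requests, so the Usage Guidelines on request rates matter here.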

rentrez does not have the ability to page, so it will not work with the History object created. You could do this using the E-direct utilities on the command line, which I recommend if you are serious about getting this data. It might also be possible to get all the record IDs from entrez_search() and then request them in chunks of 10,000 with entrez_summary() but you should be aware that there is a bug in rentrez that prevents this from working (see PR #174). I fixed this specific issue in a fork when I realized rentrez is not being actively maintained.
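The chunking approach mentioned above could look roughly like the following sketch. It only shows the splitting logic: chunk_ids() and summarise_in_chunks() are hypothetical helpers, not rentrez functions, and the actual request is passed in as a function so the chunk size can be matched to the format limit (500 for JSON, 10,000 for XML). Given the bug referenced in PR #174, the commented-out rentrez call may not work as written.

```r
# Hypothetical helper: split a vector of IDs into consecutive chunks
# of at most chunk_size elements, preserving order.
chunk_ids <- function(ids, chunk_size) {
  stopifnot(chunk_size > 0)
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / chunk_size)))
}

# Hypothetical driver: apply fetch(chunk) to each chunk and return a list
# with one result per chunk. fetch is supplied by the caller.
summarise_in_chunks <- function(ids, chunk_size, fetch) {
  lapply(chunk_ids(ids, chunk_size), fetch)
}

# Not run. all_ids is assumed to hold every ID returned by entrez_search().
# summaries <- summarise_in_chunks(all_ids, 500, function(chunk) {
#   rentrez::entrez_summary(db = "nuccore", id = chunk)
# })
```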

One more thing for your consideration: the first 10,000 records of your request have a size of 221 MB.

allenbaron avatar Apr 27 '22 17:04 allenbaron

You seem to have the same problem as I have. I did find a way around this problem (at least it worked for me with pubmed). You can use an lapply or for loop, I included my code in issue #180.

LauraVP1994 avatar Jul 15 '22 08:07 LauraVP1994

Ideally, rentrez would be updated to implement the E-utilities paging feature with a web history.

allenbaron avatar Jul 20 '22 14:07 allenbaron

Encountered the same issue:

rentrez::entrez_summary(db="gds", web_history=esearch$web_history)
# Esummary includes error message: Too many UIDs in request. Maximum number of UIDs is 500 for JSON format output. 

This got more confusing when I specified retmode="XML" in the hope that it would rectify the problem:

rentrez::entrez_summary(db="gds", web_history=esearch$web_history, retmode="XML")
# Error in UseMethod("parse_esummary") : 
# no applicable method for 'parse_esummary' applied to an object of class "character"

Since documentation specifically says to use the web_history argument when the number of records is too large, it should be documented that it is not a panacea and how to work with a large number of records.

I will try to submit a PR once I figure out how to do it cleanly.

J-Moravec avatar Sep 29 '22 01:09 J-Moravec