Infinite PubMed search for the term "lie detector"
Steps to reproduce:
- Enter the following terms on openknowledgemaps.org using the PubMed integration: lie detector
- Search does not complete, endless loading bar
Possible instance of #225
@sckott maybe you can have a look? On the side of our backend (pubmed.R) the search returns an error 500. The search was for "lie detector" with the following parameters: 1809-01-01, 2017-12-04, most-recent.
The error message also contained the following return:
Server Error
Your request could not be processed due to a problem on our Web server. This could be a transient problem, please try the query again. If it doesn't clear up within a reasonable period of time, e-mail a short description of your query and the diagnostic information shown below to:
[email protected] - for problems with PubMed
[email protected] - for problems with other services
Thank you for your assistance. We will try to fix the problem as soon as possible.
Diagnostic Information:
NOTE: The above is an internal URL which may differ from the one you used to address the page.
Rev. 01/04/08
will do!
@chreman what's the maximum limit that would go into the pubmed function? seems the user in the web UI can't set a limit. Internally, do we always use the default of 100?
It looks like we could request a lower number of records per HTTP request and simply make more requests - but first I'm curious about that limit value. Making more requests may take more time, but it may be our only option, not sure.
Yes, we always use the default limit of 100. At first glance, I couldn't discern anything that would make this query different from any other.
Okay thanks.
I don't see what makes it different either. I need to do more testing.
are there any other errors from pubmed in the logs we can use to narrow this down?
Sorry for the long delay - I also can't generate more information from the OKM-R backend without digging deeper into rentrez, which I don't know much about.
@chreman can we replicate this error again at least?
@sckott yes, an error can be replicated with the following. I'm explicitly saying "an" error, because the error for the above-mentioned example has changed. It has been a while since I last looked at this (sorry), so I suspect they made an API change. Also, the returned error is now more specific and helpful, I hope.
x <- rentrez::entrez_search(db = "pubmed", term = "lie detector AND hasabstract", retmax = 100, mindate = "1809/01/01", maxdate = "2017/12/04", sort="", use_history=TRUE)
res <- rentrez::entrez_fetch(db = "pubmed", web_history = x$web_history, retmax = 100, rettype = "xml")
for which it will fail at the second step with
Error in entrez_check(response) :
HTTP failure: 502, bad gateway. This error code is often returned when trying to download many records in a single request. Try using web history as described in the rentrez tutorial
This is for rentrez 1.2.1
thanks @chreman i'll take a look next week, leaving for vacation tomorrow
the timeout is 30 seconds on their end, and I haven't been able to change that, which makes sense
i propose splitting the 100 requested records into two requests, with two separate entrez_search queries and then two separate entrez_fetch queries. Doing two entrez_fetch calls against the same entrez_search with 100 records doesn't seem to work, I think because when they share the same entrez_search result their server tries to give back all results on each entrez_fetch call.
library(rentrez)
search1 <- rentrez::entrez_search(db = "pubmed", term = "lie detector AND hasabstract", retmax = 50,
mindate = "1809/01/01", maxdate = "2017/12/04", sort="", use_history=TRUE)
search2 <- rentrez::entrez_search(db = "pubmed", term = "lie detector AND hasabstract", retmax = 50, retstart = 50,
mindate = "1809/01/01", maxdate = "2017/12/04", sort="", use_history=TRUE)
res1 <- rentrez::entrez_fetch(db = "pubmed", web_history = search1$web_history,
retmax = 50, rettype = "xml")
res2 <- rentrez::entrez_fetch(db = "pubmed", web_history = search2$web_history,
retmax = 50, rettype = "xml")
Then I think the results from the two calls can be combined with c()
xml <- c(xml2::xml_children(xml2::read_xml(res1)), xml2::xml_children(xml2::read_xml(res2)))
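For what it's worth, a quick sketch of working with the combined list - the ArticleTitle xpath is my assumption about the efetch XML layout, not something taken from the backend:
# sketch: each element of xml should be one PubmedArticle node from the efetch output
titles <- vapply(xml, function(z) {
  xml2::xml_text(xml2::xml_find_first(z, ".//ArticleTitle"))
}, character(1))
length(titles)  # should equal the number of records fetched across both requests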
thoughts @chreman?
Thanks for that proposal! We also have to make sure it works for the case when the result set is < 50 (e.g. when the timeframe is short, such as mindate = "2019/06/06"). I'll implement it, run a few tests to compare outputs and timings, and get back to you.
I had to adapt the solution slightly to accommodate our limit param, and this approach solves the immediate problem of running into server-side timeouts.
# first chunk: at most 50 records
limit1 <- ifelse(limit <= 50, limit, 50)
# second chunk: whatever remains above 50 (<= 0 if limit fits in one request)
limit2 <- limit - 50
search1 <- rentrez::entrez_search(db = "pubmed", term = query, retmax = limit1,
                                  mindate = from, maxdate = to, sort = sortby, use_history = TRUE)
res1 <- rentrez::entrez_fetch(db = "pubmed", web_history = search1$web_history,
                              retmax = limit1, rettype = "xml")
if (limit2 > 0) {
  # second request picks up at record 50
  search2 <- rentrez::entrez_search(db = "pubmed", term = query, retmax = limit2, retstart = 50,
                                    mindate = from, maxdate = to, sort = sortby, use_history = TRUE)
  res2 <- rentrez::entrez_fetch(db = "pubmed", web_history = search2$web_history,
                                retmax = limit2, rettype = "xml")
  # combine the children of both result documents into one list of records
  xml <- c(xml2::xml_children(xml2::read_xml(res1)), xml2::xml_children(xml2::read_xml(res2)))
} else {
  xml <- xml2::xml_children(xml2::read_xml(res1))
}
Unfortunately, with this implementation we're running into problems later on, specifically here:
# summaries for the whole result set in a single request
summary <- rentrez::entrez_summary(db = "pubmed", web_history = x$web_history, retmax = limit)
# "readers" = number of citing articles in PMC; treat empty values as 0
df$readers <- extract_from_esummary(summary, "pmcrefcount")
df$readers <- replace(df$readers, df$readers == "", 0)
pmc_ids <- c()
# per-record identifiers (pubmed id, doi, pmc id, ...)
idlist <- extract_from_esummary(summary, "articleids")
where we are adding readers and ids from a summary object. At this point we would need to merge either the two summary objects or the two dataframes, which may become messy. Do you see any possibility to extract those data points from the overall XML object as well, i.e. inside out <- lapply(xml, function(z) { ?
If not, we'll have to think of a slightly larger rework.
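To illustrate what I mean for the articleids part, here is a rough sketch (not from pubmed.R; the xpath is an assumption about the standard PubmedArticle layout, and as far as I can tell pmcrefcount is not part of the efetch XML at all, so it would still have to come from entrez_summary):
# sketch only: extract the article ids from a single PubmedArticle node z
# inside the existing lapply(xml, function(z) { ... })
ids <- xml2::xml_find_all(z, ".//PubmedData/ArticleIdList/ArticleId")
data.frame(
  idtype = xml2::xml_attr(ids, "IdType"),  # e.g. "pubmed", "doi", "pmc"
  value  = xml2::xml_text(ids),
  stringsAsFactors = FALSE
)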
@sckott I implemented a solution here
It solves the issue of responses that are too big for a single entrez_search. The resulting response object is ~100 MB for the query on "lie detector".
Unfortunately, the solution still fails, this time at entrez_summary, which would create an object of ~150 MB, for reasons as yet unknown.
Is there a way to request only the data we need ("pmcrefcount" and "articleids") when requesting the summary?
i don't know, but will look
- [x] ask David Winter if rentrez allows returning only certain fields (it doesn't look like it's possible, but perhaps I'm missing something) - THIS IS NOT POSSIBLE:
David said you cannot request specific fields.
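Since specific fields can't be requested, one possible workaround might be to page the summary call and keep only the two fields we need. This is only a sketch - it assumes esummary honours retstart/retmax when used with a web_history object (rentrez passes extra arguments through to the API), and it reuses x$web_history and limit from the existing summary step:
# sketch: fetch summaries in chunks of 50 and keep just pmcrefcount / articleids,
# discarding each chunk's full summary object as we go
chunk_size <- 50
readers <- c()
idlist <- list()
for (s in seq(0, limit - 1, by = chunk_size)) {
  sm <- rentrez::entrez_summary(db = "pubmed", web_history = x$web_history,
                                retstart = s, retmax = min(chunk_size, limit - s))
  readers <- c(readers, rentrez::extract_from_esummary(sm, "pmcrefcount"))
  idlist  <- c(idlist, rentrez::extract_from_esummary(sm, "articleids", simplify = FALSE))
}
Whether the per-chunk responses stay small enough for the "lie detector" query would need testing.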