Infinite PubMed search for the term "lie detector"
Steps to reproduce:
- Enter the following terms on openknowledgemaps.org using the PubMed integration: lie detector
- Search does not complete, endless loading bar
Possible instance of #225
@sckott maybe you can have a look? On the side of our backend (pubmed.R) the search returns an error 500. The search was for "lie detector" with the following parameters: 1809-01-01, 2017-12-04, most-recent.
The error message also contained the following return:
Server Error
Your request could not be processed due to a problem on our Web server. This could be a transient problem, please try the query again. If it doesn't clear up within a reasonable period of time, e-mail a short description of your query and the diagnostic information shown below to:
[email protected] - for problems with PubMed
[email protected] - for problems with other services
Thank you for your assistance. We will try to fix the problem as soon as possible.
Diagnostic Information:
NOTE: The above is an internal URL which may differ from the one you used to address the page.
Rev. 01/04/08
will do!
@chreman what's the maximum limit that would go into the pubmed function? seems the user in the web UI can't set a limit. Internally, do we always use the default of 100?
It looks like we could request a lower number of records per HTTP request and simply make more requests - but first I'm curious about that limit value. Making more requests may take more time, but it may be our only option, not sure.
Yes, we always use the default limit of 100. At first glance, I couldn't discern anything that would make this query different from any other.
Okay thanks.
I don't see what makes it different either. I need to do more testing.
are there any other errors from pubmed in the logs we can use to narrow this down?
Sorry for the long delay - I also can't generate more information from the OKM-R backend without digging deeper into rentrez, which I don't know much about.
@chreman can we replicate this error again at least?
@sckott yes, an error can be replicated with the following. I'm explicitly saying "an" error, because the error for the above-mentioned example has changed. It has been a while since I last looked at this (sorry), so I suspect they made an API change. Also, the returned error is now more specific and helpful, I hope.
x <- rentrez::entrez_search(db = "pubmed", term = "lie detector AND hasabstract", retmax = 100, mindate = "1809/01/01", maxdate = "2017/12/04", sort="", use_history=TRUE)
res <- rentrez::entrez_fetch(db = "pubmed", web_history = x$web_history, retmax = 100, rettype = "xml")
for which it will fail at the second step with
Error in entrez_check(response) :
HTTP failure: 502, bad gateway. This error code is often returned when trying to download many records in a single request. Try using web history as described in the rentrez tutorial
This is for rentrez 1.2.1
thanks @chreman i'll take a look next week, leaving for vacation tomorrow
the timeout is 30 seconds on their end, and I haven't been able to change that, which makes sense
i propose splitting the 100 requested records into two requests, with two separate entrez_search queries and then two separate entrez_fetch queries. Doing two entrez_fetch calls against the same entrez_search with 100 records doesn't seem to work, I think because when they share the same entrez_search result their server tries to give back all results on each entrez_fetch call.
library(rentrez)
search1 <- rentrez::entrez_search(db = "pubmed", term = "lie detector AND hasabstract", retmax = 50,
mindate = "1809/01/01", maxdate = "2017/12/04", sort="", use_history=TRUE)
search2 <- rentrez::entrez_search(db = "pubmed", term = "lie detector AND hasabstract", retmax = 50, retstart = 50,
mindate = "1809/01/01", maxdate = "2017/12/04", sort="", use_history=TRUE)
res1 <- rentrez::entrez_fetch(db = "pubmed", web_history = search1$web_history,
retmax = 50, rettype = "xml")
res2 <- rentrez::entrez_fetch(db = "pubmed", web_history = search2$web_history,
retmax = 50, rettype = "xml")
Then I think the results from the two calls can be combined with c()
xml <- c(xml2::xml_children(xml2::read_xml(res1)), xml2::xml_children(xml2::read_xml(res2)))
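For what it's worth, a quick sketch of working with the combined list - the ArticleTitle xpath is my assumption about the efetch XML layout, not something taken from the backend:
# sketch: each element of xml should be one PubmedArticle node from the efetch output
titles <- vapply(xml, function(z) {
  xml2::xml_text(xml2::xml_find_first(z, ".//ArticleTitle"))
}, character(1))
length(titles)  # should equal the number of records fetched across both requests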
thoughts @chreman?
Thanks for that proposal! We also have to make sure it works for the case when the result set is < 50 (e.g. when the timeframe is short, such as mindate = "2019/06/06"). I'll implement it, run a few tests to compare outputs and timings, and get back to you.
I had to adapt the solution slightly to accommodate our limit param, and this approach solves the immediate problem of running into server-side timeouts.
# first chunk: at most 50 records
limit1 <- ifelse(limit <= 50, limit, 50)
# second chunk: whatever remains above 50 (<= 0 if limit fits in one request)
limit2 <- limit - 50
search1 <- rentrez::entrez_search(db = "pubmed", term = query, retmax = limit1,
                                  mindate = from, maxdate = to, sort = sortby, use_history = TRUE)
res1 <- rentrez::entrez_fetch(db = "pubmed", web_history = search1$web_history,
                              retmax = limit1, rettype = "xml")
if (limit2 > 0) {
  # second request picks up at record 50
  search2 <- rentrez::entrez_search(db = "pubmed", term = query, retmax = limit2, retstart = 50,
                                    mindate = from, maxdate = to, sort = sortby, use_history = TRUE)
  res2 <- rentrez::entrez_fetch(db = "pubmed", web_history = search2$web_history,
                                retmax = limit2, rettype = "xml")
  # combine the children of both result documents into one list of records
  xml <- c(xml2::xml_children(xml2::read_xml(res1)), xml2::xml_children(xml2::read_xml(res2)))
} else {
  xml <- xml2::xml_children(xml2::read_xml(res1))
}
Unfortunately, with this implementation we're running into problems later on, specifically here:
# summaries for the whole result set in a single request
summary <- rentrez::entrez_summary(db = "pubmed", web_history = x$web_history, retmax = limit)
# "readers" = number of citing articles in PMC; treat empty values as 0
df$readers <- extract_from_esummary(summary, "pmcrefcount")
df$readers <- replace(df$readers, df$readers == "", 0)
pmc_ids <- c()
# per-record identifiers (pubmed id, doi, pmc id, ...)
idlist <- extract_from_esummary(summary, "articleids")
where we are adding readers and ids from a summary object. At this point we would need to merge either the two summary objects or the two dataframes, which may become messy. Do you see any possibility to extract those data points from the overall XML object as well, i.e. inside out <- lapply(xml, function(z) { ?
If not, we'll have to think of a slightly larger rework.
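To illustrate what I mean for the articleids part, here is a rough sketch (not from pubmed.R; the xpath is an assumption about the standard PubmedArticle layout, and as far as I can tell pmcrefcount is not part of the efetch XML at all, so it would still have to come from entrez_summary):
# sketch only: extract the article ids from a single PubmedArticle node z
# inside the existing lapply(xml, function(z) { ... })
ids <- xml2::xml_find_all(z, ".//PubmedData/ArticleIdList/ArticleId")
data.frame(
  idtype = xml2::xml_attr(ids, "IdType"),  # e.g. "pubmed", "doi", "pmc"
  value  = xml2::xml_text(ids),
  stringsAsFactors = FALSE
)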
@sckott I implemented a solution here
It solves the issue of responses that are too big for a single entrez_search. The resulting response object is ~100 MB for the query on "lie detector".
Unfortunately, the solution still fails, this time at entrez_summary, which would create an object of ~150 MB, for reasons as yet unknown.
Is there a way to request only the data we need ("pmcrefcount" and "articleids") when requesting the summary?
i don't know, but will look
- [x] ask David Winter if rentrez allows returning only certain fields (it doesn't look like it's possible, but perhaps I'm missing something) - THIS IS NOT POSSIBLE:
David said you cannot request specific fields.
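Since specific fields can't be requested, one possible workaround might be to page the summary call and keep only the two fields we need. This is only a sketch - it assumes esummary honours retstart/retmax when used with a web_history object (rentrez passes extra arguments through to the API), and it reuses x$web_history and limit from the existing summary step:
# sketch: fetch summaries in chunks of 50 and keep just pmcrefcount / articleids,
# discarding each chunk's full summary object as we go
chunk_size <- 50
readers <- c()
idlist <- list()
for (s in seq(0, limit - 1, by = chunk_size)) {
  sm <- rentrez::entrez_summary(db = "pubmed", web_history = x$web_history,
                                retstart = s, retmax = min(chunk_size, limit - s))
  readers <- c(readers, rentrez::extract_from_esummary(sm, "pmcrefcount"))
  idlist  <- c(idlist, rentrez::extract_from_esummary(sm, "articleids", simplify = FALSE))
}
Whether the per-chunk responses stay small enough for the "lie detector" query would need testing.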