Support large search result sets

Open reece opened this issue 10 years ago • 11 comments

Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/eutils #124 Migrated by bitbucket-issue-migration on 2016-05-25 23:09:02


NCBI's eutilities interface very nicely supports large search result sets by sending results in chunks. The eutils package currently handles only the first chunk.

See http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Demonstration_Programs

Perl excerpt to generate the continuation URLs:

for($retstart = 0; $retstart < $Count; $retstart += $retmax) {
   my $efetch = "$utils/efetch.fcgi?" .
                "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" .
                "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
   # ... fetch $efetch and append the chunk to the output ...
}

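For readers more at home in Python, the same continuation-URL loop can be sketched as a generator. This is a minimal illustration, not code from eutils: the function name efetch_urls, the default retmax of 500, and the default rettype are assumptions made for the example, and the base URL is the public E-utilities endpoint rather than anything taken from this issue.

```python
from urllib.parse import urlencode

# Public E-utilities base URL (assumed here; the Perl excerpt calls it $utils).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_urls(db, query_key, webenv, count, retmax=500, rettype="abstract"):
    """Yield one efetch URL per chunk of a result set stored on the history server.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    # range() steps through 0, retmax, 2*retmax, ... and stops before count,
    # which matches the Perl loop condition $retstart < $Count.
    for retstart in range(0, count, retmax):
        params = {
            "db": db,
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": rettype,
            "retmode": "text",
            "retstart": retstart,
            "retmax": retmax,
        }
        yield f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"
```

Calling list(efetch_urls("pubmed", "1", webenv, count=1200)) would produce three URLs, one each for retstart 0, 500 and 1000; query_key and webenv come from a prior esearch made with history enabled.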
The purpose of this issue is to provide full support for large result sets using webenv histories.

Possible implementation: This seems like an obvious use of Python iterators for results. I'd like to keep eutils.xmlfacades.esearchresults.ESearchResults parsing-only. However, its interface methods are appropriate. So, one implementation is to write an upper-level class (in eutils.esearchresults) that wraps the xmlfacade version, holds a reference to the client, and provides an iterator over results. This upper-level ESearchResults would be passed back to callers in lieu of the xmlfacade version.
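A rough sketch of that wrapper idea follows. It is only an illustration under stated assumptions: the class name PagedSearchResults is hypothetical, and it assumes a client whose esearch accepts retmax and retstart and returns an object with count and ids, which is not the released eutils signature. It also pages with retstart/retmax rather than using webenv histories, so it shows the iterator shape and not the full design.

```python
class PagedSearchResults:
    """Hypothetical upper-level wrapper that lazily iterates over all IDs.

    Assumes client.esearch(db, term, retmax=..., retstart=...) returns an
    object with .count (total hits) and .ids (IDs in this page). That
    signature is an assumption for this sketch.
    """

    def __init__(self, client, db, term, page_size=250):
        self._client = client  # reference to the client, as proposed above
        self._db = db
        self._term = term
        self._page_size = page_size

    def _page(self, retstart):
        # Fetch a single chunk starting at offset retstart.
        return self._client.esearch(
            self._db, self._term, retmax=self._page_size, retstart=retstart
        )

    def __iter__(self):
        retstart = 0
        while True:
            page = self._page(retstart)
            yield from page.ids
            retstart += self._page_size
            # Stop once the offset passes the reported total, or if the
            # server returns an empty page.
            if retstart >= page.count or not page.ids:
                break
```

With something like this, callers could write for pmid in PagedSearchResults(client, "pubmed", term) and requests would be issued only as the loop consumes each page.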

reece avatar May 05 '15 23:05 reece

Is this still on the radar?

moritzschaefer avatar Oct 30 '18 15:10 moritzschaefer

It's certainly still desirable. No ETA. I'm happy to take a PR for this issue.

reece avatar Oct 30 '18 18:10 reece

Ah, this is a real deal-breaker for an otherwise nice package! Although I am glad the package did show the following warning:

WARNING:eutils._internal.client:NCBI found 13241 results, but we truncated the reply at 250 results; see https://github.com/biocommons/eutils/issues/124/

If it is any guidance, a few years ago I made this implementation to deal with the pagination. Anyways, I don't think I'll have the time soon to make a PR with this contribution, but will keep it on my radar.

dhimmel avatar Sep 10 '19 00:09 dhimmel

How do we shut off the warnings? warnings.simplefilter("ignore") is not effective.

leipzig avatar Feb 14 '20 15:02 leipzig

That command suppresses warnings made through the warnings module.

The messages that you're seeing are warnings made through the logging module. There is really no way to suppress those specifically.

If you're running from the command line, the best/easiest workaround is probably to redirect stderr to a separate file (or /dev/null).
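For code run as a script or notebook rather than from a shell, one option worth trying is to raise the threshold of the logger that emits the message. The sketch below uses only the standard logging module; the logger name is copied from the warning text quoted earlier in this thread (eutils._internal.client) and is assumed, not verified, to be the same across eutils versions. The helper name is hypothetical.

```python
import logging

# Logger name as it appears in the quoted warning line; assumed to be stable.
EUTILS_CLIENT_LOGGER = "eutils._internal.client"


def silence_truncation_warning(level=logging.ERROR):
    """Drop records below `level` from the eutils client logger only."""
    logging.getLogger(EUTILS_CLIENT_LOGGER).setLevel(level)


# Call once, before running searches.
silence_truncation_warning()
```

This hides every WARNING from that logger, not just the truncation message, so errors still get through but other warnings from the client would be lost as well.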

reece avatar Feb 17 '20 03:02 reece

Is this repo still maintained?

PazBazak avatar Dec 27 '22 21:12 PazBazak

I don't need eutils in my work at the moment, so I'm not adding new features or fixing bugs. But, I will gladly accept PRs if you have something to contribute.

reece avatar Dec 27 '22 23:12 reece

I tried to add custom parameters "retstart" and "retmax" so that I could loop over the PubMed IDs returned by my search. After five hours I still couldn't make it work, but I am sure that retstart and retmax can be added as custom parameters. In VS Code, you need to ctrl+click on xx.esearch to see the code behind it, which is:

def esearch(self, db, term, retmax=250, retstart=0):
    """query the esearch endpoint
    """
    esr = ESearchResult(self._qs.esearch({"db": db, "term": term}, retmax=retmax, retstart=retstart))


    if esr.count > retmax:
        logger.warning("NCBI found {esr.count} results, but we truncated the reply at {esr.retmax}"
                    " results; see https://github.com/biocommons/eutils/issues/124/".format(esr=esr))
    return esr

And you can ctrl+click on ESearchResult to see the code behind that which is:

class ESearchResult(Base):
    #def __init__(self, xml_string, retmax=250, retstart=0):
        #self._xml_root = ET.fromstring(xml_string)
        #self._retmax = retmax
        #self._retstart = retstart

    _root_tag = "eSearchResult"

    @property
    def count(self):
        return int(self._xml_root.find("Count").text)

    @property
    def retmax(self):
        return int(self._xml_root.find("RetMax").text)

    #@retmax.setter
    #def retmax(self, value):
        #self._retmax = value
        #self._xml_root.find("RetMax").text = str(value)

    @property
    def retstart(self):
        return int(self._xml_root.find("RetStart").text)

    #@retstart.setter
    #def retstart(self, value):
        #self._retstart = value
        #self._xml_root.find("RetStart").text = str(value)

    @property
    def ids(self):
        return [int(id) for id in self._xml_root.xpath("/eSearchResult/IdList/Id/text()")]

    @property
    def webenv(self):
        try:
            return self._xml_root.find("WebEnv").text
        except AttributeError:
            return None
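The properties above read Count, RetMax and RetStart back out of the server's reply, which is why setters on the result object should not be needed: the values are chosen when the request is made. A small standalone sketch of that parsing step, using the standard library's ElementTree instead of whatever XML library eutils uses internally, and a made-up sample reply (the IDs are invented; the count echoes the warning quoted earlier):

```python
import xml.etree.ElementTree as ET

# Illustrative reply shaped like an eSearchResult; the values are not real data.
SAMPLE = """<eSearchResult>
  <Count>13241</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""


def parse_esearch(xml_string):
    """Extract the paging fields and IDs from an eSearchResult document."""
    root = ET.fromstring(xml_string)
    return {
        "count": int(root.findtext("Count")),
        "retmax": int(root.findtext("RetMax")),
        "retstart": int(root.findtext("RetStart")),
        "ids": [int(e.text) for e in root.findall("IdList/Id")],
        # findtext returns None when the element is absent.
        "webenv": root.findtext("WebEnv"),
    }


result = parse_esearch(SAMPLE)
# More pages remain while the last returned offset is short of the total.
more_pages = result["retstart"] + len(result["ids"]) < result["count"]
```

Comparing retstart plus the number of returned IDs against count is one simple way to decide whether another request is needed.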

Here is my code, which tries to pass retmax and retstart as adjustable parameters, hoping to download a large set of articles by looping through the PubMed results:

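i = 0  # start offset; count is the total reported by an initial esearch
pmids = []
while i < count:  # "<" rather than "<=" avoids one extra, empty request
    ai = ec.esearch(db='pubmed', term=search_term, retmax=400, retstart=i)
    pmids.extend(ai.ids)  # keep each page instead of overwriting it
    i += 400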

I hope someone with more experience can put 1 hour into this and solve this issue, which will help so many people like me :) Cheers to this future hero :)

Sdamirsa avatar Oct 06 '23 16:10 Sdamirsa

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jan 05 '24 01:01 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Jan 12 '24 01:01 github-actions[bot]

Just hit this issue myself -- I'm reopening this issue and will get a PR up... sometime.

jsstevenson avatar Apr 04 '24 13:04 jsstevenson