
feature request: live-caching system

Open · abubelinha opened this issue 2 months ago · 1 comment

The current caching system only writes to disk when the script finishes. That could be improved (perhaps as a user option) for some use cases under unreliable network connections:

I came to pygbif today because requests was failing due to network issues on GBIF's side. I finally solved that problem, but in the meantime I decided to give pygbif a try instead of using my own hand-made API requests. BTW, I am not a requests expert, as you can see in that issue.

I was assuming these two things:

  • My silly requests loop has no caching system: if it fails (or if I stop it), it has to re-download everything when I re-run it.
  • I expected the pygbif caching system to speed that situation up a lot: if the script gets killed by an error (or by the user) and I re-run it, all cached requests should go fast (as described in #52) and the loop should quickly reach the point where the previous run was killed (see the sketch below).
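
To make that expectation concrete, this is roughly how I imagined a resumed run behaving (just a sketch; the timing notes in the comments are my assumption, and caching(True) is the call from #52):

import pygbif

pygbif.caching(True)  # enable pygbif's request cache, as described in #52

for name in names:
    # Names already matched in the previous (killed) run should come straight
    # from the on-disk cache and be near-instant; only the names after the
    # crash point should actually hit the network.
    result = pygbif.species.name_backbone(name)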

Now I am surprised because that is not happening.

I made a very simple test with this example file (510 names): https://www.gbif.org/tools/speciesMatching/advancedExample.csv. I just used the tqdm progress bar package combined with a for loop; it reports total time and iterations per second (which in this case means names matched per second). Of course, the speeds below may vary a lot depending on network issues (today is a bad day, as per gbif-api#128).
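
For reference, the timing loop is essentially this (a minimal sketch; the "scientificName" column name is my assumption about that CSV file, adjust it to the real header):

import csv
import pygbif
from tqdm import tqdm

# Read the 510 test names; "scientificName" is an assumed column name here
with open("advancedExample.csv", newline="", encoding="utf-8") as f:
    names = [row["scientificName"] for row in csv.DictReader(f)]

# tqdm reports total elapsed time and iterations per second
# (which here means names matched per second)
for name in tqdm(names):
    result = pygbif.species.name_backbone(name)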

  • The first pygbif matching run (2.96 it./s) is much slower than requests (12 to 15 it./s), and that is not just because of writing things to the cache:
    even without enabling the caching(True) system it is still slow (3.5 it./s). Of course, if the loop finishes (script not killed), a second pygbif run is very fast, since there is nothing left to download (~180 it./s).
  • The disappointing thing is when network errors occur: if the script does not finish, the cache is never written to disk. So if I re-run the script, it goes at exactly the same slow speed, about 4 times slower than using the requests package.

Maybe there is already an option (that I don't know about) to tell pygbif to "write the cache right now"?

If not, it might be worth modifying the pygbif cache system so it optionally writes to disk after each API call. That would probably make it a bit slower, but no work would be lost if the script is killed for some reason before it ends normally.
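
In the meantime, a workaround along these lines might already give that behavior, bypassing pygbif's own caching and calling the API through requests_cache directly (just a sketch, and it assumes the sqlite backend persists each response as soon as it is stored, which is my understanding of that library):

import requests_cache

# Every GET goes into an on-disk SQLite file; responses should be written
# as they arrive, not when the script exits, so a killed run loses nothing.
session = requests_cache.CachedSession("gbif_cache", backend="sqlite")

for name in ["name1", "name2"]:  # ... the full list of names
    r = session.get("https://api.gbif.org/v1/species/match", params={"name": name})
    data = r.json()  # on a re-run, cached names are replayed from disk instantly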

The other thing to check is why pygbif is 4 times slower than requests, but that would be a different issue.

This is basically what I do in my loops:

import pygbif
import requests

for name in ['name1', 'name2']:  # ... the full list of names
    result = pygbif.species.name_backbone(name)  # returns a dictionary

for name in ['name1', 'name2']:  # ... the full list of names
    r = requests.get("https://api.gbif.org/v1/species/match?name={}".format(name))
    data = r.json()  # then I parse the JSON response to get a dictionary

Has anyone else noticed this difference in their speeds?

abubelinha · May 01 '24 18:05