
More informative error message when resolving DOI URLs


Hi, we have an open PR for the YASA library at raphaelvallat/yasa#192 that transitions from shipping test files with our repository to downloading them with Pooch. Everything is working well and all tests now pass (Python 3.9-3.12 on Ubuntu, macOS, and Windows). However, the initial runs failed on 2 of the 12 systems (3.11 and 3.12 on macOS). The failing tests passed when we reran them 24 hours later without changing any code, so I presume it was a temporary connection- or server-related issue.

The error raised during the tests came from L648-655 in downloaders.py, where Pooch tries to resolve the DOI, even though the DOI exists and resolved fine in other runs.

ValueError: 'Archive with doi:10.5281/zenodo.14564284 not found (see https://zenodo.org/doi/10.5281/zenodo.14564284). Is the DOI correct?'
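For reference, the relevant logic is roughly this (my paraphrase of downloaders.py, not the exact source):

```python
import requests

# Paraphrase of the check around L648-655 in downloaders.py; the actual
# code may differ slightly. Every status code from 400 to 599 is collapsed
# into the same "Is the DOI correct?" message.
def doi_to_url(doi):
    """Resolve a DOI to the URL of the repository landing page."""
    response = requests.get(f"https://doi.org/{doi}")
    url = response.url
    if 400 <= response.status_code < 600:
        raise ValueError(
            f"Archive with doi:{doi} not found (see {url}). Is the DOI correct?"
        )
    return url
```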

I have 2 questions:

  1. Have you had similar experiences when running CI tests with Pooch downloads? If so, do you have any general advice on how best to handle potential connection issues when downloading files during CI tests (e.g., caching)?
  2. Would it be possible to make that DOIDownloader error message more informative? Currently it clusters all status codes 400-600 into the assumption that the DOI was wrong. I think it would be helpful if the response code itself was provided in the error message to help the user determine why the DOI was not resolved (e.g., client vs server issues). I would be willing to submit a small PR if this makes sense to add.

remrama avatar Dec 29 '24 21:12 remrama

I encountered the same issue. I had to bypass the DOI and download the files manually. However, I'm not sure if this is a temporary issue related to the server or connection. I will try again tomorrow to confirm. I hope the developers will address and resolve this issue as soon as possible.

ophrys97 avatar Dec 31 '24 08:12 ophrys97

Hi @remrama. Thanks for opening this issue.

It's not rare to encounter failed downloads when running tests on CI. Most of the time this is related to server-side issues, usually because we trigger multiple downloads in parallel. My pragmatic approach is to split jobs in Actions so I can rerun the failed ones without having to rerun all the tests. Another approach would be to include some sleeps between jobs so they are not triggered at the same time, although tests will take a little longer to run (I haven't experimented with that, but there's a sketch of the idea below).
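If someone wants to try the sleep idea, here's a minimal sketch (assuming pytest on GitHub Actions; the 30-second cap is arbitrary):

```python
# conftest.py -- stagger parallel CI jobs with random jitter so they don't
# all hit doi.org / the repository at the same moment. GitHub Actions sets
# the CI environment variable automatically.
import os
import random
import time

if os.environ.get("CI"):
    time.sleep(random.uniform(0, 30))  # up to 30 s of jitter per job
```

The `retry_if_failed` argument to `pooch.create` is also worth a look: it makes Pooch retry a failed download, which can paper over transient server errors.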

  1. Would it be possible to make that DOIDownloader error message more informative? Currently it clusters all status codes 400-600 into the assumption that the DOI was wrong. I think it would be helpful if the response code itself was provided in the error message to help the user determine why the DOI was not resolved (e.g., client vs server issues). I would be willing to submit a small PR if this makes sense to add.

I totally agree that we should improve the error messages. Grouping together all codes from 400-600 is not the best approach, and printing out the status code would also make errors much easier to troubleshoot. Should we create categories for different error codes? For example, status codes like 403, 404, 429, 502, 503, and 504 could get dedicated error messages, while a generic template covers the rest of the 400-600 range. Would love to hear ideas to improve this! A strawman sketch follows below.
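As a strawman for discussion (nothing here exists in Pooch yet; all the names are made up):

```python
# Hypothetical sketch: dedicated messages for common status codes, plus a
# generic template that still reports the code for everything else in 400-599.
SPECIFIC = {
    403: "Access to '{url}' is forbidden (HTTP 403). The archive may be restricted or embargoed.",
    404: "Archive with doi:{doi} not found at '{url}' (HTTP 404). Is the DOI correct?",
    429: "Too many requests to '{url}' (HTTP 429). Wait a bit and try again.",
    502: "Bad gateway at '{url}' (HTTP 502). The server may be temporarily down.",
    503: "Service unavailable at '{url}' (HTTP 503). The server may be temporarily down.",
    504: "Gateway timeout at '{url}' (HTTP 504). The server may be temporarily down.",
}
GENERIC = (
    "Resolving doi:{doi} through '{url}' failed with HTTP status {code}. "
    "Codes 400-499 usually indicate a problem with the request (e.g., a "
    "wrong DOI); codes 500-599 indicate a server-side problem (try again later)."
)

def doi_error_message(doi, url, code):
    """Build an error message for a failed DOI resolution (hypothetical)."""
    template = SPECIFIC.get(code, GENERIC)
    return template.format(doi=doi, url=url, code=code)
```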

BTW, if anyone wants to work on this, please let us know! We can decide on some ideas and then anyone can open a PR to implement them.

santisoler avatar Jan 08 '25 19:01 santisoler

Thanks @santisoler, it's nice to hear that the CI errors are somewhat expected and don't cause major disruptions to your workflow. That makes me comfortable moving forward with them as-is in our own package. We may adopt workarounds for the occasional failures in the future (like sleeping between jobs; thanks for the suggestion).

I think clustering the response codes into high-level categories makes sense to keep the message approachable for casual users. I also can't think of a good reason not to include the individual status code each time (except maybe to avoid an overly complex message). I was imagining a few high-level string templates that all take the code as a parameter; at least for me, that would take a lot of the mystery out, both during usage and during development.

remrama avatar Jan 22 '25 23:01 remrama

Note that my issues here might have been largely driven by the use of a DOIDownloader and/or load_registry_from_doi. I basically never passed all 12 system tests at once when I was grabbing our test data from a Zenodo repository using a DOI URL and populating the registry from it. But when I saved the registry in a local file instead, there were no connection issues and the tests passed (and ran faster, too).

This might not be surprising, since load_registry_from_doi calls the repository API to fill the registry, which of course means more requests and more time. But I wanted to note here that this difference can make or break some use cases like mine; both setups are sketched below.
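For anyone landing here, this is roughly the difference between the two setups I tried (the cache name and registry file path are just illustrative; the DOI is the one from our PR):

```python
import pooch

# Setup 1: populate the registry from the DOI repository's API.
# This queries doi.org plus the Zenodo API, and failed intermittently on CI.
pup = pooch.create(
    path=pooch.os_cache("yasa"),
    base_url="doi:10.5281/zenodo.14564284/",
    registry=None,
)
pup.load_registry_from_doi()

# Setup 2: ship a plain-text registry ("<filename> <hash>" per line) with
# the package. No network requests happen until a file is actually fetched.
pup = pooch.create(
    path=pooch.os_cache("yasa"),
    base_url="doi:10.5281/zenodo.14564284/",
    registry=None,
)
pup.load_registry("yasa/tests/registry.txt")  # hypothetical path
```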

remrama avatar Mar 05 '25 05:03 remrama

Is it possible there are more calls to the DOI resolver than necessary? I think the DOI is resolved to determine which repository (figshare, zenodo, etc.) to use every time fetch is called. This might not be an issue for local projects where downloads are rare, but for my unit-test situation it might be a burden.

If a pup's base_url is a DOI URL, then a request goes to doi.org every time a file needs to be downloaded: pup.get_url() returns a DOI URL (e.g., doi:10.1234/zenodo.12345678/testfile.txt), choose_downloader() returns a DOIDownloader, and DOIDownloader.__call__() kicks off a cascade (DOIDownloader.__call__() --> doi_to_repository() --> doi_to_url() --> requests.get()) that queries doi.org to get the name of the repository (figshare, zenodo, etc.), compiles an HTTP URL, and ultimately hands off to HTTPDownloader.

That seems like a lot just to get the download URL, when those URLs could all be stored under the pup.urls attribute at once when pup.load_registry_from_doi() is called. In that case, the DOIDownloader would run only once to build all the HTTP URLs, and from then on the pup would just use HTTPDownloader whenever fetch() is called, which is simpler, faster, and makes fewer web requests.
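Very roughly, something like this (a sketch based on my reading of the internals; I'm guessing at how doi_to_repository and the repository objects fit together, so take the names with a grain of salt):

```python
# Rough sketch, not a working patch: resolve the DOI once, then store a
# direct HTTP URL for every file so that later fetch() calls skip doi.org
# and go straight to HTTPDownloader (fetch() checks self.urls before
# falling back to base_url).
def load_registry_from_doi(self):
    doi = self.base_url.replace("doi:", "").strip("/")
    repository = doi_to_repository(doi)   # single request to doi.org
    repository.populate_registry(self)    # single request to the repo API
    for fname in self.registry:
        self.urls[fname] = repository.download_url(fname)
```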

I think this could be a primary cause of the server-side failures: not too many requests to the data repository, but too many requests to doi.org for resolving. Still, it would be easier to tell with a more informative error message (the original purpose of this issue).

remrama avatar Mar 05 '25 17:03 remrama

Hi @remrama! Yes, I agree that there are too many unnecessary requests when fetching multiple files from a DOI. All the requests to doi.org with the same DOI to get the URL for each file in the repository are totally unnecessary.

I like your approach to fixing this. I'm not thrilled that, if we fill the registry manually, that Pooch object will still make requests to doi.org and the DOI service provider for each file in the registry (even when they all belong to the same DOI repo). But I think we can live with that for now.

So, for next steps, I think we can split these tasks into two PRs:

  • Produce more informative error messages when a request returns an error status code.
  • Fill the urls dictionary when calling load_registry_from_doi, so we avoid hitting doi.org and the DOI service provider every time a file gets fetched.

santisoler avatar Mar 18 '25 19:03 santisoler