Boosting Neurovault downloader's performance

Open RaphaelMeudec opened this issue 4 years ago • 2 comments

This issue is mostly a report on my investigation into improving the performance of the fetch_neurovault function.

Here are some sample download times:

  • 1.5 minutes for 4 small collections, 20 images in total
  • 12 hours to download IBC (~9500 images)
  • 24 days to download all Neurovault (~460k images)
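
As a quick sanity check, the three figures above can be converted to a per-image cost. This is a small calculation using only the numbers quoted in the list; the image counts are approximate, so the results are too.

```python
# Per-image download time implied by the figures quoted above.
samples = {
    "4 small collections": (1.5 * 60, 20),          # (total seconds, images)
    "IBC": (12 * 3600, 9_500),
    "all Neurovault": (24 * 24 * 3600, 460_000),
}


def seconds_per_image(total_seconds, n_images):
    """Average wall-clock time spent per downloaded image."""
    return total_seconds / n_images


per_image = {name: seconds_per_image(*value) for name, value in samples.items()}
for name, value in per_image.items():
    print(f"{name}: {value:.1f} s/image")
```

All three come out at about 4.5 s per image, which suggests the cost scales with the number of requests rather than with anything specific to a collection.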

So I looked into how we could improve these download times.

TLDR

  • parallelizing the requests works well and could be a simple way to get quick gains
  • using /collections/{collection_id}/download does not help, as we would still need to download each image's metadata through the /api/collections/{collection_id}/images/{image_id} route

In detail

Leveraging the /collections/{collection_id}/download API endpoint

This endpoint downloads all images from a collection as a single ZIP file. I tested it on the 4 collections (20 images in total): it speeds up the download (55 s vs 98 s) because we no longer make multiple requests to the Neurovault API, which is the bottleneck, but it does not fetch the metadata associated with each image.
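
For illustration, here is a minimal sketch of what using this endpoint could look like with only the standard library. The base URL and the helper names (collection_zip_url, extract_archive, download_collection) are assumptions made for this example; only the /collections/{collection_id}/download route comes from the text above, and this is not the code from my experiments.

```python
import io
import urllib.request
import zipfile
from pathlib import Path

# Assumption: the site root. Only the route itself is taken from the issue text.
BASE_URL = "https://neurovault.org"


def collection_zip_url(collection_id, base_url=BASE_URL):
    """Build the URL of the ZIP archive for one collection."""
    return f"{base_url}/collections/{collection_id}/download"


def extract_archive(zip_bytes, target_dir):
    """Unpack an in-memory ZIP archive and return the names of its members."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target)
        return archive.namelist()


def download_collection(collection_id, target_dir, timeout=60):
    """Fetch a whole collection with one request; no image metadata is retrieved."""
    url = collection_zip_url(collection_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_archive(response.read(), target_dir)
```

Note that this sketch reads the whole archive into memory and has no way to resume, so it would not be suitable as-is for large collections.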

We could imagine using it for users who only want the images and not the metadata, but I'm not sure this is a common need. It would also raise the issue of failed archive downloads, which could be a problem for large collections since we don't want users to have to restart the download from scratch.

Parallel request calls

I've tried 2 parallelization approaches to check whether any gain can be expected:

    1. joblib + requests: simple requests made parallel using joblib
    2. asyncio: async request calls with a semaphore to limit the number of simultaneous calls
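
The semaphore pattern from the second approach can be sketched as follows. This is a minimal illustration using only the standard library, not the exact code from my experiments: fetch_all and its fetch argument are hypothetical names, and fetch stands for any coroutine function that downloads one URL (for instance one built on an async HTTP client).

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=4):
    """Run `fetch(url)` for every URL, with at most `max_concurrent` in flight."""
    # Created inside the coroutine so it belongs to the running event loop.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        # Each task waits here until one of the `max_concurrent` slots is free.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))
```

It would be called with something like asyncio.run(fetch_all(urls, fetch, max_concurrent=4)).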

The first observation is that parallelization works, and it does not seem to cause any drop in network throughput (25s for 4 jobs vs 95s for 1 job).

The second is that asyncio offers no performance benefit but adds a lot of complexity (async Python dependencies). Using joblib, which is already a dependency, seems to be the way to integrate this.
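
A minimal sketch of the joblib variant, assuming a thread-based backend since the work is network-bound. fetch_in_parallel and fetch_one are hypothetical names for this example; fetch_one would be whatever function downloads a single image and its metadata.

```python
from joblib import Parallel, delayed


def fetch_in_parallel(items, fetch_one, n_jobs=4):
    """Call `fetch_one(item)` for each item using up to `n_jobs` workers.

    Results are returned in the same order as `items`.
    """
    # prefer="threads" asks joblib for a thread-based backend: the work is
    # I/O-bound, and threads avoid having to pickle `fetch_one`.
    return Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(fetch_one)(item) for item in items
    )
```

Here n_jobs also acts as the cap on simultaneous requests to the server.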

Code

All code is available at raphaelmeudec/neurovault-downloader.

RaphaelMeudec avatar Mar 28 '22 09:03 RaphaelMeudec

thanks for looking into this!

  • the zip download did not work for large collections last time I checked (~1 year ago): whether through the API or the website, it stopped halfway through and reported an internal server error. Plus the metadata issue you mentioned
  • re. concurrent downloads, indeed I think using processes or threads with joblib will be easier than adding a dependency to aiohttp. But we need to make sure we don't overload the server; does the API documentation provide any explicit limits for the rate of requests or image downloads?

I am also wondering if a client for the neurovault API really has its place in nilearn. we could keep one function to fetch the specific image used in some examples (the fetch_neurovault_motor_task); but given the size of neurovault and the quirks of its API I think a full-fledged client is outside the scope of nilearn and would make more sense as a separate tool (possibly maintained by the maintainers of neurovault itself?)

jeromedockes avatar Mar 28 '22 12:03 jeromedockes

  • re. concurrent downloads, indeed I think using processes or threads with joblib will be easier than adding a dependency to aiohttp. But we need to make sure we don't overload the server; does the API documentation provide any explicit limits for the rate of requests or image downloads?

I looked through the Neurovault API website and Neurovault.org in general but couldn't find an answer to that question.

I am also wondering if a client for the neurovault API really has its place in nilearn. we could keep one function to fetch the specific image used in some examples (the fetch_neurovault_motor_task); but given the size of neurovault and the quirks of its API I think a full-fledged client is outside the scope of nilearn and would make more sense as a separate tool (possibly maintained by the maintainers of neurovault itself?)

There is actually a neurovault/pyneurovault repository that had been inactive since 2017 until 5 days ago! I am not sure how well it works: there is an open issue about supporting Python 3, but also a merged PR for it. I might do some testing and post the results here.

RaphaelMeudec avatar Apr 01 '22 15:04 RaphaelMeudec