nilearn Boosting Neurovault downloader's performance

This issue is more of a report on investigations I did to improve the performance of the fetch_neurovault method.

Here are some sample download times:

- 1.5 minutes for 4 small collections (20 images in total)
- 12 hours to download IBC (~9500 images)
- 24 days to download all of Neurovault (~460k images)
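As a quick back-of-the-envelope check (a sketch based only on the approximate figures above), all three timings work out to roughly 4.5 seconds per image, which suggests the download time grows about linearly with the number of images:

```python
# Approximate timings quoted above: (total seconds, number of images).
# The image counts for IBC and Neurovault are rounded estimates.
timings = {
    "4 small collections": (1.5 * 60, 20),
    "IBC": (12 * 3600, 9_500),
    "all of Neurovault": (24 * 86_400, 460_000),
}

# Average wall-clock time spent per image in each scenario.
per_image = {name: total / n for name, (total, n) in timings.items()}

for name, seconds in per_image.items():
    print(f"{name}: {seconds:.2f} s per image")
```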

So I looked a bit into how we can improve those downloading times.

TLDR

- Parallelizing the requests works well and would be a simple way to get quick gains.
- Using `/collections/{collection_id}/download` does not help, because we still need to download each image's metadata through the `/api/collections/{collection_id}/images/{image_id}` route.

In detail

Leveraging the `/collections/{collection_id}/download` API endpoint

This endpoint downloads all images of a collection as a single ZIP file. I tested it on the 4 collections (20 images in total): it speeds up the download (55 s vs 98 s), since we no longer make multiple requests to the Neurovault API, which is the bottleneck. However, it does not fetch the metadata associated with each image.

We could imagine using it for users who only want the images and not the metadata, but I'm not sure how common that is. It would also raise the issue of failed archive downloads, which could be problematic for large collections, since we don't want users to have to restart the download from scratch.

Parallel request calls

I tried 2 parallelization approaches to check whether any gain can be expected:

1. joblib + requests: plain requests calls run in parallel with joblib
2. asyncio: async request calls, with a semaphore limiting the number of simultaneous calls

The first observation is that parallelization works, and the network does not seem to slow down because of it (25 s for 4 jobs vs 95 s for 1 job, roughly a 3.8x speedup).

The second is that asyncio offers no performance benefit but adds a lot of complexity (async Python dependencies). Using joblib, which is already a dependency, seems to be the way to integrate this.
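The two approaches compared above can be sketched as follows. This is a minimal illustration, not the code from the linked repository: it uses only the standard library, with `concurrent.futures.ThreadPoolExecutor` standing in for joblib, and `time.sleep` / `asyncio.sleep` standing in for the actual HTTP calls. The names `fake_fetch`, `fetch_all_threads` and `fetch_all_async` are hypothetical, and the 0.05 s delay is arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(image_id, delay=0.05):
    """Stand-in for one blocking HTTP request (hypothetical, no network)."""
    time.sleep(delay)
    return {"id": image_id}


def fetch_all_threads(image_ids, n_jobs=4):
    """Approach 1: run blocking calls in a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() returns results in the same order as image_ids.
        return list(pool.map(fake_fetch, image_ids))


async def fetch_all_async(image_ids, max_concurrent=4, delay=0.05):
    """Approach 2: coroutines gated by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(image_id):
        # At most max_concurrent coroutines are inside this block at once.
        async with semaphore:
            await asyncio.sleep(delay)  # stand-in for an async HTTP call
            return {"id": image_id}

    # gather() also returns results in the order of its arguments.
    return await asyncio.gather(*(fetch_one(i) for i in image_ids))


if __name__ == "__main__":
    ids = list(range(20))

    start = time.perf_counter()
    sequential = [fake_fetch(i) for i in ids]
    print(f"sequential: {time.perf_counter() - start:.2f} s")

    start = time.perf_counter()
    threaded = fetch_all_threads(ids, n_jobs=4)
    print(f"4 threads:  {time.perf_counter() - start:.2f} s")

    start = time.perf_counter()
    concurrent = asyncio.run(fetch_all_async(ids, max_concurrent=4))
    print(f"asyncio:    {time.perf_counter() - start:.2f} s")
```

Both variants cap the number of in-flight calls at 4 and return the same results. With joblib, the thread-based variant would presumably become a `Parallel(n_jobs=4, prefer="threads")` call over `delayed(...)` tasks.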

Code

All code is available at raphaelmeudec/neurovault-downloader.

Mar 28 '22 09:03 RaphaelMeudec

thanks for looking into this!

the zip download did not work for large collections last time I checked (~1 year ago): whether through the API or the website, it stopped halfway through and reported an internal server error. plus the metadata issue you mentioned
re. concurrent downloads, indeed I think using processes or threads with joblib will be easier than adding a dependency to aiohttp. But we need to make sure we don't overload the server; does the API documentation provide any explicit limits for the rate of requests or image downloads?

I am also wondering if a client for the neurovault API really has its place in nilearn. we could keep one function to fetch the specific image used in some examples (the fetch_neurovault_motor_task); but given the size of neurovault and the quirks of its API I think a full-fledged client is outside the scope of nilearn and would make more sense as a separate tool (possibly maintained by the maintainers of neurovault itself?)

Mar 28 '22 12:03 jeromedockes

> re. concurrent downloads, indeed I think using processes or threads with joblib will be easier than adding a dependency to aiohttp. But we need to make sure we don't overload the server; does the API documentation provide any explicit limits for the rate of requests or image downloads?

I looked into the Neurovault API website and Neurovault.org in general, but couldn't find an answer to that question.

> I am also wondering if a client for the neurovault API really has its place in nilearn. we could keep one function to fetch the specific image used in some examples (the fetch_neurovault_motor_task); but given the size of neurovault and the quirks of its API I think a full-fledged client is outside the scope of nilearn and would make more sense as a separate tool (possibly maintained by the maintainers of neurovault itself?)

There is actually neurovault/pyneurovault, which had been inactive since 2017 until 5 days ago! I am not sure how well it works: there is an open issue about supporting Python 3, but also a merged PR for it. I might do some testing and post the results here.

Apr 01 '22 15:04 RaphaelMeudec
