openverse-api
Allow entire dataset to be downloaded en masse
Description
Presently, if users want our entire dataset, they must crawl through all possible searches in hopes of pulling up every result we have. We've discussed this in the past, but it would be ideal to have a bulk download option available for those who would like to use the entire dataset (e.g. iNaturalist's dataset: https://github.com/inaturalist/inaturalist-open-data).
This could be publicly accessible Parquet or TSV files on S3, or some other means of pulling the entire dataset.
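To make the TSV option concrete, here is a rough sketch of how records might be written out as compressed TSV shards before being published. It is only an illustration under stated assumptions: the column names in `FIELDS`, the `write_tsv_shards` helper, and the shard naming scheme are all hypothetical, and the actual fields would need to be decided as part of planning this feature.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterable

# Hypothetical column set, chosen only for illustration.
FIELDS = ["identifier", "title", "creator", "license", "url"]


def write_tsv_shards(
    records: Iterable[dict], out_dir: Path, rows_per_shard: int = 100_000
) -> list[Path]:
    """Write records to gzip-compressed TSV shards and return the shard paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    handle = None
    writer = None
    try:
        for i, record in enumerate(records):
            # Start a new shard every `rows_per_shard` rows.
            if i % rows_per_shard == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"openverse-{len(shards):05d}.tsv.gz"
                handle = gzip.open(path, "wt", newline="", encoding="utf-8")
                writer = csv.DictWriter(
                    handle,
                    fieldnames=FIELDS,
                    delimiter="\t",
                    extrasaction="ignore",  # drop keys that are not in FIELDS
                )
                writer.writeheader()
                shards.append(path)
            writer.writerow(record)
    finally:
        if handle is not None:
            handle.close()
    return shards
```

Sharding keeps each file small enough to download and process independently. Uploading the resulting files to a public bucket is a separate step and is not shown here.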
Implementation
- [ ] 🙋 I would be interested in implementing this feature.
I want to work on this feature. @AetherUnbound @dhruvkb could you assign this to me?
Hi @MallikharjunaTeja! Thanks for offering your assistance 🙂 Before work proceeds on this, we need a plan fleshed out for what these bulk downloads would look like. How will the files be generated from our system? Would there need to be coordination with the Openverse Catalog, since we would likely need a scheduled DAG in order to run this? What fields and/or models would we include and exclude? I think this project will ultimately need an RFC written for it; you can find instructions and examples here: https://github.com/WordPress/openverse/tree/main/rfcs. The maintainers group currently doesn't have this slated for our near-term priorities, but if you would like to go ahead and give this a shot, please feel free! We're happy to assist you and answer any questions you might have, particularly over in the Make WP Slack #openverse channel. Please let me know if you'd like to take on this work and I'll assign the issue to you.
Alternatively, we have a large number of issues across our repos which are marked as "good first issues". These are issues we felt would be easy for a new contributor to jump into. If you're looking to contribute to the project in general, I encourage you to take a look at the list here. We'd be happy to assign any one of those issues to you as well 😄