openverse-api
Consider hiding Flickr behind the 'mature' flag
Problem
Currently, Flickr is the only source that has a risk of containing nudity or sexually explicit content that isn't properly identified. Wikimedia also has rare instances of gore (in the context of wilderness) that could be considered here.
Description
We could automatically hide these entire sources as mature until we have some kind of assurance that their images are indeed not mature. To supplement this change, we might want to add an additional content report type for marking an image as incorrectly labeled mature.
Consequences
- We would have many, many fewer results on all searches; Flickr currently makes up something like 2/3 of our data set.
Is there an alternative approach, like more aggressively sorting Flickr results by some account metric, such as popularity + number of uploads + activity? It'd be a shame to remove the largest repository of CC images from our search by default :disappointed:
That's true, we could always devalue Flickr in the popularity calculations! That might be an easy way to emphasize other sources without removing Flickr from results entirely.
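As a rough illustration of the devaluing idea, the sketch below multiplies a result's popularity score by a per-provider weight before sorting. The weight value, the `popularity` and `provider` field names, and the `weighted_popularity` helper are all hypothetical; none of this reflects how Openverse actually computes popularity.

```python
# Hypothetical per-provider weights; providers not listed keep a weight of 1.0.
PROVIDER_WEIGHTS = {"flickr": 0.5}


def weighted_popularity(popularity, provider, weights=PROVIDER_WEIGHTS):
    """Scale a popularity score by the weight assigned to its provider."""
    return popularity * weights.get(provider, 1.0)


# Made-up results, only to show the effect on ordering.
results = [
    {"id": "a", "provider": "flickr", "popularity": 0.8},
    {"id": "b", "provider": "wikimedia", "popularity": 0.6},
]

# Sort from highest to lowest weighted score.
ranked = sorted(
    results,
    key=lambda r: weighted_popularity(r["popularity"], r["provider"]),
    reverse=True,
)
```

With these made-up numbers the Wikimedia result ranks above the Flickr one even though its raw popularity is lower, and the Flickr result still appears in the list.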
I think treating Flickr as a single "block" of results is also potentially restrictive. There are GLAM-quality institutions that post on Flickr, for example. If we could treat each Flickr account individually as a "source" (with a provider as Flickr) would that allow us more flexibility in being able to still promote very high quality Flickr results without punishing them for being on the same platform as lower quality or difficult-to-recommend content?
> If we could treat each Flickr account individually as a "source" (with a provider as Flickr) would that allow us more flexibility in being able to still promote very high quality Flickr results without punishing them for being on the same platform as lower quality or difficult-to-recommend content?
We do some of this already! And more would be an excellent way to surface the best content Flickr has to offer.
Here's where we define Flickr 'sub providers':
https://github.com/WordPress/openverse-catalog/blob/3034e31f6f204c641a89f7acbcbb85f026b9c9c3/openverse_catalog/dags/common/loader/provider_details.py#L43-L56
# Flickr parameters
FLICKR_SUB_PROVIDERS = {
    "nasa": {
        "24662369@N07",  # NASA Goddard Photo and Video
        "35067687@N04",  # NASA HQ PHOTO
        "29988733@N04",  # NASA Johnson
        "28634332@N05",  # NASA's Marshall Space Flight Center
        "108488366@N07",  # NASAKennedy
        "136485307@N06",  # Apollo Image Gallery
    },
    "bio_diversity": {"61021753@N02"},  # BioDivLibrary
    "spacex": {"130608600@N05"},  # Official SpaceX Photos
    "woc_tech": {"136629440@N06"},  # WOCinTech Chat
}
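To make the mapping concrete, here is a minimal sketch of how a Flickr account ID could be resolved to a sub-provider name, falling back to plain `flickr` when the account is not listed. The `resolve_source` function and its `default` argument are hypothetical and are not taken from the catalog code; only the dictionary is copied from the snippet above.

```python
# Copied from the snippet above.
FLICKR_SUB_PROVIDERS = {
    "nasa": {
        "24662369@N07",  # NASA Goddard Photo and Video
        "35067687@N04",  # NASA HQ PHOTO
        "29988733@N04",  # NASA Johnson
        "28634332@N05",  # NASA's Marshall Space Flight Center
        "108488366@N07",  # NASAKennedy
        "136485307@N06",  # Apollo Image Gallery
    },
    "bio_diversity": {"61021753@N02"},  # BioDivLibrary
    "spacex": {"130608600@N05"},  # Official SpaceX Photos
    "woc_tech": {"136629440@N06"},  # WOCinTech Chat
}


def resolve_source(creator_id, sub_providers=FLICKR_SUB_PROVIDERS, default="flickr"):
    """Return the sub-provider whose account set contains creator_id.

    Hypothetical helper: accounts that are not listed fall back to `default`.
    """
    for name, account_ids in sub_providers.items():
        if creator_id in account_ids:
            return name
    return default
```

Under this sketch, `resolve_source("130608600@N05")` gives `"spacex"`, while an unlisted account ID gives `"flickr"`.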
Side note: It'd be awesome to have FLICKR_SUB_PROVIDERS in Airflow or the Django Admin so we could update this list dynamically and extract new subproviders in new DAG runs without deployment.
If we did it in Django we could use Elastic's boosting query functionality to both promote accounts that have lots of high quality CC content and also demote problematic accounts that we don't want to hide entirely.
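For reference, an Elasticsearch `boosting` query takes a `positive` query, a `negative` query, and a `negative_boost` multiplier between 0 and 1 that is applied to documents matching the negative query. The sketch below builds such a request body as a plain Python dict. The `title` and `source` field names, the boost values, and the `build_boosted_query` helper are assumptions for illustration, not the actual Openverse index mapping or API code.

```python
def build_boosted_query(search_term, promoted_sources, demoted_sources,
                        negative_boost=0.3):
    """Build a search body that promotes some sources and demotes others.

    Field names ("title", "source") and boost values are assumed.
    """
    positive = {
        "bool": {
            # Every result must match the search term.
            "must": [{"match": {"title": search_term}}],
            # Results from promoted sources get an extra score contribution.
            "should": [
                {"terms": {"source": sorted(promoted_sources), "boost": 2.0}}
            ],
        }
    }
    return {
        "query": {
            "boosting": {
                "positive": positive,
                # Results from demoted sources stay in the result set, but
                # their score is multiplied by negative_boost.
                "negative": {"terms": {"source": sorted(demoted_sources)}},
                "negative_boost": negative_boost,
            }
        }
    }


# "example_demoted_account" is a placeholder, not a real sub-provider.
body = build_boosted_query("moon", {"nasa", "spacex"}, {"example_demoted_account"})
```

The point of the `boosting` query here is that demoted accounts are ranked lower rather than filtered out, which matches the goal of not hiding them entirely.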
> Side note: It'd be awesome to have FLICKR_SUB_PROVIDERS in Airflow or the Django Admin so we could update this list dynamically and extract new subproviders in new DAG runs without deployment.
I think the current setup is OK, because (down the line) the DAGs will be refreshed every minute or every 5 minutes. Updates to DAGs will reach production pretty quickly! Based on how subproviders work, I'm not sure we'd be able to have them exclusively in Django. Subproviders are reflected in the source column of the catalog and, I believe, the API, so we'd have to modify where we're storing that info.
^ That's a great point! Making DAG code changes much more frequent and easy is a far more flexible approach.