
Consider hiding Flickr behind the 'mature' flag

Open zackkrida opened this issue 3 years ago • 8 comments

Problem

Currently, Flickr is the only source that has a risk of containing nudity or sexually explicit content that isn't properly identified. There is also rare gore (in the context of wilderness) from Wikimedia that could be considered here.

Description

We could automatically mark these entire sources as mature, and so hide them by default, until we have some kind of assurance that their images are indeed not mature. To supplement this change, we might want to add a content report type for flagging an image as incorrectly labeled mature.

Consequences

  • We would have many, many fewer results on all searches; Flickr currently makes up something like 2/3 of our data set.

zackkrida avatar Mar 24 '22 20:03 zackkrida

Is there an alternative approach, like more aggressively sorting Flickr results by some account metric? Like popularity + number of uploads + activity, etc.? It'd be a shame to remove the largest repository of CC images from our search by default :disappointed:

sarayourfriend avatar Mar 25 '22 11:03 sarayourfriend

That's true, we could always devalue Flickr in the popularity calculations! That might be an easy way to emphasize other sources without removing Flickr from results entirely.

AetherUnbound avatar Mar 25 '22 16:03 AetherUnbound

I think treating Flickr as a single "block" of results is also potentially restrictive. There are GLAM-quality institutions that post on Flickr, for example. If we could treat each Flickr account individually as a "source" (with a provider as Flickr) would that allow us more flexibility in being able to still promote very high quality Flickr results without punishing them for being on the same platform as lower quality or difficult-to-recommend content?

sarayourfriend avatar Mar 25 '22 17:03 sarayourfriend

If we could treat each Flickr account individually as a "source" (with a provider as Flickr) would that allow us more flexibility in being able to still promote very high quality Flickr results without punishing them for being on the same platform as lower quality or difficult-to-recommend content?

We do some of this already! And more would be an excellent way to surface the best content Flickr has to offer.

Here's where we define Flickr 'sub providers':

https://github.com/WordPress/openverse-catalog/blob/3034e31f6f204c641a89f7acbcbb85f026b9c9c3/openverse_catalog/dags/common/loader/provider_details.py#L43-L56

# Flickr parameters
FLICKR_SUB_PROVIDERS = {
    "nasa": {
        "24662369@N07",  # NASA Goddard Photo and Video
        "35067687@N04",  # NASA HQ PHOTO
        "29988733@N04",  # NASA Johnson
        "28634332@N05",  # NASA's Marshall Space Flight Center
        "108488366@N07",  # NASAKennedy
        "136485307@N06",  # Apollo Image Gallery
    },
    "bio_diversity": {"61021753@N02"},  # BioDivLibrary
    "spacex": {"130608600@N05"},  # Official SpaceX Photos
    "woc_tech": {"136629440@N06"},  # WOCinTech Chat
}
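
To illustrate how a mapping like this could be used to treat individual accounts as their own "source", here is a minimal sketch. It is not the catalog's actual code: `OWNER_TO_SOURCE`, `DEFAULT_SOURCE`, and `resolve_source` are hypothetical names, and the dict below is only an excerpt of the one above.

```python
# Excerpt of the mapping shown above (not the full list).
FLICKR_SUB_PROVIDERS = {
    "nasa": {"24662369@N07", "35067687@N04"},
    "bio_diversity": {"61021753@N02"},
    "spacex": {"130608600@N05"},
}

# Assumption: accounts not listed fall back to the generic Flickr source.
DEFAULT_SOURCE = "flickr"

# Invert the mapping once so each lookup is a single dict access.
OWNER_TO_SOURCE = {
    owner_id: source
    for source, owner_ids in FLICKR_SUB_PROVIDERS.items()
    for owner_id in owner_ids
}


def resolve_source(owner_id: str) -> str:
    """Return the sub-provider for a Flickr owner ID, or the default source."""
    return OWNER_TO_SOURCE.get(owner_id, DEFAULT_SOURCE)
```

With this shape, adding a new high-quality account only means adding its owner ID to the dict; everything else keeps the default source.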

zackkrida avatar Mar 25 '22 17:03 zackkrida

Side note: It'd be awesome to have FLICKR_SUB_PROVIDERS in Airflow or the Django Admin so we could update this list dynamically and extract new subproviders in new DAG runs without deployment.

zackkrida avatar Mar 25 '22 17:03 zackkrida

If we did it in Django we could use Elastic's boosting query functionality to both promote accounts that have lots of high quality CC content and also demote problematic accounts that we don't want to hide entirely.
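
As a rough sketch of what that could look like, the function below builds an Elasticsearch `boosting` query body as a plain Python dict. The field names `title` and `creator_id`, the function name, and the boost values are assumptions for illustration only; they are not taken from the Openverse index mapping.

```python
def build_boosted_query(search_terms, promoted_ids, demoted_ids, negative_boost=0.5):
    """Build a query body that promotes some accounts and demotes others.

    `title` and `creator_id` are hypothetical field names.
    """
    positive = {
        "bool": {
            # Documents must match the search terms to be returned at all.
            "must": [{"match": {"title": search_terms}}],
            # Matching a promoted account adds to the score but is optional.
            "should": [{"terms": {"creator_id": list(promoted_ids), "boost": 2.0}}],
        }
    }
    # Documents matching the negative clause stay in the results, but their
    # score is multiplied by negative_boost (a value between 0 and 1).
    negative = {"terms": {"creator_id": list(demoted_ids)}}
    return {
        "query": {
            "boosting": {
                "positive": positive,
                "negative": negative,
                "negative_boost": negative_boost,
            }
        }
    }
```

The point of `boosting` over a plain filter is that demoted accounts are pushed down the ranking rather than removed, which matches the goal of not hiding them entirely.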

sarayourfriend avatar Mar 25 '22 17:03 sarayourfriend

Side note: It'd be awesome to have FLICKR_SUB_PROVIDERS in Airflow or the Django Admin so we could update this list dynamically and extract new subproviders in new DAG runs without deployment.

I think the current setup is OK, because (down the line) the DAGs will be refreshed every minute or every 5 minutes. Updates to DAGs will reach production pretty quickly! Based on how subproviders work, I'm not sure we'd be able to have them exclusively in Django. Subproviders are reflected in the source column of the catalog and the API, I believe, so we'd have to modify where we're storing that info.

AetherUnbound avatar Mar 30 '22 18:03 AetherUnbound

^ that's a great point! making DAG code changes much more frequent and easy is a way more flexible approach.

zackkrida avatar Mar 30 '22 19:03 zackkrida