
Store dead links in database and expose an API endpoint that the catalog can pull from

Open sarayourfriend opened this issue 3 years ago • 3 comments

Problem

Currently we store dead links only in Redis, as a cache, and periodically re-check the links when the cached entries expire. This means the work the API does to find dead links never gets reported back to the catalog for re-checking. We could, for example, have a DAG that pulls dead links from the API and then iterates through them, checking the provider's API response for each item to determine whether either of the following is true (a rough sketch of such a task follows the list):

  1. The item is truly gone from the provider
  2. The item's url has changed
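
A minimal sketch of what that classification step could look like on the catalog side. The `/v1/dead-links/` endpoint, the `provider_api_url` field, and the redirect heuristic are all assumptions for illustration, not existing Openverse code:

```python
# Hypothetical catalog-side task body; endpoint path and field names are assumed.
import requests

API_ROOT = "https://api.openverse.engineering/v1"


def check_dead_links(access_token: str) -> list[dict]:
    """Pull reported dead links from the API and classify each one."""
    resp = requests.get(
        f"{API_ROOT}/dead-links/",  # assumed endpoint, see the API sketch further below
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    verdicts = []
    for link in resp.json()["results"]:
        provider_resp = requests.get(link["provider_api_url"], timeout=30)
        if provider_resp.status_code == 404:
            verdict = "gone"         # 1. the item is truly gone from the provider
        elif provider_resp.history:  # the request was redirected at least once
            verdict = "url_changed"  # 2. the item's url has changed
        else:
            verdict = "still_alive"
        verdicts.append({"identifier": link["identifier"], "verdict": verdict})
    return verdicts
```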

Description

To enable this, we'd need to make it possible for the catalog to pull dead links from the API database. The easiest way to do this (I think) would be to store the dead links in the database and expose an endpoint the catalog can ingest and process them from. This endpoint could (and probably should) be restricted to a specific OAuth scope so that it isn't publicly available.

It also seems likely that we'd want the catalog to be able to update the dead links list in the API. Either that would need to happen during ingestion, or we'd need to open a write endpoint on the API that the catalog can hit to clear a dead link in real time without waiting for re-ingestion. That might not be necessary though :thinking: We would, however, want to invalidate the cached dead link status in Redis on the next data refresh of those items, which suggests that either the catalog should write to Redis or Redis needs to be updated during the data refresh.
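
A minimal sketch of what storing and exposing the dead links could look like in the API, assuming Django and Django REST Framework (which the API already uses). The model fields, the `dead-links:read` scope name, and the permission check are all assumptions:

```python
# Sketch only: field names, the scope, and the permission logic are assumptions.
from django.db import models
from rest_framework import serializers, viewsets
from rest_framework.permissions import BasePermission


class DeadLink(models.Model):
    """A link the API has observed to be dead, persisted for the catalog to pull."""

    identifier = models.UUIDField(db_index=True)  # media identifier
    url = models.URLField(max_length=2000)
    status_code = models.IntegerField(null=True)  # e.g. 404 or 301
    observed_at = models.DateTimeField(auto_now_add=True)


class HasDeadLinkScope(BasePermission):
    """Only allow OAuth tokens granted a hypothetical 'dead-links:read' scope."""

    def has_permission(self, request, view):
        token = getattr(request, "auth", None)
        return bool(token) and "dead-links:read" in getattr(token, "scope", "").split()


class DeadLinkSerializer(serializers.ModelSerializer):
    class Meta:
        model = DeadLink
        fields = ["identifier", "url", "status_code", "observed_at"]


class DeadLinkViewSet(viewsets.ReadOnlyModelViewSet):
    """Read-only endpoint the catalog could poll, e.g. registered at /v1/dead-links/."""

    queryset = DeadLink.objects.order_by("observed_at")
    serializer_class = DeadLinkSerializer
    permission_classes = [HasDeadLinkScope]
```

In practice the hand-rolled permission could probably be replaced with django-oauth-toolkit's `TokenHasScope`, since the API already uses that library for OAuth.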

Alternatives

Could the catalog pull directly from Redis? That seems like a hack though.

Additional context

cc @stacimc and @AetherUnbound for expertise from the catalog side to know whether this is really possible or if there is another more sensible approach.

Implementation

  • [ ] 🙋 I would be interested in implementing this feature.

sarayourfriend · Aug 04 '22 15:08

This is a great idea! The catalog could potentially pull from Redis (as there's an associated provider). If the cache entries have a TTL, though, it'd be better to have a more permanent record in the database. We also already have connections to the API database from Airflow, so we wouldn't need to set up a new connection in that case.
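
Since Airflow already has a connection to the API database, the catalog side could be as small as one query in a task. A sketch, assuming a connection id of `postgres_openverse_api` and a `deadlink` table shaped like the model sketched above (both names are assumptions):

```python
# Sketch only; the connection id and table/column names are assumptions.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_dead_links() -> list[tuple]:
    """Read reported dead links through the existing API database connection."""
    hook = PostgresHook(postgres_conn_id="postgres_openverse_api")
    return hook.get_records(
        "SELECT identifier, url, status_code, observed_at::text "
        "FROM deadlink ORDER BY observed_at"
    )
```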

Flushing the Redis dead link cache during the data refresh would be feasible as well!
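
Flushing could be a simple prefix scan in a data refresh step. A sketch using redis-py, assuming the cached statuses live under a `valid:` key prefix (the prefix and connection details are assumptions based on this thread):

```python
# Sketch: delete cached link-validity keys so they get re-checked after a refresh.
import redis


def flush_dead_link_cache(client: redis.Redis, prefix: str = "valid:") -> int:
    """Delete all cached link-validity keys and return how many were removed."""
    deleted = 0
    # scan_iter avoids blocking Redis the way KEYS would on a large keyspace.
    for key in client.scan_iter(match=f"{prefix}*", count=1000):
        deleted += client.delete(key)
    return deleted


# Example usage, e.g. from a data refresh step:
# flush_dead_link_cache(redis.Redis(host="localhost", port=6379))
```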

AetherUnbound · Aug 04 '22 18:08

I listened to a podcast about using Redis for more than just cache and was going to suggest pulling directly from Redis while reading the issue description.

We could make the cache for statuses such as 404 or 301 persistent so that those keys are only removed when we flush the Redis dead link cache during the data refresh. We could even save the dead links as a TSV in S3 until we have a process for removing the dead links from the upstream database.
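
A sketch of both ideas, making 404/301 statuses persistent (no TTL) and dumping them to a TSV in S3. The key naming, TTL, bucket, and object key are all assumptions:

```python
# Sketch only; key scheme, TTL, and S3 locations are assumptions for illustration.
import csv
import io

import boto3
import redis


def cache_link_status(client: redis.Redis, url: str, status: int) -> None:
    """Persist dead statuses indefinitely; keep a TTL for everything else."""
    if status in (404, 301):
        client.set(f"valid:{url}", status)  # no ex=/px=, so the key never expires
    else:
        client.set(f"valid:{url}", status, ex=30 * 24 * 60 * 60)  # assumed 30-day TTL


def dump_dead_links_tsv(client: redis.Redis, bucket: str, key: str) -> None:
    """Write the currently known dead links to S3 as a TSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerow(["url", "status"])
    for redis_key in client.scan_iter(match="valid:*"):
        status = client.get(redis_key)
        if status in (b"404", b"301"):
            writer.writerow([redis_key.decode().removeprefix("valid:"), status.decode()])
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode())
```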

obulat · Aug 09 '22 14:08

Another thing we could do, if we wanted to persist these in Redis without changing too much, is create a `valid:history:` prefix under which a record with the status and date gets inserted each time the status is retrieved for a given result url.
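
A minimal sketch of that idea, using one Redis list per result url; the record shape and key format are assumptions:

```python
# Sketch of the proposed valid:history: prefix; the record shape is an assumption.
import json
from datetime import datetime, timezone

import redis


def record_link_check(client: redis.Redis, url: str, status: int) -> None:
    """Append a {status, date} record every time a result url's status is checked."""
    entry = json.dumps(
        {"status": status, "date": datetime.now(timezone.utc).isoformat()}
    )
    client.rpush(f"valid:history:{url}", entry)


def link_history(client: redis.Redis, url: str) -> list[dict]:
    """Return the full check history for a url, oldest record first."""
    return [json.loads(raw) for raw in client.lrange(f"valid:history:{url}", 0, -1)]
```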

Then, yeah, Airflow could pull directly from Redis and get a full history of the dead links! That would be sweet.

The only issue, though, is that Redis does not allow querying by value, as far as I understand. If we wanted the DAG to, for example, query only for certain time ranges (to handle just the dead links that have been updated since the last run), we wouldn't really be able to do so efficiently without scanning the keys and filtering the records ourselves (e.g., with a Lua script via `EVAL`).
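
To make that limitation concrete, filtering the hypothetical `valid:history:` records from above by date would have to happen client-side after scanning every key, roughly like this (a sketch; `cutoff` must be timezone-aware to compare against the UTC timestamps stored above):

```python
# Sketch of client-side filtering; builds on the hypothetical records above.
import json
from datetime import datetime

import redis


def dead_links_since(client: redis.Redis, cutoff: datetime) -> dict[str, dict]:
    """Scan every history key and keep urls whose latest check is newer than cutoff."""
    recent = {}
    for key in client.scan_iter(match="valid:history:*"):
        raw = client.lrange(key, -1, -1)  # most recent record only
        if not raw:
            continue
        latest = json.loads(raw[0])
        if datetime.fromisoformat(latest["date"]) >= cutoff:
            recent[key.decode().removeprefix("valid:history:")] = latest
    return recent
```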

sarayourfriend · Aug 09 '22 14:08