openverse icon indicating copy to clipboard operation
openverse copied to clipboard

Check crawled images have the correct URI protocol

Open krysal opened this issue 3 years ago • 2 comments

Description

Old images from the ingestion through the common crawl process don't follow the recommended HTTPS protocol (or any at all) which is available in some cases. This is fixed in a step of the reingestion process but we want to delete this step moving forward, so we need to ensure this data is fixed in the upstream database.

Cleanup in the Ingestion server

https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L72-L92

https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L149-L158

Benefit

See WordPress/openverse#1663.

krysal avatar Oct 07 '22 16:10 krysal

@krysal it seems like this will be handled by #4163 and the subsequent batched update using the output TSVs, is that correct? Is there any work that needs to be done on this explicitly?

AetherUnbound avatar Apr 24 '24 20:04 AetherUnbound

I ran some queries just so we could get some perspective on the affected records:

openledger=> select count(*) from image where url not ilike 'http%';
 count
-------
 22157
(1 row)


openledger=> select count(*) from image where creator_url not ilike 'http%';
  count
---------
 9035590
(1 row)


openledger=> select count(*) from image where foreign_landing_url not ilike 'http%';
  count
---------
 8809342
(1 row)

AetherUnbound avatar Apr 24 '24 21:04 AetherUnbound

@krysal it seems like this will be handled by #4163 and the subsequent batched update using the output TSVs, is that correct? Is there any work that needs to be done on this explicitly?

That's correct. It is a verification task after processing the files of cleaned data. We should wait for several ingestion workflows, and data refreshes to run to verify that new images have the fields filled out correctly.

krysal avatar May 24 '24 13:05 krysal

The last DAG run for the image data refresh didn't produce new files and I checked querying upstream too:

openledger=> SELECT COUNT(*) FROM image WHERE creator_url NOT LIKE 'https://%' AND creator_url NOT LIKE 'http://%';
 count
-------
     0
(1 row)

openledger=> SELECT COUNT(*) FROM image WHERE foreign_landing_url NOT LIKE 'https://%' AND foreign_landing_url NOT LIKE 'http://%';
 count
-------
     0
(1 row)

openledger=> SELECT COUNT(*) FROM image WHERE url NOT LIKE 'https://%' AND url NOT LIKE 'http://%';
 count
-------
     0
(1 row)

This is done.

krysal avatar Aug 08 '24 19:08 krysal