openverse
openverse copied to clipboard
Check crawled images have the correct URI protocol
Description
Old images from the ingestion through the common crawl process don't follow the recommended HTTPS protocol (or any at all) which is available in some cases. This is fixed in a step of the reingestion process but we want to delete this step moving forward, so we need to ensure this data is fixed in the upstream database.
Cleanup in the Ingestion server
https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L72-L92
https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L149-L158
Benefit
See WordPress/openverse#1663.
@krysal it seems like this will be handled by #4163 and the subsequent batched update using the output TSVs, is that correct? Is there any work that needs to be done on this explicitly?
I ran some queries just so we could get some perspective on the affected records:
openledger=> select count(*) from image where url not ilike 'http%';
count
-------
22157
(1 row)
openledger=> select count(*) from image where creator_url not ilike 'http%';
count
---------
9035590
(1 row)
openledger=> select count(*) from image where foreign_landing_url not ilike 'http%';
count
---------
8809342
(1 row)
@krysal it seems like this will be handled by #4163 and the subsequent batched update using the output TSVs, is that correct? Is there any work that needs to be done on this explicitly?
That's correct. It is a verification task after processing the files of cleaned data. We should wait for several ingestion workflows, and data refreshes to run to verify that new images have the fields filled out correctly.
The last DAG run for the image data refresh didn't produce new files and I checked querying upstream too:
openledger=> SELECT COUNT(*) FROM image WHERE creator_url NOT LIKE 'https://%' AND creator_url NOT LIKE 'http://%';
count
-------
0
(1 row)
openledger=> SELECT COUNT(*) FROM image WHERE foreign_landing_url NOT LIKE 'https://%' AND foreign_landing_url NOT LIKE 'http://%';
count
-------
0
(1 row)
openledger=> SELECT COUNT(*) FROM image WHERE url NOT LIKE 'https://%' AND url NOT LIKE 'http://%';
count
-------
0
(1 row)
This is done.