openverse-api
Wrong message logged from validation of broken links
Description
The validation function assumes every link belongs to an image, so it logs something like this:
openverse-api-web-1 | [2022-05-27 18:30:49,758 - catalog.api.utils.validate_images - 89][INFO] Deleting broken image with ID 4cba47fc-1ba6-43e1-adda-646bf7c4e0ae from results.
However, looking up the identifier in the database shows that it is not in the image table but in the audio table.
SELECT * FROM image WHERE identifier = '4cba47fc-1ba6-43e1-adda-646bf7c4e0ae';
--- Returns 0 rows
SELECT * FROM audio WHERE identifier = '4cba47fc-1ba6-43e1-adda-646bf7c4e0ae';
--- Returns 1 row
The following utility function should be generalized to apply to more media types (in the immediate case, for audio).
https://github.com/WordPress/openverse-api/blob/a76a6de6a2effd221d3486996a97bcb1370370d7/api/catalog/api/utils/validate_images.py#L13-L17
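As a rough illustration of the generalization suggested above, the log message could take the media type as a parameter instead of hard-coding "image". This is only a sketch: `broken_link_message`, `drop_broken_results`, and their arguments are hypothetical names and do not reflect the actual contents of `validate_images.py`.

```python
import logging

logger = logging.getLogger(__name__)


def broken_link_message(media_type: str, identifier: str) -> str:
    """Build the log message for a result dropped because its link is dead."""
    return f"Deleting broken {media_type} with ID {identifier} from results."


def drop_broken_results(results, is_alive, media_type):
    """Keep only results whose link is alive, logging each removal.

    Hypothetical inputs: `results` is a list of dicts with an "identifier"
    key, and `is_alive` is a parallel list of booleans.
    """
    kept = []
    for result, alive in zip(results, is_alive):
        if alive:
            kept.append(result)
        else:
            # The media type is passed in, so audio is no longer logged as image.
            logger.info(broken_link_message(media_type, result["identifier"]))
    return kept
```

With this shape, the caller for the audio endpoint would pass `media_type="audio"` and the log line would read "Deleting broken audio with ID ... from results."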
Reproduction
- Can be observed when running the tests for audio
just api-test -k audio_integration
- See the logs with
just logs web
- Confirm one of the identifiers by running the queries above.
Resolution
- [ ] 🙋 I would be interested in resolving this bug.
There are two tasks here.
- Enhancing log: Do we have a one-to-one mapping between ES index and media types? If yes, we can use the ES index name in the log to identify the database table.
- Refactor validate_images module: Refactor the validate_images module to include audio files as well. We can use validate_media as a better module name and make it more generic.
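If the mapping between ES index and media type is indeed one-to-one, the first task could be sketched roughly as below. The helper `media_type_from_index`, the `KNOWN_MEDIA_TYPES` tuple, and the assumption that an index name either equals the media type or starts with it followed by a hyphen are all illustrative guesses, not taken from the codebase.

```python
# Assumption: these are the media types that have their own ES index.
KNOWN_MEDIA_TYPES = ("image", "audio")


def media_type_from_index(index_name: str) -> str:
    """Derive the media type (and so the database table) from an ES index name.

    Assumes the index name is the media type itself or the media type
    followed by a hyphenated suffix. Falls back to the generic word
    "media" when the name is not recognized.
    """
    for media_type in KNOWN_MEDIA_TYPES:
        if index_name == media_type or index_name.startswith(f"{media_type}-"):
            return media_type
    return "media"
```

The fallback keeps the log message readable for an unexpected index name rather than raising inside a logging path.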
Hey @krysal I would like to take up this issue :)
@ritesh-pandey
Do we have a one to one mapping between ES index and media types?
That's correct, each media has its own ES index, and your suggestion of subdividing into two tasks sounds good.
@Mishrasubha Thanks for showing your interest. Please go ahead!