openverse
Some images have duplicate incorrectly decoded unicode tags
Description
Some media with non-ascii characters in tags that were ingested a long time ago has duplicate tags: one with a correct utf-8 letter and one with an incorrectly escaped sequence.
Reproduction
- Go to https://api.openverse.engineering/v1/images/ab5150da-5d83-47ec-ad66-bf08dcfef78f/ (or https://search-production.openverse.engineering/image/ab5150da-5d83-47ec-ad66-bf08dcfef78f for the frontend view)
- Look at the tags list.
- See error: There are many unreadable tags, many of which are duplicated. For example, "arapça", with a ç (c with a cedilla), and arapu00e7a, where the ç was replaced with the incorrectly escaped sequence u00e7 (the Unicode code point for this letter, without the leading \ escape character).
- This is the way they are saved in the catalog:
{"name": "arapça", "provider": "flickr"}, {"name": "arapu00e7a", "provider": "flickr"},
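To make the duplication concrete, here is a minimal Python sketch that pairs a mangled tag with its correctly decoded twin. It assumes tags are plain dicts shaped like the catalog sample above; the helper names `decode_bare_escapes` and `find_mangled_duplicates` are hypothetical and are not part of the Openverse codebase.

```python
import re

# Matches a bare "uXXXX" sequence: a Unicode escape that lost its backslash.
BARE_ESCAPE = re.compile(r"u([0-9a-fA-F]{4})")


def decode_bare_escapes(name: str) -> str:
    """Replace every bare uXXXX sequence with the character it encodes."""
    return BARE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def find_mangled_duplicates(tags: list[dict]) -> list[tuple[str, str]]:
    """Return (mangled, correct) pairs where decoding one tag yields another."""
    names = {tag["name"] for tag in tags}
    pairs = []
    for name in sorted(names):
        decoded = decode_bare_escapes(name)
        # Only report a tag as mangled if its decoded form is also present.
        if decoded != name and decoded in names:
            pairs.append((name, decoded))
    return pairs


tags = [
    {"name": "arapça", "provider": "flickr"},
    {"name": "arapu00e7a", "provider": "flickr"},
]
print(find_mangled_duplicates(tags))  # [('arapu00e7a', 'arapça')]
```

Requiring the decoded form to already exist among the tags keeps this conservative. Decoding on its own would be unsafe: a legitimate tag such as menu2020 also contains a u followed by four hex digits and would be rewritten.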
Screenshots
Tags displayed on the frontend:

Additional context
I think we also had the same problem for other details such as title and description, but most of those were fixed on re-ingestion. When we upsert tags, we add every incoming tag that differs from the ones already saved. Since the correctly decoded tag looks different from the mangled one, both were saved.
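The merge behaviour described above can be illustrated with a small Python sketch. This is not the catalog's actual upsert query; `merge_tags` is a hypothetical stand-in that assumes tags are compared by their exact name and provider.

```python
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Append incoming tags whose (name, provider) pair is not already saved."""
    seen = {(tag["name"], tag["provider"]) for tag in existing}
    merged = list(existing)
    for tag in incoming:
        key = (tag["name"], tag["provider"])
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


# The mangled tag was saved first; re-ingestion brings the correct one.
saved = [{"name": "arapu00e7a", "provider": "flickr"}]
reingested = [{"name": "arapça", "provider": "flickr"}]

for tag in merge_tags(saved, reingested):
    print(tag["name"])
```

Because the comparison is an exact string match, the re-ingested tag never replaces the mangled one, and the record ends up with both.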
This item has a non-mangled title and mangled and non-mangled tags, which suggests that the titles were fixed, and the tags were simply added to: https://api.openverse.engineering/v1/images/829eb0a7-3ce8-44ca-8194-4a78757a88aa/
There is also an over-correction of the Unicode decoding error: instead of the backslash before u being removed, it is escaped by another backslash, so arapu00e7a becomes arap\\u00e7a.
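As a rough illustration, the variants mentioned so far (no backslash, one backslash, and a doubled backslash) can all be folded into the same character with one pattern. This Python sketch assumes those are the only forms that occur; `normalize_escapes` is a hypothetical name, and this is not the frontend's implementation.

```python
import re

# Zero or more literal backslashes, then "u" and four hex digits.
ESCAPE_VARIANTS = re.compile(r"\\*u([0-9a-fA-F]{4})")


def normalize_escapes(name: str) -> str:
    """Decode uXXXX, \\uXXXX and \\\\uXXXX style sequences to one character."""
    return ESCAPE_VARIANTS.sub(lambda m: chr(int(m.group(1), 16)), name)


# Python string literals: "\\" is one backslash, "\\\\" is two.
variants = ["arapu00e7a", "arap\\u00e7a", "arap\\\\u00e7a"]
for variant in variants:
    print(normalize_escapes(variant))  # arapça each time
```

The pattern is intentionally permissive, so it has the same false-positive risk as any approach that infers escapes from bare text.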
On the frontend, we compensate for this problem for title, creator and tag name in decode-string: https://github.com/WordPress/openverse-frontend/blob/26fb744449cbe4c25b895c75fad57ab2646b1737/src/utils/decode-data.ts
We may be able to do some analysis on the tags to determine the provider or range of dates this is limited to!
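One possible shape for that analysis, as a Python sketch: count tags that look mangled, grouped by provider. The record layout (a dict with a `tags` list) and the function name are assumptions made for illustration; a real analysis would run against the catalog and could group by ingestion date in the same way.

```python
import re
from collections import Counter

# A "u" plus four hex digits, optionally preceded by backslashes.
SUSPECT = re.compile(r"\\*u[0-9a-fA-F]{4}")


def count_suspect_tags_by_provider(records: list[dict]) -> Counter:
    """Count tags whose name contains a suspicious escape, per tag provider."""
    counts: Counter = Counter()
    for record in records:
        for tag in record.get("tags") or []:
            if SUSPECT.search(tag.get("name", "")):
                counts[tag.get("provider", "unknown")] += 1
    return counts


# Hypothetical sample data, not taken from the catalog.
records = [
    {"tags": [
        {"name": "arapça", "provider": "flickr"},
        {"name": "arapu00e7a", "provider": "flickr"},
    ]},
    {"tags": None},
]
print(count_suspect_tags_by_provider(records))  # Counter({'flickr': 1})
```

The counts are only an upper bound on mangled tags, since legitimate names can match the pattern too.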
This will be solved by https://github.com/WordPress/openverse/issues/4452 and there are no ways to address this that do not run into the exact same problems of "inferring" unescaped unicode as exist for that issue. I do not believe there is anything unique to do for this issue aside from just continuing the work on https://github.com/WordPress/openverse/issues/4452.
@AetherUnbound and @obulat if y'all agree with that (please let me know if not), feel free to close this as won't do; I didn't want to close it without your explicit input in case I have missed something that makes this unrelated to the issue I linked.
I have seen this and will respond to this discussion when I have time!