openverse
Some images have duplicate incorrectly decoded unicode tags
Description
Some media with non-ASCII characters in their tags, ingested a long time ago, have duplicate tags: one with the correct UTF-8 letter and one with an incorrectly escaped sequence.
Reproduction
- Go to https://api.openverse.engineering/v1/images/ab5150da-5d83-47ec-ad66-bf08dcfef78f/ (or https://search-production.openverse.engineering/image/ab5150da-5d83-47ec-ad66-bf08dcfef78f for the frontend view)
- Look at the tags list.
- See error: There are many unreadable tags, many of which are duplicated. For example, "arapça" with a `ç` (c with a cedilla), and `arapu00e7a`, where that character was replaced with an incorrectly escaped `ç` as `u00e7` (this is the Unicode code point for this letter, without the `\` escape character).
- This is the way they are saved in the catalog: `{"name": "arapça", "provider": "flickr"}, {"name": "arapu00e7a", "provider": "flickr"},`
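To make the duplication concrete, here is a minimal sketch of how such pairs could be spotted in a tag list shaped like the catalog snippet above. The function name `find_mangled_duplicates` is hypothetical, and the regular expression assumes the mangled form is always a bare `u` followed by four hex digits; this is an illustration, not catalog code.

```python
import re

# Assumption: a mangled tag holds "u" plus four hex digits where the original
# character should be, e.g. "arapu00e7a" for "arapça".
_BARE_ESCAPE = re.compile(r"u([0-9a-fA-F]{4})")


def find_mangled_duplicates(tags):
    """Return (mangled, clean) name pairs that exist for the same provider."""
    names = {(t["name"], t["provider"]) for t in tags}
    pairs = []
    for tag in tags:
        # Turn every bare "uXXXX" sequence back into the character it encodes.
        repaired = _BARE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), tag["name"])
        # Report only when the repaired name is itself stored as a separate tag,
        # which keeps ordinary words that happen to match the pattern out.
        if repaired != tag["name"] and (repaired, tag["provider"]) in names:
            pairs.append((tag["name"], repaired))
    return pairs


tags = [
    {"name": "arapça", "provider": "flickr"},
    {"name": "arapu00e7a", "provider": "flickr"},
]
pairs = find_mangled_duplicates(tags)
```

Requiring the clean counterpart to be present is a deliberate choice: it avoids rewriting a tag whose name merely looks like an escape sequence.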
Screenshots
Tags displayed on the frontend:
Additional context
I think we also had the same problem for other details such as title and description, but most of them were fixed when re-ingested. When we upsert the tags, we add all the tags that are different from the ones already saved. And since the new tag appears different than the mangled one, both were saved.
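The upsert behaviour described above can be sketched roughly as follows. This is an assumed simplification, not the actual catalog code: `merge_tags` is a hypothetical helper that appends every incoming tag whose exact `(name, provider)` pair is not already stored.

```python
def merge_tags(existing, incoming):
    """Append incoming tags that differ from every stored tag (exact match only)."""
    seen = {(t["name"], t["provider"]) for t in existing}
    merged = list(existing)
    for tag in incoming:
        key = (tag["name"], tag["provider"])
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


# The stored tag is mangled; re-ingestion delivers the correctly decoded tag.
old = [{"name": "arapu00e7a", "provider": "flickr"}]
new = [{"name": "arapça", "provider": "flickr"}]
result = merge_tags(old, new)
# Both tags survive, because the comparison sees two different strings.
```

Because the comparison is on the raw string, the mangled tag is never replaced; the corrected tag is added next to it.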
This item has a non-mangled title and mangled and non-mangled tags, which suggests that the titles were fixed, and the tags were simply added to: https://api.openverse.engineering/v1/images/829eb0a7-3ce8-44ca-8194-4a78757a88aa/
There is also an error of over-correction of the unicode decoding error. Instead of removing the backslash before `u`, the backslash is escaped by another backslash, so `arapu00e7a` becomes `arap\\u00e7a`.
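As an illustration of what a repair step would need to cover, the sketch below decodes three forms: the bare `u00e7`, the single-backslash `\u00e7`, and the over-corrected `\\u00e7`. The function name `decode_mangled` and the regular expression are assumptions made for this example. It does not handle surrogate pairs, and it can misfire on ordinary text that contains `u` followed by four hex digits.

```python
import re

# Zero, one, or two literal backslashes, then "u" and four hex digits.
_ESCAPE = re.compile(r"\\{0,2}u([0-9a-fA-F]{4})")


def decode_mangled(text):
    """Replace each escaped or half-escaped code point with its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


bare = decode_mangled("arapu00e7a")               # backslash was dropped
single = decode_mangled(r"arap\u00e7a")           # escape left undecoded
over_corrected = decode_mangled(r"arap\\u00e7a")  # backslash escaped again
```

All three inputs decode to the same name, so a cleanup pass built this way could then drop the resulting duplicates.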
On the frontend, we compensate for this problem for title, creator and tag name in `decode-string`: https://github.com/WordPress/openverse-frontend/blob/26fb744449cbe4c25b895c75fad57ab2646b1737/src/utils/decode-data.ts