
Some images have duplicate incorrectly decoded unicode tags

Open obulat opened this issue 2 years ago • 4 comments

Description

Some media items that were ingested a long time ago and have non-ASCII characters in their tags have duplicate tags: one with the correct UTF-8 character and one with an incorrectly escaped sequence.

Reproduction

  1. Go to https://api.openverse.engineering/v1/images/ab5150da-5d83-47ec-ad66-bf08dcfef78f/ (or https://search-production.openverse.engineering/image/ab5150da-5d83-47ec-ad66-bf08dcfef78f for the frontend view)
  2. Look at the tags list.
  3. See error: many tags are unreadable, and many of them are duplicated. For example, the list contains both "arapça", with the correct ç (c with cedilla), and "arapu00e7a", where the ç was replaced with the incorrectly escaped sequence u00e7 (the unicode code point for this letter, without the \ control character).
  4. This is how they are saved in the catalog: {"name": "arapça", "provider": "flickr"}, {"name": "arapu00e7a", "provider": "flickr"}
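
A minimal sketch of how the duplicated pair above could be detected and collapsed, assuming the mangled form is always a bare `uXXXX` sequence that lost its backslash. The helper names `repair_tag_name` and `dedupe_tags` are hypothetical and are not part of the Openverse codebase; the tag dictionaries mirror the catalog excerpt in step 4.

```python
import re

# A bare "uXXXX" sequence: a unicode escape that lost its leading backslash.
# Assumption: any "u" followed by four hex digits is a mangled escape. A real
# word containing such a sequence would be a false positive, and code points
# outside the Basic Multilingual Plane (surrogate pairs) are not handled.
BARE_ESCAPE = re.compile(r"u([0-9a-fA-F]{4})")


def repair_tag_name(name: str) -> str:
    """Replace each bare uXXXX sequence with the character it encodes."""
    return BARE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def dedupe_tags(tags: list[dict]) -> list[dict]:
    """Repair tag names, then keep the first tag per (name, provider) pair."""
    seen = set()
    result = []
    for tag in tags:
        fixed = {**tag, "name": repair_tag_name(tag["name"])}
        key = (fixed["name"], fixed.get("provider"))
        if key not in seen:
            seen.add(key)
            result.append(fixed)
    return result


tags = [
    {"name": "arapça", "provider": "flickr"},
    {"name": "arapu00e7a", "provider": "flickr"},
]
deduped = dedupe_tags(tags)
print(deduped)  # [{'name': 'arapça', 'provider': 'flickr'}]
```

Because the pattern is heuristic, a real cleanup would probably only drop a repaired tag when its correctly decoded twin already exists on the same item, rather than rewriting every match.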

Screenshots

Tags displayed on the frontend: (screenshot, 2023-01-09 at 12:13:56 PM)

Additional context

I think we also had the same problem for other details such as title and description, but most of them were fixed when re-ingested. When we upsert the tags, we add all the tags that are different from the ones already saved. Since the correctly decoded tag differs from the mangled one, both were saved.
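
The effect of that upsert rule can be illustrated with a small model. This is only a sketch of the behaviour described above, not the catalog's actual upsert query; `merge_tags` is a hypothetical name.

```python
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Append every incoming tag that is not already saved (exact match only)."""
    merged = list(existing)
    for tag in incoming:
        if tag not in merged:
            merged.append(tag)
    return merged


# Saved during the original ingestion, with the mangled escape.
existing = [{"name": "arapu00e7a", "provider": "flickr"}]
# Arrives on re-ingestion, correctly decoded.
incoming = [{"name": "arapça", "provider": "flickr"}]

merged = merge_tags(existing, incoming)
print(merged)
# [{'name': 'arapu00e7a', 'provider': 'flickr'}, {'name': 'arapça', 'provider': 'flickr'}]
```

The comparison is an exact match on the whole tag, so the correctly decoded name never replaces the mangled one; it is stored next to it.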

This item has a non-mangled title but both mangled and non-mangled tags, which suggests that the titles were fixed, while the new tags were simply added alongside the old ones: https://api.openverse.engineering/v1/images/829eb0a7-3ce8-44ca-8194-4a78757a88aa/

There is also an over-correction of the unicode decoding error. Instead of the backslash before u being removed, the backslash is escaped by another backslash, so arapu00e7a becomes arap\\u00e7a. On the frontend, we compensate for this problem for title, creator and tag name in decode-string: https://github.com/WordPress/openverse-frontend/blob/26fb744449cbe4c25b895c75fad57ab2646b1737/src/utils/decode-data.ts
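
For the over-corrected variant, a decoding step could look roughly like the sketch below. It is not a port of the frontend's decode-data.ts; `decode_escaped` is a hypothetical helper that assumes the stored string contains one or more literal backslashes followed by `uXXXX`.

```python
import re

# One or more literal backslashes followed by uXXXX, which covers both the
# single-backslash form and the over-corrected double-backslash form.
ESCAPED = re.compile(r"\\+u([0-9a-fA-F]{4})")


def decode_escaped(text: str) -> str:
    """Replace literal \\uXXXX sequences with the character they encode."""
    return ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), text)


# In Python source, "\\" is one literal backslash character.
print(decode_escaped("arap\\u00e7a"))    # arapça (one stored backslash)
print(decode_escaped("arap\\\\u00e7a"))  # arapça (two stored backslashes)
print(decode_escaped("arapça"))          # arapça (already correct, unchanged)
```

Requiring at least one backslash keeps this pattern much safer than matching a bare `uXXXX`, since ordinary words cannot trigger it.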

obulat, Jan 09 '23 10:01