
proposal: x/text/encoding: Handling Encoding Errors by Replacing Visually Similar Unicode Characters in ShiftJIS Encoding

Open yuki2006 opened this issue 1 year ago • 1 comments

Proposal Details

Summary

When encoding Unicode strings to Shift JIS in Go, some characters that look almost identical to Shift JIS characters have different code points and cannot be represented in Shift JIS, so encoding fails. This is confusing because the text looks encodable yet produces an error. This proposal suggests introducing a normalization step that replaces these problematic characters with their Shift JIS-compatible equivalents before encoding. The transformation is one-way and the original characters cannot be restored, which is acceptable for our use case.

Background

Shift JIS is a character encoding for the Japanese language but does not support all Unicode characters. Some visually similar characters have different code points and cannot be encoded in Shift JIS, causing encoding errors and confusion.

Examples:

The Unicode character "〜" (U+301C, wave dash) looks similar to "～" (U+FF5E, fullwidth tilde).

The Unicode character "−" (U+2212, minus sign) resembles the standard hyphen "-" (U+002D, hyphen-minus).

These visually similar characters are often used interchangeably in text but may cause encoding errors when converting to Shift JIS. In our application, it is acceptable that the transformation is not reversible; we prioritize successful encoding over the ability to revert to the original characters.

Proposal

Introduce a normalization function that replaces visually similar Unicode characters, which cannot be encoded in Shift JIS, with their equivalent characters that can be encoded. This function can be integrated into the encoding process or provided as a utility in the golang.org/x/text/encoding/japanese package.

https://go.dev/play/p/OtEWoZmxDzb

package main

import (
	"fmt"

	"golang.org/x/text/encoding/japanese"
	"golang.org/x/text/transform"
)

func main() {
	replacements := map[string]string{
		"〜": "～", // U+301C (Wave Dash) → U+FF5E (Fullwidth Tilde)
		"−": "-", // U+2212 (Minus Sign) → U+002D (Hyphen-Minus)
		"—": "-", // U+2014 (Em Dash) → U+002D (Hyphen-Minus)
		"•": "*", // U+2022 (Bullet) → U+002A (Asterisk)
	}

	encoder := japanese.ShiftJIS.NewEncoder()

	for orig, replacement := range replacements {
		// Check if the original character can be encoded
		_, _, errOrig := transform.String(encoder, orig)
		// Check if the replacement character can be encoded
		_, _, errReplacement := transform.String(encoder, replacement)

		if errOrig == nil {
			fmt.Printf("Mapping may be unnecessary: Original character %q can be encoded.\n", orig)
		} else {
			fmt.Printf("Mapping necessary: Original character %q cannot be encoded: %v\n", orig, errOrig)
		}
		if errReplacement != nil {
			fmt.Printf("Warning: Replacement character %q cannot be encoded: %v\n", replacement, errReplacement)
		}
	}
}

Output

Mapping necessary: Original character "•" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "〜" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "−" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "—" cannot be encoded: encoding: rune not supported by encoding.
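
The program above only checks which mappings are needed; it does not perform the replacement. A minimal sketch of the normalization step itself, using only the standard library's strings.NewReplacer, might look like the following. The names sjisFallbacks and normalizeForShiftJIS are hypothetical, and the table is just the four illustrative pairs from this issue, not an official mapping.

```go
package main

import (
	"fmt"
	"strings"
)

// sjisFallbacks is an illustrative, non-authoritative table built from the
// four pairs listed in this issue. It is an assumption, not a standard.
var sjisFallbacks = strings.NewReplacer(
	"\u301C", "\uFF5E", // WAVE DASH -> FULLWIDTH TILDE
	"\u2212", "-", // MINUS SIGN -> HYPHEN-MINUS
	"\u2014", "-", // EM DASH -> HYPHEN-MINUS
	"\u2022", "*", // BULLET -> ASTERISK
)

// normalizeForShiftJIS (hypothetical helper) replaces the characters in the
// table before the string is handed to the Shift JIS encoder.
// The replacement is one-way: the original characters cannot be recovered.
func normalizeForShiftJIS(s string) string {
	return sjisFallbacks.Replace(s)
}

func main() {
	in := "10\u22123 \u301C 20"
	// %+q prints non-ASCII runes as escapes so the code points are visible.
	fmt.Printf("%+q -> %+q\n", in, normalizeForShiftJIS(in))
}
```

The normalized string can then be passed to japanese.ShiftJIS.NewEncoder() as in the program above.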

yuki2006, Oct 18 '24 08:10

CC @mpvl

ianlancetaylor, Oct 18 '24 17:10

If this is a wise approach, and it well may be, there should already be an official defining table for how to handle the translation. Go's implementation should not be the one to codify it.

robpike, Oct 20 '24 03:10

Thank you for your comment. Indeed, it might not be appropriate for the Go standard library to create a definition table.

In that case, would it be possible to identify which character (and at which position) failed to encode, and furthermore, allow us to specify a fallback when the conversion fails? (It might be convenient if we could specify a callback function, for example.)

https://go.dev/play/p/Jg6oE7cko4i

Postscript: It seems we can identify the location by using the return value n from transform.String.

yuki2006, Oct 21 '24 04:10

Japanese versions of Windows still treat file names as Shift_JIS in some processes. This is not limited to Japanese; the same applies in China and Korea, where double-byte character sets are used. Go, which uses UTF-8 as its internal encoding, has almost no problems when it uses the Windows wide-character API to handle file names, but when Go passes specific file names on the command line, it must handle Shift_JIS. In such cases, we want to use fallback characters to replace characters that exist only in Unicode.

mattn, Oct 21 '24 04:10