`require-ascii` doesn’t do what it says on the tin

Open Jayman2000 opened this issue 3 years ago • 0 comments

According to the README:

require-ascii

What it does

Requires that text files have ascii-encoding, including the extended ascii set. This is useful to detect files that have unicode characters.

require-ascii will fail on files that are encoded in extended ASCII if:

  1. the file uses characters in the 128–255 range, and
  2. those characters aren’t followed by other characters that coincidentally make the sequence valid UTF-8 (see this table).
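To make the two conditions concrete, here is a minimal sketch. It assumes the failures come from the file not decoding as UTF-8. The helper `is_valid_utf8` is hypothetical and is not part of require-ascii.

```python
# Hypothetical helper (not part of require-ascii): reports whether
# raw bytes can be decoded as UTF-8.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A single byte in the 128-255 range: valid cp437, invalid UTF-8.
lone_byte = bytes([0xE9])
# Two bytes in the 128-255 range that coincidentally form one
# valid UTF-8 sequence (U+00E9).
pair = bytes([0xC3, 0xA9])

print(is_valid_utf8(lone_byte))  # False
print(is_valid_utf8(pair))       # True

# Both byte strings decode without error as cp437, because cp437
# assigns a character to every byte value.
print(len(lone_byte.decode("cp437")), len(pair.decode("cp437")))  # 1 2

# Decoded as UTF-8, the pair is a single code point that is <= 255.
print([hex(ord(ch)) for ch in pair.decode("utf-8")])  # ['0xe9']
```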

This script will generate a bunch of files that contain valid extended ASCII but fail when tested by require-ascii:

# The README links to <https://theasciicode.com.ar/>. There are many different
# ways you could extend ASCII, but that site in particular says "In 1981,
# IBM developed an extension of 8-bit ASCII code, called 'code page 437'..."
extended_ascii = "cp437"

for code_point in range(128, 256):
	# Create a file that should pass require-ascii, but won't.
	with open(f"{code_point}.cp437.txt", mode='wb') as file:
		file.write(code_point.to_bytes(1, 'little'))
	# Make sure that that file really does contain valid extended ASCII.
	with open(f"{code_point}.cp437.txt", mode='rt', encoding=extended_ascii) as file:
		# This should cause a UnicodeDecodeError if file contains
		# invalid extended ASCII.
		file.read()

A more accurate description of require-ascii would be:

require-ascii

What it does

Requires that text files use UTF-8 and only use code points ≤ 255.
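
As an illustration of that proposed wording, the check it describes could be sketched like this. It is an assumption about the behaviour reported above, not the hook's actual implementation, and `passes_proposed_check` is a hypothetical name.

```python
def passes_proposed_check(data: bytes) -> bool:
    """Hypothetical check: data is valid UTF-8 and every code point is <= 255."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 at all, e.g. a lone cp437 byte in the 128-255 range.
        return False
    return all(ord(ch) <= 255 for ch in text)


print(passes_proposed_check(b"plain ASCII"))              # True
print(passes_proposed_check("caf\u00e9".encode("utf-8")))  # True: U+00E9 <= 255
print(passes_proposed_check("\u20ac".encode("utf-8")))     # False: U+20AC > 255
print(passes_proposed_check(bytes([0x80])))                # False: invalid UTF-8
```

Under this sketch, every file produced by the script above would be rejected, since each contains a single byte that is not valid UTF-8 on its own.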

Jayman2000, Apr 25 '22 20:04