
BCP 47 language escape sequences should support 3 characters

Open t-merz opened this issue 4 years ago • 10 comments

ISO 32000-2, 7.9.2.2.2 "Text string language escape sequences" contains the following text after NOTE 1:

The escape sequence shall consist of the following elements, in order: a) The Unicode value ESCAPE (U+001B) (that is, for strings encoded in UTF-16BE, the byte sequence 0 followed by 27; for strings encoded in UTF-8 the byte value 27). b) A 2- byte BCP 47 language code.

Item b) is wrong since BCP 47 is not restricted to 2-byte codes. In the unlikely case that this restriction is intentional, it should be pointed out and explained, as it would exclude all three-byte ISO 639-2 codes.
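To make the byte counts concrete, here is a minimal Python sketch of items a) and b) for a UTF-8 encoded string. It covers only the two quoted items, not the rest of the escape sequence, and the helper names `language_escape_prefix` and `fits_two_byte_rule` are hypothetical, not taken from any PDF library.

```python
ESC = 0x1B  # Unicode ESCAPE (U+001B); a single byte of value 27 in UTF-8


def language_escape_prefix(code: str) -> bytes:
    """Return item a) followed by item b): ESC, then the language code bytes.

    Assumption: the language code consists of ASCII letters, so each
    letter occupies exactly one byte in UTF-8.
    """
    return bytes([ESC]) + code.encode("ascii")


def fits_two_byte_rule(code: str) -> bool:
    """True if the code satisfies the literal "2-byte" wording of item b)."""
    return len(code.encode("ascii")) == 2


for code in ("en", "de", "fil"):
    # Print the code, the prefix bytes in hex, and whether it fits item b).
    print(code, language_escape_prefix(code).hex(), fits_two_byte_rule(code))
```

Under these assumptions "en" and "de" fit the two-byte wording, while a three-letter code such as "fil" needs three bytes and does not.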

t-merz avatar Dec 07 '21 09:12 t-merz

I have a vague memory of discussing this long ago... I believe it is intentionally restricted to 2-byte codes. @lrosenthol?

petervwyatt avatar Dec 07 '21 17:12 petervwyatt

Yes, the restriction is intentional and was primarily an attempt to move from "raw" ISO 639-1 to BCP-47.

lrosenthol avatar Jan 13 '22 20:01 lrosenthol

ISO 639-1 allows only two-letter codes, while BCP-47 additionally allows three-letter codes according to ISO 639-2 (plus other things).

Yes, the restriction is intentional and was primarily an attempt to move from "raw" ISO 639-1 to BCP-47.

What would be the reasoning behind moving from ISO 639-1 to the larger BCP-47 while at the same time restricting it to the previous ISO 639-1?

As a practical example, do you really want to lock out Filipino users? Filipino is "fil" in ISO 639-2 but has no code in ISO 639-1.

t-merz avatar Jan 13 '22 22:01 t-merz

@t-merz it's entirely about backwards compatibility, since 32K-1 (and earlier) only supported two-byte escape codes.

lrosenthol avatar Jan 14 '22 14:01 lrosenthol

@t-merz it's entirely about backwards compatibility, since 32K-1 (and earlier) only supported two-byte escape codes.

Huh? ISO 32000-2 cannot introduce something new because it wasn't supported in ISO 32000-1? All of PDF 2.0 covers previous versions (except a few items which are explicitly marked as deprecated) and adds new features. In the same spirit, PDF 2.0 can add support for three-byte language codes in addition to the two-byte codes that were supported in earlier versions.

And this isn't really brand-new stuff (https://en.wikipedia.org/wiki/ISO_639-2): "Work was begun on the ISO 639-2 standard in 1989, because the ISO 639-1 standard, which uses only two-letter codes for languages, is not able to accommodate a sufficient number of languages. The ISO 639-2 standard was first released in 1998."

Are you saying that Filipino ("fil") and many other languages cannot be expressed as the document language in PDF 2.0 in the 2020s?

t-merz avatar Jan 14 '22 14:01 t-merz

ISO 32000-2, 7.9.2.2.2 "Text string language escape sequences" contains the following text after NOTE 1:

The escape sequence shall consist of the following elements, in order: a) The Unicode value ESCAPE (U+001B) (that is, for strings encoded in UTF-16BE, the byte sequence 0 followed by 27; for strings encoded in UTF-8 the byte value 27). b) A 2- byte BCP 47 language code.

Some more arguments why I believe the restriction on 2-byte codes in the last line above is wrong and should be corrected:

  • Note 2 in 7.9.2.2.2 mentions "The complete list of codes defined by BCP 47..." without any restriction on 2-letter ISO 639-1 codes.

  • More importantly, ISO 32000-1 also supports three-letter codes in 14.9.2.2, since it references RFC 3066, which lists both two-letter ISO 639-1 codes and three-letter ISO 639-2 codes.

it's entirely about backwards compatibility, since 32K-1 (and earlier) only supported two-byte escape codes.

Not true, see above. Keeping the restriction "A 2- byte BCP 47 language code." uncorrected in ISO 32000-2 would actually create an incompatibility with ISO 32000-1.

  • BCP 47 (used in ISO 32000-2) comprises RFC 4647 and RFC 5646, where the latter is the second-generation successor of RFC 3066 (used in ISO 32000-1). All of these support both two-character and three-character language codes.
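The point about two- and three-character codes can be checked against the primary language subtag rule in RFC 5646, whose ABNF gives `2*3ALPHA` for the shortest ISO 639 code. The following Python sketch is only an illustration: the function names are hypothetical, and it inspects the primary subtag alone rather than validating full BCP 47 tags.

```python
import re

# RFC 5646 ABNF: the ISO 639 form of the primary language subtag is 2*3ALPHA.
_PRIMARY_2_OR_3 = re.compile(r"[A-Za-z]{2,3}")


def primary_subtag(tag: str) -> str:
    """Return the part of a language tag before the first hyphen."""
    return tag.split("-", 1)[0]


def has_iso639_primary(tag: str) -> bool:
    """True if the primary subtag is a two- or three-letter code."""
    return _PRIMARY_2_OR_3.fullmatch(primary_subtag(tag)) is not None


def is_bare_two_letter(tag: str) -> bool:
    """True only if the whole tag is exactly two letters (the "2-byte" case)."""
    return re.fullmatch(r"[A-Za-z]{2}", tag) is not None


for tag in ("en", "en-US", "fil", "fil-PH"):
    print(tag, has_iso639_primary(tag), is_bare_two_letter(tag))
```

All four sample tags have a valid two- or three-letter primary subtag, but only "en" is a bare two-letter code.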

t-merz avatar Jan 17 '22 16:01 t-merz

This needs ISO WG8 discussion - awaiting formation of JWGs

petervwyatt avatar May 19 '22 20:05 petervwyatt

PDF/UA TWG requested PDF TWG to discuss. PDF TWG acknowledges that this impacts UI strings that cannot use other mechanisms to express language (such as Lang keys) AND where the default document Lang is not used. Changing to 3-byte codes is viewed as a breaking change and so is too big for an erratum. This needs ISO WG8 discussion - awaiting formation of JWGs.

petervwyatt avatar Jun 09 '22 20:06 petervwyatt

Here is an overview of languages that require 3-byte codes: https://www.loc.gov/standards/iso639-2/php/code_list.php

DietrichSeggern avatar Jun 10 '22 11:06 DietrichSeggern

PDF/UA needs an outcome from the WG8 November 2022 meetings. Betsy will add this to the WG8 agenda.

petervwyatt avatar Sep 01 '22 20:09 petervwyatt