
Proposal: I18n: Support non-English hyphenation dictionaries

Open na4zagin3 opened this issue 4 years ago • 1 comments

This proposal adds support for hyphenation in non-English languages. It is the first step toward supporting internationalization.

Proposal

  • Add a new type:
    • hyphen-dict: a hyphenation dictionary (a set of hyphenation patterns). The underlying OCaml representation is LoadHyph.t.
  • Add new primitives:
    • load-hyphen-dict : string -> hyphen-dict
    • set-hyphen-dict : hyphen-dict -> ctx -> ctx
    • get-hyphen-dict : ctx -> hyphen-dict
  • Use a BCP 47 Language Tag or a UTS#35 Language Identifier for the filenames of hyphenation dictionary files.
    • The current hyphenation file english.satysfi-hyph needs to be renamed to en.satysfi-hyph.

load-hyphen-dict language loads a hyphenation dictionary from hyph/<language>.satysfi-hyph. It raises an exception when the file is not found.

set-hyphen-dict hyph ctx returns a copy of ctx in which ctx.hyphenation_pattern is set to the hyphenation dictionary hyph.

get-hyphen-dict ctx returns the hyphenation dictionary stored in ctx.hyphenation_pattern.
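The intended behaviour of the three primitives can be sketched in OCaml as follows. This is only a minimal model, not the actual SATySFi implementation: the record types hyphen_dict and ctx, the exception Hyphen_dict_not_found, the ~lib_root parameter, and the font_size field are all hypothetical stand-ins (the real dictionary type is LoadHyph.t, and the real context has many more fields, possibly with a different field name for the dictionary).

```ocaml
(* Hypothetical stand-ins for LoadHyph.t and the real context record. *)
type hyphen_dict = { language : string; path : string }

type ctx = { hyphenation_pattern : hyphen_dict; font_size : float }

(* Hypothetical exception; the proposal only says "raises an exception". *)
exception Hyphen_dict_not_found of string

(* Maps a language tag to hyph/<language>.satysfi-hyph under lib_root. *)
let dict_path ~lib_root language =
  Filename.concat (Filename.concat lib_root "hyph") (language ^ ".satysfi-hyph")

(* load-hyphen-dict : string -> hyphen-dict
   Only checks that the file exists; parsing the patterns is omitted. *)
let load_hyphen_dict ~lib_root language =
  let path = dict_path ~lib_root language in
  if Sys.file_exists path then { language; path }
  else raise (Hyphen_dict_not_found path)

(* set-hyphen-dict : hyphen-dict -> ctx -> ctx
   Returns an updated copy; the original context is left unchanged. *)
let set_hyphen_dict dict ctx = { ctx with hyphenation_pattern = dict }

(* get-hyphen-dict : ctx -> hyphen-dict *)
let get_hyphen_dict ctx = ctx.hyphenation_pattern
```

Because set_hyphen_dict returns a new context instead of mutating one, a document could switch dictionaries for a single passage and keep the outer context untouched.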

Current Implementation

  • English hyphenation is located at lib-satysfi/dist/hyph/english.satysfi-hyph
  • english.satysfi-hyph is loaded at https://github.com/gfngfn/SATySFi/blob/1243829f9dcaf955e4ba0f5222a0f95b34e74e32/src/frontend/primitives.cppo.ml#L604
  • The only operation which sets hyphenation_dictionary is get_pdf_mode_initial_context at https://github.com/gfngfn/SATySFi/blob/1243829f9dcaf955e4ba0f5222a0f95b34e74e32/src/frontend/primitives.cppo.ml#L497

Alternative Options

Activate multiple hyphen-dicts at the same time

This proposal is based on a design in which users replace the English hyphenation pattern with another language's. If we decide to extend the multi-language system, in which English and Japanese are automatically detected by script type, it may be more natural to set a hyphenation dictionary for each language/script (i.e., set-hyphen-dict : language-tag -> hyphen-dict -> ctx -> ctx or set-hyphen-dict : hyphen-dict language-tag-map -> ctx -> ctx) rather than applying a given hyphenation pattern globally.
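A rough OCaml sketch of the per-language alternative, assuming language tags are plain strings and the context holds a map from tag to dictionary. The names TagMap, hyphen_dicts, find_hyphen_dict, and the simplified hyphen_dict and ctx records are hypothetical and chosen only for illustration.

```ocaml
(* Language tags such as "en" or "ja" are modelled as plain strings. *)
module TagMap = Map.Make (String)

(* Hypothetical stand-in for LoadHyph.t. *)
type hyphen_dict = { patterns : string list }

(* Hypothetical context: one dictionary per language tag. *)
type ctx = { hyphen_dicts : hyphen_dict TagMap.t; font_size : float }

(* set-hyphen-dict : language-tag -> hyphen-dict -> ctx -> ctx
   Registers (or replaces) the dictionary for one language only. *)
let set_hyphen_dict tag dict ctx =
  { ctx with hyphen_dicts = TagMap.add tag dict ctx.hyphen_dicts }

(* Looks up the dictionary for a detected language, if one is registered. *)
let find_hyphen_dict tag ctx = TagMap.find_opt tag ctx.hyphen_dicts
```

With this shape, setting a dictionary for one language leaves the dictionaries of other languages active, which is the main difference from the single global dictionary of the main proposal.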

Introducing new type hyphen-dict

Instead of introducing hyphen-dict and having users handle hyphenation dictionaries explicitly, we could provide primitives that get and set strings representing languages (e.g., set-hyphen-dict : string -> ctx -> ctx).

However, a hyphen-dict type leaves room for more extension points in the future (e.g., tweaking hyphenation patterns or adding exceptional words ad hoc).

load-hyphen-dict throwing exceptions

load-hyphen-dict could instead have the signature load-hyphen-dict : string -> hyphen-dict option. I don't have a strong opinion about this. I was thinking of having a new package for each language, so specifying a wrong filename is unlikely.
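For comparison, the option-returning variant might look like the following OCaml sketch. The record hyphen_dict, the ~hyph_dir parameter, and the helper describe are hypothetical; only the shape string -> hyphen-dict option comes from the text above.

```ocaml
(* Hypothetical stand-in for LoadHyph.t. *)
type hyphen_dict = { language : string; path : string }

(* load-hyphen-dict : string -> hyphen-dict option
   Returns None instead of raising when the file is missing. *)
let load_hyphen_dict_opt ~hyph_dir language =
  let path = Filename.concat hyph_dir (language ^ ".satysfi-hyph") in
  if Sys.file_exists path then Some { language; path } else None

(* The caller is forced to handle the missing-file case explicitly. *)
let describe ~hyph_dir language =
  match load_hyphen_dict_opt ~hyph_dir language with
  | Some dict -> "loaded " ^ dict.path
  | None -> "no dictionary for " ^ language
```

The trade-off is that every call site has to pattern-match on the result, which may be unnecessary overhead if dictionaries ship in per-language packages and a missing file is rare.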

Having a primitive to get available hyphenation dictionary files

I could include another primitive get-hyph-dict-list that returns available files under hyph/ (for example, returning [ "en" ]). This primitive is not mandatory.
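A possible shape for such a primitive, sketched in OCaml under the assumption that available dictionaries are exactly the files in the hyph/ directory ending in .satysfi-hyph. The helper tags_of_filenames and the hyph_dir parameter are hypothetical.

```ocaml
let suffix = ".satysfi-hyph"

(* Keeps only dictionary files and strips the extension,
   e.g. "en.satysfi-hyph" becomes "en". The result is sorted. *)
let tags_of_filenames names =
  names
  |> List.filter (fun name -> Filename.check_suffix name suffix)
  |> List.map (fun name -> Filename.chop_suffix name suffix)
  |> List.sort compare

(* get-hyph-dict-list: language tags found under hyph_dir,
   or the empty list when the directory does not exist. *)
let get_hyph_dict_list hyph_dir =
  if Sys.file_exists hyph_dir && Sys.is_directory hyph_dir then
    tags_of_filenames (Array.to_list (Sys.readdir hyph_dir))
  else []
```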

Renaming english.satysfi-hyph for en.satysfi-hyph

We could leave the filename as is. However, considering that even TeX has already adopted a naming scheme based on BCP 47 Language Tags, there is no reason to stick to the traditional naming scheme that uses English language names.

na4zagin3, Apr 30 '20 06:04

May I consider this proposal approved? If so, I’ll work on this after the refactoring is done.

na4zagin3, Nov 22 '20 19:11