Default language for.. say metadata.title?

Open jccr opened this issue 4 years ago • 14 comments

Given I have a publication with a title in metadata like this:

{
  "metadata": {
    "title": {
      "fr": "Vingt mille lieues sous les mers",
      "en": "Twenty Thousand Leagues Under the Sea",
      "ja": "海底二万里"
    }
  }
}

What would the default language be if all I want is just any string, without having a localization preference? Would it be the first in the "list", i.e. the value of "fr"?

If so.. the order of the keys might be a problem.

jccr avatar Mar 19 '20 20:03 jccr

IMO there is no default language embedded in the publication. There is instead a preferred language (or a list of) in the reading app.

llemeurfr avatar Mar 19 '20 20:03 llemeurfr

I'm looking at this from the Shared Models API perspective.

I'm trying to deal with two types of data, for example in TypeScript:

interface LocalizedString {
  [key: string]: string
}

interface Metadata {
  title: string | LocalizedString
}

When I want to grab a value from title... I have to deal with the union type first with some "unwrapping" code that IMO is too cumbersome.

Ideally I think I want this:

interface Metadata {
  title: LocalizedString
}

Where all data is normalized to that structure.

{
  "metadata": {
    "title": {
      "fr": "Vingt mille lieues sous les mers",
      "en": "Twenty Thousand Leagues Under the Sea",
      "ja": "海底二万里"
    }
  }
}

would work fine as is, and would fit into LocalizedString nicely.

But.. what about the case if the data is just a simple bare string? Like this:

{
  "metadata": {
    "title": "Twenty Thousand Leagues Under the Sea"
  }
}

In my interface design it would end up being parsed like this:

title = {
  "": "Twenty Thousand Leagues Under the Sea"
}

Still ugly.. but it's normalized (is it better? I'm asking myself)
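
A minimal sketch of that normalization step, assuming the LocalizedString interface above. The names normalizeTitle and UNKNOWN_LANGUAGE are hypothetical and only illustrate the idea; the empty-string key is the placeholder used in the example above, not a specified value.

```typescript
// Same shape as the LocalizedString interface shown earlier.
interface LocalizedString {
  [key: string]: string;
}

// Hypothetical placeholder key for "language not known".
const UNKNOWN_LANGUAGE = "";

// Hypothetical helper: turns either accepted JSON form into a LocalizedString.
function normalizeTitle(raw: string | LocalizedString): LocalizedString {
  // A bare string is wrapped under the placeholder key.
  if (typeof raw === "string") {
    return { [UNKNOWN_LANGUAGE]: raw };
  }
  // A language map is shallow-copied so the parsed input is not shared.
  return { ...raw };
}

const fromBareString = normalizeTitle("Twenty Thousand Leagues Under the Sea");
const fromLanguageMap = normalizeTitle({
  fr: "Vingt mille lieues sous les mers",
  en: "Twenty Thousand Leagues Under the Sea",
});
```

With this, consumers only ever see the map form, so the union type no longer leaks into calling code.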

jccr avatar Mar 19 '20 20:03 jccr

Alright, I'm now leaning towards your suggestion @llemeurfr

jccr avatar Mar 19 '20 22:03 jccr

Still would like to draft up a design for a convenient API though, and IMO it's easier with normalization of the data.

jccr avatar Mar 19 '20 22:03 jccr

@jccr have you looked at the APIs in the Swift version?

HadrienGardeur avatar Mar 19 '20 22:03 HadrienGardeur

@HadrienGardeur I have actually. I'll go back and iterate my thoughts on that approach too.

jccr avatar Mar 19 '20 22:03 jccr

Actually the Kotlin version is more up-to-date now. But thank you for raising this issue, we improvised a bit there when this should be specified and shared among platforms.

Here's how it works on Kotlin:

  • We normalize the JSON to a LocalizedString object holding a Map<String?, Translation>.
    • LocalizedString.Translation only contains a String for now, but could be extended to support text direction, for example.
  • If we don't know the language, then the key can be null (e.g. when parsing a RWPM). But with EPUB, we try to use the xml:lang attribute, or fall back on the publication's language (@qnga might chime in on this).
  • When serializing LocalizedString to JSON, if a key is null then we use the BCP-47 language code und, which is made for that.

    The 'und' (Undetermined) primary language subtag identifies linguistic content whose language is not determined. IETF

  • In the shared model, we decided to offer a simple API, considering that most reading apps won't care about the translations (the test app doesn't use it, for example). Therefore, for Metadata.title, we actually have two properties:
    • localizedTitle which is the LocalizedString object.
    • title which is an alias to localizedTitle.string.
    • This choice was also guided by the need to stay backward-compatible with the previous API.
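
The null-to-und serialization rule described above can be sketched as follows. This is a TypeScript approximation of the Kotlin model, not the actual implementation: Translation, Translations, UNDETERMINED and toJSON are assumed names chosen for illustration.

```typescript
// Assumed mirror of LocalizedString.Translation: only a string for now.
interface Translation {
  string: string;
}

// Assumed mirror of Map<String?, Translation>; a null key means "language unknown".
type Translations = Map<string | null, Translation>;

// BCP-47 primary language subtag for undetermined language.
const UNDETERMINED = "und";

// Hypothetical serializer producing a JSON language map.
function toJSON(translations: Translations): Record<string, string> {
  const json: Record<string, string> = {};
  translations.forEach((translation, language) => {
    // A null key cannot be a JSON object key, so it is written as "und".
    json[language ?? UNDETERMINED] = translation.string;
  });
  return json;
}

const serialized = toJSON(
  new Map<string | null, Translation>([
    [null, { string: "Twenty Thousand Leagues Under the Sea" }],
    ["fr", { string: "Vingt mille lieues sous les mers" }],
  ])
);
```

One consequence of this sketch: if a map held both a null key and an explicit "und" key, the later entry would overwrite the earlier one on output, so a real implementation would need to decide how to treat that case.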

Here's the API of LocalizedString:

  • (property) translations: Map<String?, Translation>
    • Provides direct access to the translations map.
  • getOrFallback(language: String?): Translation?
    • Returns the translation matching the given BCP-47 tag.
    • If not found (or if no language code is given), falls back on these language codes, in order:
      1. the default user locale
      2. null
      3. und
      4. en
      5. or the first translation found in the map (this might be a problem since maps are not ordered)
  • (property) defaultTranslation: Translation? = getOrFallback(null)
  • (property) string: String = defaultTranslation.string
  • (static) fromJSON(json): LocalizedString?
    • Creates a LocalizedString from a JSON string or JSON BCP-47 language map.
  • (static) fromString(strings: Map<String?, String>): LocalizedString
    • Creates a LocalizedString from a map of strings. It's convenient when parsing a package.
  • There are some additional APIs to help build or modify a LocalizedString, since it is immutable.
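
The fallback order listed above can be sketched in TypeScript as follows. This is only an approximation of the behaviour described, not the Kotlin code: the function signature, the userLocale parameter (standing in for "the default user locale") and the sample titles map are assumptions.

```typescript
// Assumed mirror of LocalizedString.Translation.
interface Translation {
  string: string;
}

// Hypothetical free-function version of getOrFallback.
function getOrFallback(
  translations: Map<string | null, Translation>,
  language: string | null,
  userLocale: string // assumption: the caller supplies the default user locale
): Translation | undefined {
  // Fallback keys in the documented order: user locale, null, und, en.
  const candidates: Array<string | null> = [userLocale, null, "und", "en"];
  // The requested language is only tried when one was actually given.
  if (language !== null) {
    candidates.unshift(language);
  }
  for (let i = 0; i < candidates.length; i++) {
    const found = translations.get(candidates[i]);
    if (found !== undefined) {
      return found;
    }
  }
  // Last resort: the first entry. A JavaScript Map iterates in insertion
  // order, so this is the first translation that was added.
  const first = translations.values().next();
  return first.done ? undefined : first.value;
}

const titles = new Map<string | null, Translation>([
  ["fr", { string: "Vingt mille lieues sous les mers" }],
  ["en", { string: "Twenty Thousand Leagues Under the Sea" }],
  ["ja", { string: "海底二万里" }],
]);

// No language requested and an unmatched user locale: falls through to "en".
const defaultTitle = getOrFallback(titles, null, "de");
```

Passing the user locale as a parameter keeps the lookup free of global state, at the cost of one more argument at each call site.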

So as you can see, metadata.title is actually an alias to metadata.localizedTitle.getOrFallback(null).string, which ideally returns the translation matching the user's locale. Which matches what @llemeurfr said:

IMO there is no default language embedded in the publication. There is instead a preferred language (or a list of) in the reading app.

One thing we might want to discuss is the heuristics to decide how to fall back on the default translation. It would be nice to be able to use the publication's first language instead of null or en, but we don't have access to it in LocalizedString, unless we provide it at construction.

mickael-menu avatar Mar 20 '20 09:03 mickael-menu

If we don't know the language, then the key can be null (e.g. when parsing a RWPM). But with EPUB, we try to use the xml:lang attribute, or fall back on the publication's language (@qnga might chime in on this).

Sure, I can chime in. I think I already suggested somewhere to drop this fallback on the publication's language. This behaviour looks like an unjustified and unnecessary assertion, since RWPM supports an unspecified language. When directly parsing a RWPM title with no specified language, no such assertion is made, and as far as I know, this interpretation is in no way favoured by the EPUB specification.

qnga avatar Mar 20 '20 10:03 qnga

I think I already suggested somewhere to drop this fallback on the publication's language.

I agree with you, and it would lead to simpler parsing. I think only the Kotlin implementation falls back on the publication language right now.

mickael-menu avatar Mar 20 '20 11:03 mickael-menu

In the TypeScript implementation, for "contributors" metadata (e.g. author), as well as for title and subtitle metadata, we use the underscore _ pseudo-language-key as a fallback when "alternative scripts" are declared in the package OPF (as per the EPUB3 definition) and the parser cannot determine the language of the string from the xml:lang attribute (on the meta itself, or on the package OPF root element); failing that, we use the "primary" package OPF meta language instead (i.e. "primary" = first item in the array). Obviously, _ is not a great solution, so I will migrate to und instead. Thanks Mickael for pointing this out.

Current parser algo inspired from: https://github.com/readium/architecture/blob/master/streamer/parser/metadata.md#title

danielweck avatar Mar 20 '20 12:03 danielweck

Man! I was looking for something like und

Thanks for the analysis, everyone! 👍

jccr avatar Mar 20 '20 21:03 jccr

I think I already suggested somewhere to drop this [language of the publication] fallback on the publication's language.

This is exactly what I did myself in the Go implementation for the LCP server when parsing W3C Manifests: with that fallback, the low-level JSON unmarshalling of a Localizable string would rely on a global variable (the global language of the publication), and this would lead to a terrible implementation.

As qnga said, in EPUB the language of the publication (which may be multiple) is not directly related to the language of its metadata.

W3C Publications are slightly different because there are two different properties: inLanguage for the publication and a top level language (here) for the manifest, -> metadata. But we can be pretty certain that the latter will not be used before long, and there is no corresponding property in the RWPM.

In conclusion, I think we can rephrase Mickaël's wording as: If we don't know the language (because the property is expressed as a plain string), then the key is "und".

llemeurfr avatar Apr 30 '20 14:04 llemeurfr

W3C Publications are slightly different because ...

For all intents and purposes, isn't EPUB OPF's xml:lang the same as W3C WebPub's @context language? (and EPUB OPF's metadata dc:language the same as W3C WebPub's inLanguage)

danielweck avatar Apr 30 '20 14:04 danielweck

@danielweck you're right, xml:lang in EPUB has the same use as the @context / language in JSON-LD.

llemeurfr avatar Apr 30 '20 14:04 llemeurfr