
Should we add `translated-title` and `transliterated-title` to the objectified title?

Open denismaier opened this issue 5 years ago • 48 comments

As a follow-up on converting titles into objects, I think we should discuss whether there is any value in adding alternate forms (translated or transliterated title forms) to these title objects. Maybe so:

title:
  main: "Война и миръ"
alternate:
  - main: "War and peace"
    type:  "translated"
  - main: "Vojna i mir"
    type: "transliteration"

Or just so:

title:
  main: "Война и миръ"
  translated:
    main: "War and peace"
  transliterated:
    main: "Vojna i mir"

denismaier avatar Jul 22 '20 21:07 denismaier

@fbennett - you know more about this area than us. Thoughts?

bdarcus avatar Jul 22 '20 23:07 bdarcus

There are various transliteration schemes (Roman, Cyrillic, many others), and some languages require a native transliteration for basic sorting (hiragana in Japanese; Taiwan sorts Unicode glyphs directly, but I'm not sure what they do in the PRC). So you need to provide for multiple transliterations, and key them by ID. BCP 47/RFC 5646 is a robust scheme covering pretty much everything. It can be validated loosely (regexp-wise) or tightly (by extending a regexp-wise scheme with a controlled list of allowed values; here's the raw data of the registry, batteries not included).
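A minimal sketch of the "loose" regexp-wise validation, in Python. The pattern and the helper name `looks_like_language_tag` are illustrative assumptions, not part of any CSL or Jurism code; the check only tests the overall shape of a tag and does not consult the registry.

```python
import re

# Loose, shape-only check: a 2-8 letter primary subtag followed by
# hyphen-separated alphanumeric subtags of 1-8 characters each.
# It does not consult the subtag registry, so unregistered subtags still pass.
LOOSE_TAG = re.compile(r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*")


def looks_like_language_tag(tag: str) -> bool:
    """Return True if `tag` has the general shape of a BCP 47 language tag."""
    return LOOSE_TAG.fullmatch(tag) is not None


print(looks_like_language_tag("ja-Hira"))     # hyphen-separated subtags
print(looks_like_language_tag("ja_Hira"))     # underscore is not a valid separator
```

Tight validation would additionally check each subtag against a controlled list built from the registry data.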

Translation may be into multiple languages also, so the same applies there.

Also, the language of (what I call) the "headline" field may differ from that of the item. In a scheme that pre-parses titles into main, sub, and short elements, there would be a design decision over whether to apply a headline-field language to the entire set, or to the individual elements separately. There would also be the question of whether to require full-parallel translation/transliteration of all sub-fields within the structured title (i.e., whether to allow a French version of the main title without also requiring a French version of the subtitle, and so forth).

Whatever scheme is applied to a structured title should also be available on some text fields, and on creator fields with their different structure. A "CSL JSON" export from Jurism would show the structures I've come up with over there, for what that's worth.

fbennett avatar Jul 23 '20 01:07 fbennett

@fbennett Can you post a Jurism CSL JSON export with some translated/transliterated fields or point to a test with some? I'm not familiar enough with the Jurism GUI to make such an item quickly.

Also @fbennett could you explain what the source>target structure for field languages conveys?

On the one hand, a single translated-title field accomplishes the needs of most citation styles and is easy for styles, applications, and processors to implement.

On the other hand, implementing translated and transliterated forms of all fields allows for more robust multilingual support and handling of things like native Japanese sorting by hiragana. If we did something like this, I would suggest we make specific translated and transliterated elements of relevant fields, with an indication of the language. These elements would be understood as translations/transliterations of the fields, not of the item content. Handling transliteration/translation across the board would be a much bigger lift in terms of applications, styles, and processors (e.g., we would likely need something like the CSLm cs:alternative to handle general rendering of field translations; the GUI for multilingual fields in Jurism is much more complex than the corresponding GUI in Zotero). I would suggest that we make these and other expanded multilingual features a discrete CSL module. A processor might choose to support those features or not and can explicitly declare so.

bwiernik avatar Jul 29 '20 22:07 bwiernik

For Frank's question of language codes applied to sub-elements of a title, I think we could adopt a general inheritance mechanism. If a subfield lacks a language code, it inherits from the parent field; if a field lacks a language code, it inherits from the item.
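A minimal sketch of that inheritance rule in Python. The dict layout (a `language` key on the item, on the field object, and on each subfield object) and the function name `effective_language` are assumptions made for illustration; nothing here is an agreed CSL JSON schema.

```python
from typing import Optional


def effective_language(item: dict, field: str, subfield: Optional[str] = None) -> Optional[str]:
    """Resolve a language code: subfield first, then field, then item."""
    field_value = item.get(field)
    if isinstance(field_value, dict):
        if subfield is not None:
            sub_value = field_value.get(subfield)
            if isinstance(sub_value, dict) and sub_value.get("language"):
                return sub_value["language"]
        if field_value.get("language"):
            return field_value["language"]
    # Plain string fields, or objects without a code, fall back to the item.
    return item.get("language")


# Hypothetical item: German item, English title, French subtitle.
item = {
    "language": "de",
    "title": {
        "language": "en",
        "main": {"value": "An English title"},
        "sub": {"value": "un sous-titre", "language": "fr"},
    },
    "publisher": "Verlag",
}

print(effective_language(item, "title", "sub"))
print(effective_language(item, "title", "main"))
print(effective_language(item, "publisher"))
```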

bwiernik avatar Jul 29 '20 23:07 bwiernik

Can you post a Jurism CSL JSON export with some translated/transliterated fields or point to a test with some?

Here is the data behind this citation:

Ume Kenjirō (梅謙次郎), Commentary on the Civil Code [民法要義], 5 vols. (Tokyo: Yuhikaku Publications, 1898).

[
    {
        "type": "book",
        "multi": {
            "main": {
                "event-place": "ja",
                "publisher": "ja",
                "publisher-place": "ja",
                "title": "ja"
            },
            "_keys": {
                "event-place": {
                    "en": "Tokyo",
                    "ja-alalc97": "Tōkyō",
                    "ja-Hira": "とうきょう"
                },
                "publisher": {
                    "en": "Yuhikaku Publications",
                    "ja-alalc97": "Yūhikaku shobō",
                    "ja-Hira": "ゆうひかくしょぼう"
                },
                "publisher-place": {
                    "en": "Tokyo",
                    "ja-alalc97": "Tōkyō",
                    "ja-Hira": "とうきょう"
                },
                "title": {
                    "en": "Commentary on the Civil Code",
                    "ja-alalc97": "Minpō yōgi",
                    "ja-Hira": "みんぽうようぎ"
                }
            }
        },
        "event-place": "東京",
        "language": "ja",
        "number-of-volumes": "5",
        "publisher": "有斐閣書房",
        "publisher-place": "東京",
        "title": "民法要義",
        "author": [
            {
                "family": "梅",
                "given": "謙次郎",
                "multi": {
                    "_key": {
                        "en": {
                            "family": "Ume",
                            "given": "Kenjiro"
                        },
                        "ja-Hira": {
                            "family": "うめ",
                            "given": "けんじろう"
                        },
                        "ja-alalc97": {
                            "family": "Ume",
                            "given": "Kenjirō"
                        }
                    },
                    "main": "ja"
                }
            }
        ],
        "issued": {
            "date-parts": [
                [
                    "1898"
                ]
            ]
        }
    }
]
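To show how a consumer might read this structure, here is a small Python sketch that looks up a variant of an ordinary field under `multi._keys` and falls back to the headline value. The helper name `field_variant` is hypothetical, and the fallback behaviour is an assumption for illustration, not a description of how citeproc-js resolves variants.

```python
def field_variant(item: dict, field: str, tag: str) -> str:
    """Return the variant of `field` keyed by language tag `tag`, else the headline value."""
    variants = item.get("multi", {}).get("_keys", {}).get(field, {})
    return variants.get(tag, item[field])


# Trimmed copy of the export above, keeping only the title.
item = {
    "title": "民法要義",
    "multi": {
        "main": {"title": "ja"},
        "_keys": {
            "title": {
                "en": "Commentary on the Civil Code",
                "ja-alalc97": "Minpō yōgi",
            }
        },
    },
}

print(field_variant(item, "title", "ja-alalc97"))
print(field_variant(item, "title", "fr"))  # no French variant, so the headline value
```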

fbennett avatar Jul 29 '20 23:07 fbennett

could you explain what the source>target structure for field languages conveys?

Jurism recognizes a vector in the Language field:

ja>en or en<ja

In those variables, the language code is mapped to the (English) name of the respective languages.
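A rough Python sketch of splitting such a vector. It assumes the arrow points from the original language to the language of the item at hand (so `ja>en` and `en<ja` both mean a Japanese original in English); the function name `parse_language_vector` is hypothetical and this is not Jurism's own parser.

```python
from typing import Optional, Tuple


def parse_language_vector(value: str) -> Tuple[Optional[str], str]:
    """Split 'ja>en' or 'en<ja' into (source, target); a plain code has no source."""
    if ">" in value:
        source, target = value.split(">", 1)
    elif "<" in value:
        target, source = value.split("<", 1)
    else:
        return None, value.strip()
    return source.strip(), target.strip()


print(parse_language_vector("ja>en"))
print(parse_language_vector("en<ja"))
```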

fbennett avatar Jul 30 '20 00:07 fbennett

Thanks @fbennett

Jurism recognizes a vector in the Language field

So this is for rendering the language in citations? So to say “In English” or “Translated from Japanese”?

bwiernik avatar Jul 30 '20 00:07 bwiernik

Yes, the variables are available in citations. We've used them for translated legal documents in theses, where the original has been destroyed or is no longer available.

fbennett avatar Jul 30 '20 00:07 fbennett

Jurism recognizes a vector in the Language field

So this is for rendering the language in citations? So to say “In English” or “Translated from Japanese”?

If I remember correctly, this can also be used for conditional rendering based on the language of the current document. Like, your item is en>fr, which means it's a French translation of an English item. If you're now writing an article in English, you can choose to only render the information about the English original, but omit information about the translation into French, which you don't really need in an English article. But if you're writing a French article you'll want to include information about the original and the translation in your citations. @fbennett Is that correct?

denismaier avatar Jul 30 '20 10:07 denismaier

I would suggest that we make these and other expanded multilingual features a discrete CSL module.

That sounds like a good approach. Perhaps you could elaborate a bit more how you think this might work?

Some time ago, @cormacrelf envisioned introducing syntax for enabling/disabling certain features or sets of features: https://discourse.citationstyles.org/t/csl-1-2-planning/1476/7 Multilingual support could fit into this.

denismaier avatar Jul 30 '20 10:07 denismaier

@fbennett Why is it that you have two multi objects in your CSL JSON export, one under author, and the other as a top level object containing standard variables? What's the reason for this? Is this better than just having one multi object at the top? Or one multi object under each variable?

denismaier avatar Jul 30 '20 10:07 denismaier

On the other hand, implementing translated and transliterated forms of all fields allows for more robust multilingual support and handling of things like native Japanese sorting by hiragana. If we did something like this, I would suggest we make specific translated and transliterated elements of relevant fields, with an indication of the language.

We could either adopt the current CSLm JSON or simplify a bit to something like:

title: An English title
title--de: Ein englischer Titel

Something like this has been on the table anyway, see https://juris-m.readthedocs.io/en/latest/dev-sync-simplification.html

Handling transliteration/translation across the board would be a much bigger lift in terms of applications, styles, and processors (e.g., we would likely need something like the CSLm cs:alternative to handle general rendering of field translations; the GUI for multilingual fields in Jurism is much more complex than the corresponding GUI in Zotero).

Strictly speaking, cs:alternative is not necessary as you can use transliterations and translations in Jurism even with CSL 1.0.1 styles without adjusting the styles. This is currently done by processor directives:

>>===== LANGPARAMS =====>>
{
    "institutions": [
        "orig"
    ],
    "persons": [
        "orig"
    ],
    "titles": [
        "orig",
        "translat"
    ],
    "journals": [
        "orig"
    ],
    "places": [
        "orig"
    ],
    "publishers": [
        "orig"
    ]
}
<<===== LANGPARAMS =====<<

So, this will instruct citeproc-js to use the original variables for all types of variables, but for title variables it will also use the translated variant.
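As a rough illustration of what such a directive expresses, the Python sketch below picks the forms of a field in the order the directive lists them. Only the slot names `orig` and `translat` and the category names come from the block above; the function `forms_to_render`, the shape of `available`, and the default to `orig` are assumptions, and none of this reflects citeproc-js internals.

```python
from typing import Dict, List


def forms_to_render(langparams: Dict[str, List[str]], category: str,
                    available: Dict[str, str]) -> List[str]:
    """List the forms of one field in the order the directive names them."""
    slots = langparams.get(category, ["orig"])
    return [available[slot] for slot in slots if slot in available]


langparams = {"persons": ["orig"], "titles": ["orig", "translat"]}
title_forms = {"orig": "民法要義", "translat": "Commentary on the Civil Code"}
person_forms = {"orig": "梅", "translat": "Ume"}

print(forms_to_render(langparams, "titles", title_forms))
print(forms_to_render(langparams, "persons", person_forms))
```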

denismaier avatar Jul 30 '20 11:07 denismaier

I don't think we should do full ML anytime soon; certainly not for 1.1.

Really, the question I had for 1.1 is whether we should move variables like translated-title to the title object, without otherwise modifying them.

bdarcus avatar Jul 30 '20 11:07 bdarcus

I don't think we should do full ML anytime soon; certainly not for 1.1.

Agreed. There are three factors I'm considering.

  1. A separate translated-title variable covers most citation needs and is clear and simple to implement—it's just another title variable and it can be called like any other variable. That simplicity has value.
  2. A translated slot in the title object would either require additional forms (form="translated", form="translated-short", form="translated-main", form="translated-sub") or perhaps a new attribute (translated="true"). The new attribute might be better (e.g., a test could be <if variable="title" translated="true">).
  3. If we did create the option for full ML at some point, then the title object option makes that easily extensible. The translated slot can then be an object with elements marked by their locale.

With these considerations, making translated a part of the title object might be the more future-proof option. I would suggest just translated, rather than also transliterated (in a full ML solution, both could be represented using different locale codes).

If we did move translated to the title object, I suggest we pull the separate variable from v1.0.2.

bwiernik avatar Jul 30 '20 13:07 bwiernik

If I remember correctly, this can also be used for conditional rendering based on the language of the current document. Like, your item is en>fr, which means it's a French translation of an English item. If you're now writing an article in English, you can choose to only render the information about the English original, but omit information about the translation into French…

I don't think we should adopt this. With the @related structure, we provide a more formal and integrated way of referring to original item information. If you are citing a translation, you should always cite it as a translation. If you want to cite the original, then cite that as a separate item instead.

bwiernik avatar Jul 30 '20 13:07 bwiernik

We can discuss multilingual data structures in another thread, but my inclination would be for all of this to occur at the field-level. So, any field might be object with value, language, and translated elements. The translated element would be an array with elements holding value and language elements. Subordinate elements without a language would inherit language from their parent. That would have 3 benefits:

  1. It would permit simple indication that a field is a different language than the item (e.g., an English article published in a German journal).
  2. It would jibe with https://juris-m.readthedocs.io/en/latest/dev-sync-simplification.html
  3. It would provide a consistent structure for providing translations of one or more fields for an item.
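A small Python sketch of how such a field-level object could be read. The key names `value`, `language`, and `translated` follow the description above, but the exact layout and the helper `value_in_language` are assumptions for illustration only.

```python
from typing import Optional, Union


def value_in_language(field: Union[str, dict], wanted: str, item_language: str) -> Optional[str]:
    """Return the field's text in `wanted`, checking the value first, then its translations."""
    if isinstance(field, str):
        # A plain string carries no language of its own; it inherits the item's.
        return field if item_language == wanted else None
    own_language = field.get("language", item_language)
    if own_language == wanted:
        return field["value"]
    for alt in field.get("translated", []):
        if alt.get("language", own_language) == wanted:
            return alt["value"]
    return None


# Hypothetical field: an English title on an item whose language is German.
title = {
    "value": "An English title",
    "language": "en",
    "translated": [{"value": "Ein englischer Titel", "language": "de"}],
}

print(value_in_language(title, "de", item_language="de"))
print(value_in_language(title, "en", item_language="de"))
```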

bwiernik avatar Jul 30 '20 14:07 bwiernik

If we did move translated to the title object, I suggest we pull the separate variable from v1.0.2.

My impulse is we should do this. The only reason not to, I think, would be if it presented some future barrier to fuller ML support.

@denismaier - thoughts?

We can discuss multilingual data structures in another thread ...

Maybe take this comment and turn it into an issue ("reference in new issue"), for future reference?

bdarcus avatar Jul 30 '20 14:07 bdarcus

if it presented some future barrier to fuller ML support

I think it would be the opposite; doing it would make fuller ML support easier.

Maybe take this comment and turn it into an issue ("reference in new issue"), for future reference?

Cool! Didn't know that button existed.

bwiernik avatar Jul 30 '20 14:07 bwiernik

A translated slot in the title object would either require additional forms (form="translated", form="translated-short", form="translated-main", form="translated-sub") or perhaps a new attribute (translated="true"). The new attribute might be better (e.g., a test could be <if variable="title" translated="true">).

So are we talking a PR with this:

title:
  translated: foo
  main: bar

... or this?

title:
  translated: 
    main: foo
  main: bar

I guess the latter?

And then remove the translated-title variables from v1.0.2, and finally add a new attribute to access them in styles.

Maybe, per @denismaier's initial impulse, we call it alternate or variant; or even, to be more specific, language-alternate?

That would give more future flexibility, should we possibly need it.

bdarcus avatar Jul 30 '20 15:07 bdarcus

The second option. language-alternate sounds good.

bwiernik avatar Jul 30 '20 16:07 bwiernik

@fbennett Why is it that you have two multi objects in your CSL JSON export, one under author, and the other as a top level object containing standard variables? What's the reason for this? Is this better than just having one multi object at the top? Or one multi object under each variable?

The aim was (and is) to maintain compatibility with CSL-JSON to the extent possible. Ordinary fields are strings, so it's not possible to give them a sub-field without changing the data type. Creator variables are already objects, so a subfield can be added without changing the data type; and since creator fields are dynamic, it makes sense to tie the variants to each name instance---and CSLm-JSON just reflects that structure, which keeps exports simple.

fbennett avatar Jul 30 '20 16:07 fbennett

Yes, we should add translated-title to the title object, rename accordingly, and remove from 1.0.2. That's a good move.

language-alternate sounds good. Or what about language-alternative?

In terms of structure, it should mirror the standard structure of title variables, so:

title:
  main: A title
  sub: with a subtitle
  language-alternate: 
    main: An alternate title
    sub: with a subtitle

Such a structure would be extensible if need arises. We can add language variables, type variables to indicate if the alternate is a translation or a transliteration, and convert language-alternate to an object or an array, if we need more than one alternate title. (This won't be needed for most citation needs, but if CSL JSON should serve as an exchange format then that's a different story.)

denismaier avatar Jul 30 '20 20:07 denismaier

I don’t think a type is necessary. That will be clear from the language code (as in the CSLm JSON example above).

bwiernik avatar Jul 30 '20 20:07 bwiernik

language-alternate sounds good. Or what about language-alternative?

I'm agnostic.

In terms of structure, it should mirror the standard structure of title variables

This attribute is broader, and its values would be things like "translated." So I don't think they need to mirror each other; do they?

bdarcus avatar Jul 30 '20 20:07 bdarcus

This attribute is broader, and its values would be things like "translated." So I don't think they need to mirror each other; do they?

Yes, it's broader. I just wanted to point out that it shouldn't just be a flat string, but have distinct properties for title parts.

denismaier avatar Jul 30 '20 20:07 denismaier

This attribute is broader, and its values would be things like "translated." So I don't think they need to mirror each other; do they?

No, I don't think the values should be "translated", etc. We should go one of three ways.

  1. A single "language-alternate" field, whose structure matches the structure of a title variable exactly. This would be analogous to having a separate translated-title variable.
  2. An object whose properties are language codes (e.g., "fr" or "de-CH" or "ja-Hira"), each containing a title variable object (without further language-alternate fields).
  3. An array whose elements are each a title variable object with a mandatory language field. The language field needs to be a language code identifying the language and writing system (e.g., "fr" or "de-CH" or "ja-Hira"). Whether it's a translation or transliteration is clear from the language code.

Of these, 1 and 3 are compatible with each other. We could do 1 now, but then easily add 3 as an option in a future version or in a multilingual extension.
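One way to see the compatibility of 1 and 3: a consumer can treat a single object as a one-element array. The Python sketch below assumes a `language-alternate` key on the title object; the helper name `alternates_as_list` is hypothetical.

```python
from typing import List


def alternates_as_list(title: dict) -> List[dict]:
    """Return language-alternate as a list, whether stored as one object or as an array."""
    alternate = title.get("language-alternate")
    if alternate is None:
        return []
    if isinstance(alternate, dict):
        # Option 1: a single title-shaped object.
        return [alternate]
    # Option 3: already an array of title-shaped objects.
    return list(alternate)


option_1 = {"main": "A title", "language-alternate": {"main": "An alternate title"}}
option_3 = {
    "main": "A title",
    "language-alternate": [
        {"main": "An alternate title", "language": "en"},
        {"main": "Ein alternativer Titel", "language": "de"},
    ],
}

print(len(alternates_as_list(option_1)))
print(len(alternates_as_list(option_3)))
```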

bwiernik avatar Jul 30 '20 20:07 bwiernik

My thinking is that we should make a solution that flows easily into having multiple alternates for multilingual support (or even just picking a translation based on the document locale). We could even fairly easily do (3) in v1.1 without the expectation of full ML support by:

  1. Adding a language element to the title object
  2. Making language-alternate an array

That honestly might be the most straightforward approach.

bwiernik avatar Jul 30 '20 20:07 bwiernik

Whether it's a translation or transliteration is clear from the language code.

How so?

You mean by virtue of it being under an alternate-language property?

bdarcus avatar Jul 30 '20 20:07 bdarcus

Whether it's a translation or transliteration is clear from the language code.

How so?

You mean by virtue of it being under an alternate-language property?

I guess "translation or transliteration" means two different things in that sentence... Certain language codes refer to transliterations: e.g., he-alalc97 means Hebrew transliterated according to the ALA-LC (Library of Congress) romanization rules.

denismaier avatar Jul 30 '20 21:07 denismaier

The BCP 47/RFC 5646 scheme Frank linked to defines language codes unambiguously, not only for languages/locales but also for scripts and the like. It's summarized here. The basic structure is language-script-region, with each part following defined patterns.

For example, if an item with language: ru contains a language-alternate element with language: ru-Latn, that means it's a romanized transliteration of the title. A language-alternate element with language: en would mean an English translation of the title.
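That reading of the codes can be sketched as a heuristic in Python. The function name `classify_alternate` and the three labels are assumptions; the rule is a simplification (it ignores private-use and grandfathered tags) and relies only on subtag lengths: script subtags have four letters and variant subtags such as alalc97 are longer, while region subtags have two letters or three digits.

```python
def classify_alternate(item_language: str, alternate_language: str) -> str:
    """Guess whether an alternate is a translation or a transliteration of the item's title."""
    item_parts = item_language.lower().split("-")
    alt_parts = alternate_language.lower().split("-")
    if item_parts[0] != alt_parts[0]:
        # Different primary language subtag: a different language altogether.
        return "translation"
    # Same language: a script (4 letters) or variant (4+ characters) subtag
    # signals a different writing system or romanization scheme.
    if alt_parts != item_parts and any(len(part) >= 4 for part in alt_parts[1:]):
        return "transliteration"
    return "same language"


print(classify_alternate("ru", "ru-Latn"))
print(classify_alternate("ru", "en"))
```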

bwiernik avatar Jul 30 '20 21:07 bwiernik