Use standardized language identifiers for lbx files

Open pauloney opened this issue 11 years ago • 192 comments

Is there a "template" one can use to make the translations to be used in language.lbx? Or should that be done on top of one of the existing files?

I would like to create the files for Romanian, Vietnamese, Chinese and Japanese, and I have people in the office who are capable of making the translations and have experience with bibliographies, but NONE of them are programmers.

Also: is there a guide on how to add support for a new language? Even though it is easy to understand what goes on inside \DeclareBibliographyStrings{ }, I would like to know when it is preferable to use TeX encoding as opposed to UTF-8, for example.

Other questions are:

1- Can one add support for a language that is not supported by Babel?

2- When does one use \adddot and when does one use \adddotspace ?

3- Why is country support (within language.lbx) limited to Germany, EU, US, France and GB ?

4- Are you using a framework to do this? In general it is easier to manage them in a single spreadsheet with the translations to each language in each column and a script that reads the column and writes the LBX files! The translators can then easily compare to "nearby" languages and easily make other translations.

Is work by others on this kind of issue welcomed ?

Thanks for the great package! Paulo Ney

pauloney avatar Sep 20 '13 12:09 pauloney

General instructions can be found at the old SF wiki for biblatex (edit: that page now exists in an updated version on the GitHub wiki; please use https://github.com/plk/biblatex/wiki/Checklist-for-submitting-a-new-localisation-file-(.lbx)). For testing, see example 03-localization-keys in the documentation. You can use english.lbx as a starting point. Just complete \DeclareBibliographyStrings; we take care of the rest on the basis of your answers to the questions listed in the wiki.

Languages based on the Latin alphabet should be encoded in ASCII. That way they will be supported by any backend (BibTeX variants and biber).

Regarding your other questions:

  1. Not yet. The other maintainers are working on polyglossia support, but this is not an easy task.
  2. The output of X\adddotspace Y is less likely to have a linebreak between "X." and "Y" than X\adddot\ Y. Decisions on whitespace and punctuation are ideally made by the translator. Refer to the manual sections on adding punctuation and whitespace for further details.
  3. From the manual: "only a small number of country names is defined by default, mainly to illustrate this scheme". If we support all possible country*, patent* and patreq* strings you can imagine that this can get unwieldy.
  4. There are no further support files aside from the resources I mentioned above. A script might be helpful, but there are a few things to keep in mind: (1) contributors are working on different platforms, (2) version control of the spreadsheet could prove challenging with multiple editors and (3) for testing, working with the lbx file directly is probably more convenient.

aboruvka avatar Sep 20 '13 14:09 aboruvka

Audrey, thanks for the quick answer. I'll build the framework it takes to produce the several lbx files that will be necessary to really i18n Biblatex. The problem with working on a single lbx file at a time is that you can't compare to nearby languages or change your mind later about a better translation, because after you have written a token, the original is gone. I'll read in all existing lbx files and then produce the database that will be needed to drive the manufacturing of the new ones. There are 45 languages in Babel which are not in Biblatex, so it will take an organized effort of more than just programmers to get there.
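
A minimal sketch of the "read in all existing lbx files" step, written in Python purely for illustration. It assumes one entry per line of the form key = {{long form}{short form}}, and the helper names read_group and parse_strings are hypothetical; they are not part of biblatex or biber.

```python
import re

# Matches the start of an entry such as "  editor = {" (assumed: one entry per line).
ENTRY_START = re.compile(r"^\s*([A-Za-z]+)\s*=\s*\{")


def read_group(text, pos):
    """Return (content, next_pos) for the first balanced {...} group at or after pos."""
    start = text.index("{", pos)
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    raise ValueError("unbalanced braces")


def parse_strings(lbx_text):
    """Collect {key: (long_form, short_form)} from the text of an lbx file."""
    strings = {}
    for line in lbx_text.splitlines():
        match = ENTRY_START.match(line)
        if not match:
            continue
        outer, _ = read_group(line, match.end() - 1)
        if not outer.startswith("{"):
            continue  # e.g. "inherit = {english}" is not a long/short pair
        long_form, pos = read_group(outer, 0)
        short_form, _ = read_group(outer, pos)
        strings[match.group(1)] = (long_form, short_form)
    return strings
```

Counting brace depth instead of using a single regular expression keeps macros with their own braces inside a string intact. Entries that span several lines would raise ValueError here and need separate handling.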

I understand the problems of keeping track of changes (via GitHub) and testing are serious, but I'll produce the files and send them ready to you.

Supporting all country* strings is fairly easy! I already have most country names in some 200 languages, so I'll produce the files and make them available to you, if you want to use... but I would strongly recommend the use of a separate file for that - so as not to overload the lbx's.

Before I get started I have one small question. You mention the use of ".\isdot" on the Wiki page, but I do not see any occurrence of that on any of the lbx's. Is that really necessary at this level?

Thanks! Paulo Ney

pauloney avatar Sep 20 '13 18:09 pauloney

I'm for doing this if we can have a way for the releaser(s) (which is currently me) to generate all current .lbx files for a release on demand. I would prefer something like a db and a pull interface in Perl (biber is all in Perl ...) which generates the .lbx. If the db was something like SQLite, the db could also be in the biblatex git repo. The problem then is that contributors would either still send diffs against the generated .lbx or would need to look at the db, which is probably out of the question for most people. Text files are easier for this but, as you say, we don't get the coverage or consistency we need in future.
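
To make the "db plus pull interface" idea concrete, here is a rough sketch in Python (the comment above prefers Perl; Python is used only for illustration). The table name strings, its columns, and the function render_strings are all assumptions made up for this example, not an existing biblatex or biber interface.

```python
import sqlite3


def render_strings(conn, lang):
    """Render the DeclareBibliographyStrings block for one language."""
    rows = conn.execute(
        "SELECT name, long_form, short_form FROM strings"
        " WHERE lang = ? ORDER BY name",
        (lang,),
    ).fetchall()
    lines = ["\\DeclareBibliographyStrings{%"]
    for name, long_form, short_form in rows:
        lines.append("  " + name + " = {{" + long_form + "}{" + short_form + "}},")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical schema: one row per (language, string name) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE strings (lang TEXT, name TEXT, long_form TEXT, short_form TEXT,"
    " PRIMARY KEY (lang, name))"
)
conn.executemany(
    "INSERT INTO strings VALUES (?, ?, ?, ?)",
    [
        ("english", "editor", "editor", "ed\\adddot"),
        ("english", "countryeu", "European Union", "EU"),
    ],
)
print(render_strings(conn, "english"))
```

Because the output is sorted and fully determined by the database contents, regenerating every file for a release would give stable diffs against the files in git.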

plk avatar Sep 20 '13 18:09 plk

Philipp Lehman wrote the wiki page. I'm not sure what use-case he had in mind for .\isdot. AFAIK it isn't necessary. You could consider using \isdot in place of \adddot if the string preceding is the output of some command, which may or may not end with a period.

About overloading the lbx files or separate files for country-specific strings - this is what I meant by "unwieldy". Certain aspects of the core biblatex styles are demonstrative rather than exhaustive. This is one good example. Users can easily extend the lbx files. If you're wanting to share all those extra strings with others, consider an add-on package.

The DB/spreadsheet could be maintained similar to the localization keys document - just an extra resource, but not necessary for contributing lbx files. Note that only a fraction of the current lbx files are actually complete, so between-language comparisons are limited.

aboruvka avatar Sep 20 '13 18:09 aboruvka

It will take some building, but I think it is the only way to go! Imagine making a structural change and having to change/test 45 lbx files! I also want to build an interface so one can choose 3 to 4 languages to compare/edit in the DB - as the coverage gets bigger that will be more important.

Changes to lbx's made by users or entered directly on GitHub should integrate easily back into the DB... because those will continue to happen!

I'll take a detour and come back when I am able to generate all the current 20 essential lbx files exactly the way they are right now.

PN

pauloney avatar Sep 20 '13 19:09 pauloney

I am almost done with the back-end to produce the lbx files from a DB. I can produce lbx files that are almost identical to the existing ones and get some 50 more languages in the fray.... the problem here will be to get Babel to do the same thing ...but at this point I have an important question:

Why are we using a separate i18n LBX set of files, if we could use the ones from the CSL project at

   https://github.com/citation-style-language/locales

In my (uninformed) way to view it, there are plenty of reasons to use it instead of the lbx's:

  1. They are ready!
  2. They are in XML, making it a lot easier to test consistency, etc ...

Paulo Ney

pauloney avatar Sep 30 '13 18:09 pauloney

I like the idea of using standards like this but there are some things to consider though:

  1. Do they cover all of our strings?
  2. We'd have to convert to .lbx because it's much faster for biblatex to read them since they are TeX. XML parsing in TeX is not something you ever want to do.
  3. We'd have to make sure babel/polyglossia language ids are correct.
  4. We'd have to support things like \adddot etc. somehow since lots of .lbx files do this.
  5. There are special things in some .lbx files - all sorts of biblatex settings - we'd have to insert those.

plk avatar Sep 30 '13 18:09 plk

Answering each of your questions/comments:

  1. No! There are lots in common, but coverage is different, they have some strings that we don't and same way in reverse. Interesting question is WHY ? They in fact should be almost the same since the problem is the same! :)
  2. That settles it! I am glad to have a very definite argument! :(. Instead of converting from XML --> LBX and running the danger of not getting a complete lbx file back, what I am doing is parsing all LBX and XML files into the database, sorting out some conflicting areas by hand, and then exporting far more complete LBX files, adding a few languages in the process.
  3. This is an area that deserves some immediate standardization! It is wrong to do it by "language" because of the pt_PT/pt_BR, en_GB/en_US/en_CA, ... discrepancies. The files should really be labeled by "locale" (which is a standard), and possibly we should ask the Babel/Polyglossia people to do the same. If you look at the way Babel names the files, there is NO procedure in place; each one got named at some point in time in a different way, including "portuges.lbx", which was named in this fashion (with two errors) because of the DOS restriction on filenames.
  4. That is the case with the XML files as well since there are abbreviations that use a DOT and some that don't... unless I am missing something here.
  5. I am dealing with it considering that every lbx file has a (fixed) pre-amble and a post-amble, and each of them gets picked up and built at the time the file is generated.
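
The merge described in point 2 could look roughly like the following Python sketch. It assumes the CSL terms have already been mapped onto biblatex key names (a step not shown), the sample values are invented for the example, and merge_sources is a hypothetical helper name.

```python
def merge_sources(lbx, csl):
    """Merge two {key: text} tables; return (merged, conflicts)."""
    merged = dict(csl)
    merged.update(lbx)  # existing lbx wording wins by default
    # Keys present in both sources with different text need a human decision.
    conflicts = sorted(
        key for key in lbx.keys() & csl.keys() if lbx[key] != csl[key]
    )
    return merged, conflicts


# Invented sample data, only to show the shape of the result.
lbx = {"editor": "editor", "translator": "translator"}
csl = {"editor": "editor", "translator": "translated by", "interview": "interview"}
merged, conflicts = merge_sources(lbx, csl)
print(conflicts)  # keys that still have to be sorted out by hand
```

Keeping the lbx wording by default means a regenerated file never silently changes an existing translation; only the listed conflicts are reviewed.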

PN

pauloney avatar Sep 30 '13 19:09 pauloney

Well we could consider the CSL route later if they were more to our needs but currently, they're not really. I had this argument with the "generic bib system" people a few years ago - they didn't seem to understand that high-quality bib typesetting needs semantic integration into the typesetting - there is no good "generic" solution ... If you can generate identical .lbx files to our current ones, let's discuss further ... which database are you using?

plk avatar Sep 30 '13 19:09 plk

I can produce identical lbx's already. When they differ, it is because the original lbx's have something wrong - a space out of place, etc ...

I am using MySQL because at the moment it is what I have on one particular server where I am collaborating with someone else on the project, but I am writing very generic code that could be changed to anything.

I would like to add that one more advantage of doing this via the DB is that you can then interface with people all over who are interested in i18n of biblatex. They would just need to enter the data in an interface and their lbx files could be exported and later included in the distribution.

pauloney avatar Sep 30 '13 19:09 pauloney

Ok - what language are you using for data extraction and creation of .lbxs?

plk avatar Sep 30 '13 19:09 plk

Perl.

pauloney avatar Sep 30 '13 19:09 pauloney

Good. Biber is all in perl too. Perhaps you could send me a MySQL dump and the perl? I'd like to have a look at it.

plk avatar Sep 30 '13 19:09 plk

Sure! Give me some time to wrap it up ... I am sorting out the issues with translations into languages that have "gender" right now (so I can parse in the XML), and I will sort out a few other edges and send you the stuff. It is just one script.

pauloney avatar Sep 30 '13 19:09 pauloney

No rush, many thanks. We'd then have to think about hosting this in some way or perhaps using SQLite and keeping just a db file in the git repository etc.

plk avatar Sep 30 '13 19:09 plk

One thing I realized today writing the maps to parse the XML files of CSL, is that they have a nice way to recognize the gender and number (singular or plural) of words in other languages that is NOT present in the lbx file structure!

To translate a phrase like

Translated and Annotated by ...

to languages like Portuguese and Spanish requires one to know the gender of the entity being translated and annotated. If it is a book or an album it will be masculine, but if it is a collection or a thesis it will be feminine. So I don't really see how this could be done in the realm of the current lbx files.

Would someone mind sharing the wisdom on how these problems will be dealt with ?

PN

pauloney avatar Oct 01 '13 04:10 pauloney

@aboruvka - do you have a comment on this?

plk avatar Oct 01 '13 08:10 plk

Gender specific strings come up with idem*. These can be selected on the basis of the gender field.

idemsf   feminine singular form of idem
idemsm   masculine singular form of idem
idemsn   neuter singular form of idem
idempf   feminine plural form of idem
idempm   masculine plural form of idem
idempn   neuter plural form of idem
idempp   plural form of idem suitable for a mixed gender list of names
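
As a small illustration of how the gender field value maps onto these keys, here is a Python sketch. The suffixes are the ones listed above; the function name idem_key is hypothetical and this is not how biblatex itself implements the selection.

```python
# Gender/number codes taken from the idem* key suffixes listed above.
GENDER_CODES = ("sf", "sm", "sn", "pf", "pm", "pn", "pp")


def idem_key(gender):
    """Return the idem* localisation key for a gender field value."""
    if gender not in GENDER_CODES:
        raise ValueError("unknown gender code: " + repr(gender))
    return "idem" + gender
```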

Some languages use masculine or feminine ordinals depending on the gender of item being indexed (e.g. series or edition). These are handled on the translator's end with the bibliography "extras" questions I mentioned earlier.

For the "by" roles, you could simply add gender/number-specific variants provided that the gender/number of the work is strongly tied to the entrytype (e.g. @book entries are always masculine-singular, @mvbook masculine-plural, @collection feminine-plural, etc). Note that album entrytypes are not formally supported and the @thesis entrytype doesn't support the role fields (only one person works on a thesis anyway).

The same problem has been mentioned in #48 for non-"by" roles, where the gender/number would be specific to the people filling the role. The strings already consider number because this is available in name list processing. Gender would have to be indicated explicitly in the entry somehow.

aboruvka avatar Oct 01 '13 13:10 aboruvka

Thanks! That should do it.

pauloney avatar Oct 01 '13 13:10 pauloney

Not quite. There is work on our end to be done. The bibliography extras questions would also need expanding to ask about the gender and number of @article, @book, @mvbook, @inbook, @collection, @incollection, and @mvcollection.

I'm saying it is probably do-able, but we have to consider work required to get this done, the relative demand for the new feature, and potential issues the feature might open up. If PL knew about this limitation and decided not to implement it, he likely had a very good reason.

aboruvka avatar Oct 01 '13 14:10 aboruvka

PLK, Audrey, I am down to the wire, and about to start the last upload to the db and the last series of tests. Should I grab a set of fresh lbx files from the development branch ? Or use the last public release?

pauloney avatar Oct 01 '13 20:10 pauloney

Always grab from DEV - it's more up to date ...

plk avatar Oct 02 '13 05:10 plk

One of the hardest things I had to deal with in this side project was the fact that "language" and "locale" are mixed inside BibLatex in some unreasonable ways. It is true that most of what it inherits (or uses) from Babel is in the form of language, but the LBX files contain so much about "locale" that it is impossible to do it all in the realm of language only.

When one says that an entry should have "hyphenation = {portuguese}" that is all good and okay, but the entry:

language = {portuguese}

should never be expected to format an entry properly because Iran, Bahamas, Kazakhstan, ... are written in one way in pt_PT and in another way in pt_BR.

In order to circumvent my difficulties introducing the translated terms in a DB and importing some new ones I had to literally introduce locales in my table of languages and vice-versa... something a programmer should never have to do!

Now that internationalization is really coming, in order to manage this well and be able to expand into languages that have many, many locales, it would be nicer to split these two roles cleanly. I know that for Portuguese alone there is a portuguese.lbx, portuges.lbx, brazil.lbx and brazilian.lbx, but it is extremely hard to maintain the way it is laid out, to eliminate duplicates and to deal with inconsistencies. One should have a unique file "portuguese.lbx" and a couple of additional files, pt-BR.lbx and pt-PT.lbx, that call the main one and define some small local components.
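
The layering proposed here (a shared language table plus small locale overrides) can be sketched in Python as a fallback lookup. The tags "pt", "pt-BR" and "pt-PT", the key countryir (which follows the country* naming scheme but is not a key biblatex ships), and the helper names are all assumptions for illustration.

```python
def fallback_chain(tag):
    """'pt-BR' -> ['pt-BR', 'pt']: most specific locale first."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def lookup(key, tag, tables):
    """Return the first definition of key found along the fallback chain."""
    for candidate in fallback_chain(tag):
        table = tables.get(candidate, {})
        if key in table:
            return table[key]
    raise KeyError(key)


# Shared Portuguese strings plus locale-specific overrides (sample data).
tables = {
    "pt": {"editor": "editor"},
    "pt-BR": {"countryir": "Irã"},
    "pt-PT": {"countryir": "Irão"},
}
print(lookup("countryir", "pt-BR", tables))
print(lookup("editor", "pt-BR", tables))
```

With this layout each locale file only carries what differs, so duplicates between the Brazilian and European Portuguese files disappear by construction.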

Labeling of language and locale should follow standards (ISO and IETF) so one can interchange with other Bibliography management software and compatibility with the name space of Babel should be an internal issue and the user should never have to deal with that at a bibliography entry level.

Just my 2cents!

Paulo Ney

pauloney avatar Oct 02 '13 19:10 pauloney

With the 2.8 DEV branch, I'm moving away from the hyphenation field and re-naming it langid since that's what it is - it's a language ID in babel (or, with 2.8, polyglossia too). There will be a langidopts for specifying polyglossia language options like variant names ("american" and "british" for the langid "english" etc.). The language field is just a printed field - not used to localise anything - it's misleading, I agree.

plk avatar Oct 02 '13 20:10 plk

Lines 461-462 of the english.lbx file have a curious entry:

   countryeu = {{European Union}{EU}},
   countryep = {{European Union}{EP}},

can anyone tell me what the second line means ?

Paulo Ney

pauloney avatar Oct 11 '13 20:10 pauloney

I should have said that I saw this:

   \keyitem{countryeu} The name <European Union>, abbreviated as \vrb{EU}.
   \keyitem{countryep} Similar to \vrb{countryeu} but abbreviated as \vrb{EP}. This is intended for \bibfield{patent} entries.

in the examples, but I remain puzzled by the meaning of it...

Paulo Ney

pauloney avatar Oct 11 '13 20:10 pauloney

Good question - @aboruvka - any idea? It looks to me like a copy-paste which should read:

countryep = {{European Patent}{EP}},

?

plk avatar Oct 11 '13 20:10 plk

No idea. I don't think it is a mistake, though, because then countryep would be redundant with patenteu.

aboruvka avatar Oct 11 '13 20:10 aboruvka

I am not sure I understand your phrase! It is redundant, but you don't think it is a mistake ?

pauloney avatar Oct 14 '13 10:10 pauloney

Hi People! I am mostly done with the framework to deal with the translations, and I am able now to write "identical" LBX files and at the same time use the DB to do the wonderful things I mentioned, like:

  • acquire new translations
  • acquire translations from other Open Source projects like CSL
  • check on the quality of translations of each token individually
  • use the power of the db to complete many of the incomplete lbx files
  • write many more (about 150) other lbx files.
  • organize/name files according to ISO standards.

In doing so, there are always a few choices here and there; in the next few e-mails I'll report on the most important ones to make sure you all agree with them. Then later I have a few questions on what is the preferred way to write the files, etc ...

If this is not the correct place for this, please let me know!

Paulo Ney

pauloney avatar Oct 14 '13 11:10 pauloney