Romaji transliteration for Japanese metadata

Open Sinity opened this issue 8 years ago • 28 comments

Can this be implemented, somehow?

Sinity avatar Mar 22 '16 13:03 Sinity

Could you explain a little more about how this works? Is there an automatic way to transliterate this text, and is there software out there to do it?

It should probably be a separate config option, not tied to the import config.

sampsyo avatar Mar 22 '16 16:03 sampsyo

Well, I thought a source site that provides lyrics in romaji could be added. But when I think about it, I doubt anything like that exists with sufficient quality and quantity of lyrics.

I actually hadn't thought of transliterating 'normal' lyrics to romaji, but I think that would work well.

I've researched it a bit, and I think this package would solve the problem best:

https://github.com/kevincobain2000/jProcessing#kanji-katakana-hiragana-to-tokenized-romaji-jconvert-py

(the link already leads to a usage example)

Also, the accepted answer here provides a solution that uses external software: http://stackoverflow.com/questions/5827439/any-tools-to-programmatically-convert-japanese-sentence-into-its-romaji-phoneti

"It should probably be a separate config option, not tied to the import config."

I guess that's right.

So, when updating lyrics, the lyrics plugin should:

  1. Check if they contain Japanese characters.
  2. If so, check if 'lyrics: romaji' (or something named similarly) is set.
  3. If it is, call a function/run a program on these lyrics, line by line; the returned results are the final lyrics.
  4. Write these final lyrics to tags/database.
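The steps above can be sketched roughly as follows. This is only an illustration, not beets code: `contains_japanese`, `process_lyrics`, the `romanize_enabled` flag, and the `transliterate` callback are all hypothetical names, and the Unicode ranges are a simplification (the kanji block is shared with Chinese). Writing the result to tags/database (step 4) is left out.

```python
from typing import Callable

# Unicode blocks treated as "Japanese" for this sketch (an assumption).
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji, shared with Chinese)
)


def contains_japanese(text: str) -> bool:
    """Step 1: report whether any character falls in the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in JAPANESE_RANGES)


def process_lyrics(
    lyrics: str,
    romanize_enabled: bool,
    transliterate: Callable[[str], str],
) -> str:
    """Steps 2 and 3: if the option is set, transliterate line by line."""
    if not (romanize_enabled and contains_japanese(lyrics)):
        return lyrics
    # `transliterate` stands in for whatever library or program is chosen.
    return "\n".join(transliterate(line) for line in lyrics.splitlines())
```

The transliteration function is passed in as a parameter so that the choice of library stays open.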

Sinity avatar Mar 22 '16 16:03 Sinity

OK, that makes sense! Transliterating using an existing, off-the-shelf tool seems like a reasonable feature request.

sampsyo avatar Mar 22 '16 17:03 sampsyo

shouldn't it be called something more generic than romaji? I assume other languages can be transliterated as such.

ghost avatar Mar 22 '16 21:03 ghost

As a non-Japanese-speaker, I don't really know. It seems like either:

  • This is just a generic "transliteration to Roman characters" problem, which can be solved with unidecode or similar.
  • There are special transliteration rules for the Japanese language that need their own tools.

But I have no idea which is right.

sampsyo avatar Mar 22 '16 21:03 sampsyo

@jrobeson

Well, all alphabets can be transliterated to other ones, AFAIK - you just change one symbol to another (or one symbol to a set of symbols, or a set of symbols to one symbol...).

But I don't think there would be demand for much more. Maybe Chinese and Korean scripts?

EDIT:

I've checked a few Hiragana characters with this 'unidecode' package, and it seems to work. I can't think of any reason why Japanese would need special treatment. So I think it would make sense to add a generalized 'lyrics: romanize' option which would call unidecode.unidecode on lyrics before writing them to database/tags.

Also, a bit off-topic: maybe this option could be added globally, so that, for example, titles and artists would all be romanized? That would get rid of diacritics as well, if someone does not want them. It could be configurable, so that some characters would be left alone while others would be transliterated.
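A minimal sketch of the "remove diacritics but leave some characters alone" idea, using only the standard library's `unicodedata` rather than unidecode. The function name `strip_diacritics` and the `keep` parameter are hypothetical. This only handles characters that decompose into a base letter plus combining marks; it does nothing for Japanese, and letters without such a decomposition (for example Polish "ł") pass through unchanged.

```python
import unicodedata


def strip_diacritics(text: str, keep: frozenset = frozenset()) -> str:
    """Drop combining marks from every character not listed in `keep`."""
    out = []
    for ch in text:
        if ch in keep:
            # Characters the user asked to leave alone.
            out.append(ch)
            continue
        # NFKD splits e.g. "ö" into "o" plus a combining diaeresis.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(strip_diacritics("Bj\u00f6rk"))
print(strip_diacritics("caf\u00e9", keep=frozenset("\u00e9")))
```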

Maybe as a plugin? It could even be standalone, without touching lyrics. For example, upon importing new music, metadata would be fetched from MusicBrainz, then other plugins like lyrics would process it as well, and at the end a 'romanize' plugin would process all metadata if its 'auto' option is true?

Can a plugin control when it's invoked (so that it runs last)?

Sinity avatar Mar 22 '16 21:03 Sinity

hiragana and katakana can be transliterated 1 to 1, but kanji characters can't.

many characters have multiple readings, even with the days of the month there are multiple readings

一日 = tsuitachi = first day of the month
二日 = futsuka = second day of the month
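To illustrate the difference, here is a small sketch with made-up table names (`KANA_TO_ROMAJI`, `NAIVE_KANJI`, `WORD_READINGS`) and a deliberately tiny set of entries. A per-character table is enough for plain kana, but the same approach applied to kanji produces a reading that is wrong for the date words above.

```python
# Tiny illustrative subset of a kana table; plain kana map one to one.
KANA_TO_ROMAJI = {"さ": "sa", "く": "ku", "ら": "ra", "か": "ka", "な": "na"}

# A naive per-character kanji table, using one common reading each.
NAIVE_KANJI = {"一": "ichi", "二": "ni", "日": "nichi"}

# Readings actually depend on the whole word and its context.
WORD_READINGS = {
    "一日": ["tsuitachi", "ichinichi"],  # first of the month / one day
    "二日": ["futsuka"],                 # second of the month
}


def per_char(text: str, table: dict) -> str:
    """Transliterate by looking up each character independently."""
    return "".join(table[ch] for ch in text)


print(per_char("さくら", KANA_TO_ROMAJI))  # fine for kana
print(per_char("二日", NAIVE_KANJI))       # not a valid reading of 二日
print(WORD_READINGS["二日"])
```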

ghost avatar Mar 22 '16 22:03 ghost

there are also multiple romanization methods. Quoting: https://en.wikipedia.org/wiki/Romanization_of_Japanese : "There are several different romanization systems. The three main ones are Hepburn romanization, Kunrei-shiki Rōmaji (ISO 3602), and Nihon-shiki Rōmaji (ISO 3602 Strict). Variants of the Hepburn system are the most widely used."

ghost avatar Mar 22 '16 22:03 ghost

So it won't be that simple :(

I guess the best way would be to start with 'unidecode', and then gradually add different handlers for different Unicode ranges.

Sinity avatar Mar 22 '16 22:03 Sinity

@Ezodev, can you try running some sample lyrics through unidecode to see if it gives reasonable results?

sampsyo avatar Mar 22 '16 23:03 sampsyo

it's not about ranges, it's about combinations, unidecode just won't do. You need to find something specific for japanese. I know libraries exist for this, but i'm not familiar with them. There are even different readings for combinations of hiragana and kanji vs kanji and kanji.

ghost avatar Mar 23 '16 01:03 ghost

So it sounds like there are lots of different options for the romaji transliteration. In that case, we have a bit more work to do to map out the exact set of options this functionality would need.

sampsyo avatar Mar 23 '16 01:03 sampsyo

there's really only one option, which would be the romanization style, and it really doesn't have to be optional in the near term. The readings aren't configurable; it just requires a library that does the right thing.

ghost avatar Mar 23 '16 03:03 ghost

Alternatively, maybe you could feed it through the Google Translate API?

ghost avatar Mar 23 '16 03:03 ghost

Got it; thanks for clarifying!

sampsyo avatar Mar 23 '16 05:03 sampsyo

@jrobeson "it's not about ranges, it's about combinations, unidecode just won't do. You need to find something specific for japanese. "

I meant that we could detect if a given character is within some range of Unicode codes, and based on that we would select the tool which would transliterate it.

Now I really think that it should be a separate plugin after all, because if someone wants to have lyrics in romaji, then they most likely want track titles and artists in romaji too.

@sampsyo

Is the order in which plugins are invoked controllable? This plugin should be invoked after, for example, the lyrics and chroma plugins.

Sinity avatar Mar 23 '16 09:03 Sinity

@Ezodev : I'd actually prefer to have lyrics in both romaji and original. I guess it helps that I know a little bit of japanese though :)

I don't generally transliterate my track titles either. I like them (mostly) the way they are.

ghost avatar Mar 23 '16 09:03 ghost

However, an unrelated question: is there some way to have track titles available in both languages/scripts, with one being canonical? I guess I'm not that familiar with the ID3 (or other similar metadata) spec.

ghost avatar Mar 23 '16 09:03 ghost

@jrobeson

Well, I don't know Japanese :D I can't really discern track titles when they're in their alphabet.

About both versions... I think that should be easily doable. Just add an option for it. For example, a format string with the tags %o and %t: %o for "insert original here" and %t for "insert transliterated tag here". The user would also select which tags get transliterated. Then we could have track titles like {transliterated}({original}) or {transliterated} - {original}.

About having both versions in tags... Even if the specification supports it, most music players wouldn't support it anyway.

Sinity avatar Mar 23 '16 10:03 Sinity

It would be good to always have the transliterated version available in tags, if for nothing else than folder/file naming. Currently, if I processed my Yoko Kanno tracks with beets, I'd be unable to access them from the command line. I can't do a "beet ls ごちそうさん交響曲" on my QWERTY keyboard, even if I had any idea what to search for.

bearcatsandor avatar Mar 23 '16 23:03 bearcatsandor

I've changed the title of the issue to reflect the desire to transliterate any kind of metadata.

sampsyo avatar Mar 24 '16 00:03 sampsyo

@bearcatsandor : you can if you install an IME, but you'd definitely still have to know how 交響曲 is pronounced.

ghost avatar Mar 24 '16 01:03 ghost

Throwing this in because it might be relevant.

MB handles transliteration of tracklists (artist + track titles) currently using pseudo releases. So original and transliterated spellings of a track end up in different releases. While there is an open issue for beets to work better with that (#654), it would probably be difficult to store both locally but nonetheless preferable over automagic stuff.

However, the May schema update for MB will change how this is supposed to work by introducing alternative tracklists. With this, one MB release can have many different tracklists, which can be used for alternative spellings of tracks, or translations/transliterations (http://tickets.musicbrainz.org/browse/MBS-4501). With all the information then available within one release, I suppose it would be easier to store it locally in beets. The only difficulty to figure out then is how to match all the alternative options to tags (i.e. which title to store in the ID3 title tag, Japanese, transliterated Romaji or translated English, etc).

pprkut avatar Mar 24 '16 07:03 pprkut

that's really neat @pprkut

ghost avatar Mar 24 '16 07:03 ghost

About storing both transliterated and original metadata, maybe we could just use flexible attributes? For example, each track processed with this plugin would have the normal 'title' tag, and additionally 'original_title' and 'transliterated_title'. The content of the 'title' tag would be determined by the format given in some config option. The main possibilities would be:

  1. use transliterated title
  2. use original title - that would still allow us to query stuff if we know what it is in transliterated form
  3. use some combination of both: for example "{transliterated_tag} - {original_tag}"
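The three options above could all be expressed through one format string. The sketch below uses a plain dict in place of a beets item, and `apply_title_format` is a hypothetical helper; only the attribute names 'original_title' and 'transliterated_title' and the placeholder names come from the proposal.

```python
def apply_title_format(item: dict, template: str) -> dict:
    """Return a copy of `item` whose 'title' is built from the template."""
    updated = dict(item)
    updated["title"] = template.format(
        transliterated_tag=item["transliterated_title"],
        original_tag=item["original_title"],
    )
    return updated


track = {
    "title": "交響曲",
    "original_title": "交響曲",
    "transliterated_title": "Koukyoukyoku",
}

# Option 1: transliterated only; option 2: original only; option 3: both.
for template in (
    "{transliterated_tag}",
    "{original_tag}",
    "{transliterated_tag} - {original_tag}",
):
    print(apply_title_format(track, template)["title"])
```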

Also, I think this plugin should be a bit more general than just Japanese -> romaji. As I said before, it could transliterate any 'special characters' to an ASCII representation. We could have a 'whitelist' option for non-ASCII characters that should remain, so that, for example, someone from Poland could leave their diacritics alone.

Then, when the plugin processes given metadata, it does so character by character. For each character, it decides how to handle it: the default implementation would be to pass it through unidecode. For Japanese characters, it would grab all consecutive characters and pass them through some specialized library. Other quirky symbols could be handled by yet another implementation.
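A rough sketch of that dispatch, assuming the handlers are supplied by the caller: `japanese_handler` would wrap a Japanese-specific library and `default_handler` would wrap something like unidecode. All names here are hypothetical, and the Unicode ranges are a simplification.

```python
from itertools import groupby
from typing import Callable

# Hiragana, Katakana, and CJK Unified Ideographs (an assumed, simplified set).
JAPANESE_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF))


def transliterate_mixed(
    text: str,
    japanese_handler: Callable[[str], str],
    default_handler: Callable[[str], str],
    whitelist: frozenset = frozenset(),
) -> str:
    def classify(ch: str) -> str:
        if ch in whitelist or ord(ch) < 128:
            return "keep"  # ASCII and whitelisted characters stay as they are
        if any(lo <= ord(ch) <= hi for lo, hi in JAPANESE_RANGES):
            return "japanese"
        return "other"

    parts = []
    # groupby yields runs of consecutive characters of the same class, so
    # the Japanese handler sees whole runs rather than single characters.
    for kind, run in groupby(text, key=classify):
        chunk = "".join(run)
        if kind == "japanese":
            parts.append(japanese_handler(chunk))
        elif kind == "other":
            parts.append(default_handler(chunk))
        else:
            parts.append(chunk)
    return "".join(parts)
```

Passing whole runs to the Japanese handler matters because readings depend on combinations of characters, as pointed out earlier in the thread.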

I have a small question: are flexible attributes written to the files themselves, or are they only present in the database?

Sinity avatar Mar 24 '16 13:03 Sinity

Flexible attributes only live in the database.

sampsyo avatar Mar 24 '16 14:03 sampsyo

I like @Ezodev's idea of a "replace" plugin that replaces certain characters in a song's metadata. I suggested something similar here that could "fix" #1893.

jackwilsdon avatar Mar 24 '16 14:03 jackwilsdon

I just discovered a really nice library for this! https://pypi.org/project/pykakasi/

And an improved fork of unidecode that builds on it called unihandecode.
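For reference, a minimal sketch of what calling pykakasi might look like. It assumes the `kakasi().convert()` interface of pykakasi 2.x, where each returned item is a dict with a "hepburn" key; older releases used a different API. The wrapper name `to_hepburn` is hypothetical, and it returns None when the package is not installed.

```python
try:
    import pykakasi  # third-party package: pip install pykakasi
except ImportError:
    pykakasi = None


def to_hepburn(text: str):
    """Return a Hepburn romanization of `text`, or None without pykakasi."""
    if pykakasi is None:
        return None
    kks = pykakasi.kakasi()
    # convert() splits the text into chunks and gives several
    # transliterations for each; only the Hepburn one is used here.
    return " ".join(item["hepburn"] for item in kks.convert(text))


print(to_hepburn("ごちそうさん交響曲"))
```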

ctrueden avatar Oct 26 '20 02:10 ctrueden