beets Romaji transliteration for Japanese metadata

Romaji transliteration for Japanese metadata

Open Sinity opened this issue 8 years ago • 28 comments

Can this be implemented, somehow?

Mar 22 '16 13:03 Sinity

Could you explain a little more about how this works? Is there an automatic way to transliterate this text, and is there software out there to do it?

It should probably be a separate config option, not tied to the import config.

Mar 22 '16 16:03 sampsyo

Well, I thought a source site could be added that provides lyrics in romaji. But when I think about it, I doubt anything like that exists with sufficient quality and quantity of lyrics.

I actually hadn't thought of transliterating 'normal' lyrics to romaji, but I think that would work well.

I've researched it a bit, and I think this package would solve the problem best:

https://github.com/kevincobain2000/jProcessing#kanji-katakana-hiragana-to-tokenized-romaji-jconvert-py

(it already leads to usage example)

Also, accepted answer here: http://stackoverflow.com/questions/5827439/any-tools-to-programmatically-convert-japanese-sentence-into-its-romaji-phoneti

provides some solution with external software.

"It should probably be a separate config option, not tied to the import config."

I guess that's right.

So, lyrics plugin should, when updating some lyrics:

Check if they contain Japanese characters.
If so, check whether 'lyrics: romaji' (or a similarly named option) is set
If it is, call a function or run a program on these lyrics, line by line; the returned results are the final lyrics
Write these final lyrics to tags/database
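The steps above could be sketched roughly as follows. This is only an illustration, not existing beets code: `contains_japanese`, `process_lyrics`, the `romaji_enabled` flag, and the `transliterate` callback are hypothetical names, and the character ranges are an approximation (the kanji block is shared with Chinese text).

```python
import re
from typing import Callable

# Hiragana, Katakana, and CJK Unified Ideographs (kanji).
# Assumption: this is a rough test; kanji are shared with Chinese.
JAPANESE_RE = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")


def contains_japanese(text: str) -> bool:
    """Return True if the text has at least one kana or kanji character."""
    return JAPANESE_RE.search(text) is not None


def process_lyrics(
    lyrics: str,
    romaji_enabled: bool,
    transliterate: Callable[[str], str],
) -> str:
    """Transliterate lyrics line by line when the (hypothetical) option is on."""
    if not romaji_enabled or not contains_japanese(lyrics):
        return lyrics
    return "\n".join(
        transliterate(line) if contains_japanese(line) else line
        for line in lyrics.split("\n")
    )
```

The `transliterate` callback is where an external library or program would be plugged in; the result of `process_lyrics` is what would then be written to tags and the database.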

Mar 22 '16 16:03 Sinity

OK, that makes sense! Transliterating using an existing, off-the-shelf tool seems like a reasonable feature request.

Mar 22 '16 17:03 sampsyo

shouldn't it be called something more generic than romaji? I assume other languages can be transliterated as such.

Mar 22 '16 21:03 ghost

As a non-Japanese-speaker, I don't really know. It seems like either:

This is just a generic "transliteration to Roman characters" problem, which can be solved with unidecode or similar.
There are special transliteration rules for the Japanese language that need their own tools.

But I have no idea which is right.

Mar 22 '16 21:03 sampsyo

@jrobeson

Well, all alphabets can be transliterated to other ones, AFAIK - you just change one symbol to another (or one symbol to a set of symbols, or a set of symbols to one symbol...).

But I don't think there would be demand for much more. Maybe the Chinese and Korean scripts?

EDIT:

I've checked a few Hiragana characters with this 'unidecode' package, and it seems to work. I can't think of any reason why Japanese would need special treatment. So I think it would make sense to add a generalized 'lyrics: romanize' option which would call unidecode.unidecode on lyrics before writing them to the database/tags.

Also, a bit offtopic, maybe this option could be added globally? So for example titles and artists would all be romanized? That would get rid of diacritics as well, if someone does not want them. It could be configurable, so some chars. would be left alone, while others would be transliterated?

Maybe as a plugin? It could even be standalone, without touching lyrics. For example, upon importing new music, metadata would be fetched from MusicBrainz, then other plugins like lyrics would process it as well, and at the end 'romanize' plugin would process all metadata if 'auto' option is true?

Can a plugin control when it's invoked (so that it would run last)?

Mar 22 '16 21:03 Sinity

hiragana and katakana can be transliterated 1 to 1, but kanji characters can't.

many characters have multiple readings; even the days of the month have multiple readings

一日 = tsuitachi = first day of the month
二日 = futsuka = second day of the month
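A small sketch of the difference described here. The tables are deliberately tiny and hand-written for illustration; `HIRAGANA_TO_ROMAJI`, `WORD_READINGS`, and both functions are hypothetical names, and the kana lookup ignores digraphs such as きゃ. The extra reading "ichinichi" (meaning "one day") is not from the post above; it is added to show that a single spelling can have more than one reading.

```python
# A handful of hiragana, mapped character by character.
HIRAGANA_TO_ROMAJI = {"さ": "sa", "く": "ku", "ら": "ra"}

# Kanji words need a dictionary of whole-word readings, and a word may
# have several; choosing between them depends on context.
WORD_READINGS = {
    "一日": ["tsuitachi", "ichinichi"],
    "二日": ["futsuka"],
}


def kana_to_romaji(text: str) -> str:
    """Map each known hiragana to romaji; leave other characters as-is."""
    return "".join(HIRAGANA_TO_ROMAJI.get(ch, ch) for ch in text)


def candidate_readings(word: str) -> list:
    """Return every known reading for a kanji word (possibly none)."""
    return WORD_READINGS.get(word, [])


print(kana_to_romaji("さくら"))
print(candidate_readings("一日"))
```

A per-character table is enough for the first function, while the second can only list candidates; picking the right one is what a dedicated Japanese library has to do.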

Mar 22 '16 22:03 ghost

there are also multiple romanization methods. Quoting: https://en.wikipedia.org/wiki/Romanization_of_Japanese : "There are several different romanization systems. The three main ones are Hepburn romanization, Kunrei-shiki Rōmaji (ISO 3602), and Nihon-shiki Rōmaji (ISO 3602 Strict). Variants of the Hepburn system are the most widely used."
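To illustrate how the choice of system changes the output, here is a minimal sketch with a few kana where Hepburn and Kunrei-shiki differ. The `SYSTEMS` table and `romanize` function are made up for this example and cover only five characters.

```python
# A few hiragana whose romanization differs between the two systems.
SYSTEMS = {
    "hepburn": {"し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu", "じ": "ji"},
    "kunrei": {"し": "si", "ち": "ti", "つ": "tu", "ふ": "hu", "じ": "zi"},
}


def romanize(text: str, system: str = "hepburn") -> str:
    """Romanize known kana under the chosen system; pass the rest through."""
    table = SYSTEMS[system]
    return "".join(table.get(ch, ch) for ch in text)


print(romanize("ふじ", "hepburn"))
print(romanize("ふじ", "kunrei"))
```

If this were ever exposed as a setting, a single option naming the system would be enough for the table lookup above.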

Mar 22 '16 22:03 ghost

So it won't be that simple :(

I guess the best way would be to start with 'unidecode', and then gradually add different handlers for different Unicode ranges.

Mar 22 '16 22:03 Sinity

@Ezodev, can you try running some sample lyrics through unidecode to see if it gives reasonable results?

Mar 22 '16 23:03 sampsyo

it's not about ranges, it's about combinations; unidecode just won't do. You need to find something specific for Japanese. I know libraries exist for this, but I'm not familiar with them. There are even different readings for combinations of hiragana and kanji vs kanji and kanji.

Mar 23 '16 01:03 ghost

So it sounds like there are lots of different options for the romaji transliteration. In that case, we have a bit more work to do to map out the exact set of options this functionality would need.

Mar 23 '16 01:03 sampsyo

there's really only one option which would be the romanization style, and it really doesn't have to be optional in the near term. The readings aren't something that are configurable, it just requires a library that does the right thing.

Mar 23 '16 03:03 ghost

Alternatively, you could feed it through the google translate api maybe?

Mar 23 '16 03:03 ghost

Got it; thanks for clarifying!

Mar 23 '16 05:03 sampsyo

@jrobeson "it's not about ranges, it's about combinations, unidecode just won't do. You need to find something specific for japanese. "

I meant that we could detect if a given character is within some range of Unicode codes, and based on that we would select the tool which would transliterate it.

Now I really think that it should be a separate plugin after all, because anyone who wants lyrics in romaji most likely wants track titles and artists in romaji too.

@sampsyo

Is the order in which plugins are invoked controllable? This plugin should be invoked after, for example, the lyrics and chroma plugins.

Mar 23 '16 09:03 Sinity

@Ezodev : I'd actually prefer to have lyrics in both romaji and original. I guess it helps that I know a little bit of japanese though :)

I don't generally transliterate my track titles either. I like them (mostly) the way they are.

Mar 23 '16 09:03 ghost

However, in an unrelated question.. is there some way to have track titles available in both languages/scripts, with one being canonical? I guess I'm not that familiar with the ID3 (or other similar metadata) spec

Mar 23 '16 09:03 ghost

@jrobeson

Well, I don't know Japanese :D I can't really discern track titles when they're in their alphabet.

About both versions... I think that should be easily doable. Just add an option for it. For example, a format with placeholders %o and %t: %o for "insert original here" and %t for "insert transliterated tag here". The user would also select which tags get transliterated. Then we could have track titles like {transliterated}({original}) or {transliterated} - {original}.
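A minimal sketch of such a format option, using Python's named `str.format` fields instead of %o and %t. The function name `format_tag` and the rule of skipping the template when nothing changed are assumptions made for this example.

```python
def format_tag(template: str, original: str, transliterated: str) -> str:
    """Combine the original and transliterated values using a template.

    Assumption: if transliteration changed nothing (e.g. an English
    title), return the original rather than repeating it twice.
    """
    if original == transliterated:
        return original
    return template.format(original=original, transliterated=transliterated)


print(format_tag("{transliterated} ({original})", "さくら", "sakura"))
print(format_tag("{transliterated} - {original}", "Abbey Road", "Abbey Road"))
```

The template string would come from the user's configuration, so either of the layouts mentioned above is just a different value of the same option.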

About having both versions in tags... Even if the specification supports it, most music players wouldn't support it anyway...

Mar 23 '16 10:03 Sinity

It would be good to always have the transliterated version available in tags, if for nothing else than folder/file naming. Currently, if I processed my Yoko Kanno tracks with beets, I'd be unable to access them from the command line. I can't do a "beet ls ごちそうさん交響曲" on my qwerty keyboard, even if I had any idea what to search for anyway.

Mar 23 '16 23:03 bearcatsandor

I've changed the title of the issue to reflect the desire to transliterate any kind of metadata.

Mar 24 '16 00:03 sampsyo

@bearcatsandor : you can if you install an IME, but you'd definitely still have to know how 交響曲 is pronounced.

Mar 24 '16 01:03 ghost

Throwing this in because it might be relevant.

MB handles transliteration of tracklists (artist + track titles) currently using pseudo releases. So original and transliterated spellings of a track end up in different releases. While there is an open issue for beets to work better with that (#654), it would probably be difficult to store both locally but nonetheless preferable over automagic stuff.

However, the May schema update for MB will change how this is supposed to work by introducing alternative tracklists. With this, one MB release can have many different tracklists, which can be used for alternative spellings of tracks, or translations/transliterations (http://tickets.musicbrainz.org/browse/MBS-4501). With all the information then available within one release, I suppose it would be easier to store it locally in beets. The only difficulty to figure out then is how to match all the alternative options to tags (i.e. which title to store in the ID3 title tag, Japanese, transliterated Romaji or translated English, etc).

Mar 24 '16 07:03 pprkut

that's really neat @pprkut

Mar 24 '16 07:03 ghost

About storing both transliterated and original metadata, maybe we could just use flexible attributes? For example, each track processed with this plugin would have the normal 'title' tag, plus 'original_title' and 'transliterated_title'. The content of the 'title' tag would be determined by the format given in some config option. The main possibilities would be:

use transliterated title
use original title - that would still allow us to query stuff if we know what it is in transliterated form
use some combination of both: for example "{transliterated_tag} - {original_tag}"

Also, I think this plugin should be a bit more general than just Japanese -> romaji. As I said before, it could transliterate any 'special characters' to an ASCII representation. We could have a 'whitelist' option for non-ASCII characters that should remain, so that, for example, someone from Poland could leave their diacritics alone.

Then, when the plugin processes given metadata, it does so char by char. For each character, it decides how to handle it: the default would be to pass it through unidecode. For Japanese characters, it would grab all consecutive characters and pass them through some specialized library. Other quirky symbols could be handled by yet another implementation.
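Here is a rough sketch of that dispatch, under some assumptions: `transliterate`, `strip_marks`, and the handler parameters are hypothetical names, the Japanese detection is a simple Unicode-range regex, and `strip_marks` is only a standard-library stand-in for unidecode so the example has no third-party dependency.

```python
import re
import unicodedata
from typing import Callable

# Runs of consecutive kana or kanji are handed to the Japanese handler.
JAPANESE_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")


def strip_marks(ch: str) -> str:
    """Stand-in default handler: decompose and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate(
    text: str,
    japanese_handler: Callable[[str], str],
    default_handler: Callable[[str], str] = strip_marks,
    whitelist: str = "",
) -> str:
    """Send Japanese runs to one handler, other non-ASCII chars to another."""
    keep = set(whitelist)

    def handle_other(segment: str) -> str:
        # ASCII and whitelisted characters are left untouched.
        return "".join(
            ch if ch.isascii() or ch in keep else default_handler(ch)
            for ch in segment
        )

    out = []
    pos = 0
    for match in JAPANESE_RUN.finditer(text):
        out.append(handle_other(text[pos:match.start()]))
        out.append(japanese_handler(match.group()))
        pos = match.end()
    out.append(handle_other(text[pos:]))
    return "".join(out)
```

Passing whole runs rather than single characters matters because, as noted earlier in the thread, kanji readings depend on the surrounding characters.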

I have a small question: are flexible attributes somehow written to the files themselves, or are they only present in the database?

Mar 24 '16 13:03 Sinity

Flexible attributes only live in the database.

Mar 24 '16 14:03 sampsyo

I like @Ezodev's idea of a "replace" plugin that replaces certain characters in a song's metadata. I suggested something similar here that could "fix" #1893.

Mar 24 '16 14:03 jackwilsdon

I just discovered a really nice library for this! https://pypi.org/project/pykakasi/

And an improved fork of unidecode that builds on it called unihandecode.

Oct 26 '20 02:10 ctrueden
