meeteval Added pseudo alignment strategy based on phoneme duration

Hi guys,

Greetings from Brno. I am trying to add phoneme-based duration as another pseudo-alignment strategy for word durations. This could enable tcpWER for languages such as Japanese, in which a single character (e.g., a kanji) can have a much longer duration than others. Also adding Alexander Polok (@Lakoc) here, as he is responsible for https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard.
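As a rough illustration of the idea (not the code from this PR), a segment's time span can be divided among its words in proportion to their phoneme counts. The function name `phoneme_proportional_intervals`, the `count_phonemes` callable, and the toy phoneme counts below are all assumptions made for this sketch; a real implementation would obtain the counts from a grapheme-to-phoneme library such as transphone.

```python
from typing import Callable, List, Tuple


def phoneme_proportional_intervals(
    words: List[str],
    start: float,
    end: float,
    count_phonemes: Callable[[str], int],
) -> List[Tuple[float, float]]:
    """Split [start, end] into one interval per word, sized by phoneme count."""
    # Give every word at least one unit so that no interval collapses to zero.
    weights = [max(1, count_phonemes(word)) for word in words]
    total = sum(weights)
    duration = end - start

    intervals = []
    consumed = 0
    for weight in weights:
        word_start = start + duration * consumed / total
        consumed += weight
        word_end = start + duration * consumed / total
        intervals.append((word_start, word_end))
    return intervals


# Hypothetical phoneme counts standing in for a real G2P lookup.
toy_counts = {"東京": 6, "に": 2, "行く": 3}
intervals = phoneme_proportional_intervals(
    ["東京", "に", "行く"], 0.0, 11.0, lambda w: toy_counts.get(w, len(w))
)
print(intervals)
```

With a purely character-based split, the two-character word "東京" would receive the same share as "行く"; here it receives a larger share because its assumed phoneme count is higher.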

Jun 24 '25 11:06 popcornell

Hey @popcornell!

This is a good extension for evaluating languages such as Japanese.

I'm unsure about your choice for the interface. I'm not that happy to add a language argument to that many functions just for this one pseudo word level timestamp strategy. This would have to be added to the whole interface including api and the CLI. I prefer something like strategy='phoneme_based_jpn' or strategy=phoneme_based('jpn'), so that the interface doesn't change.

Jun 24 '25 12:06 thequilo

Hi, thanks for the PR. Having phone-based splitting sounds great.

Is transphone some kind of standard, or does it at least produce a standardized output?

I am wondering whether transphone_phoneme_based would be a better name for that option. It is a bit lengthy, but it tells the user what is used and doesn't block the introduction of alternative phone-based splitters.

Jun 24 '25 12:06 boeddeker

> I'm unsure about your choice for the interface. I'm not that happy to add a language argument to that many functions just for this one pseudo word level timestamp strategy. This would have to be added to the whole interface including api and the CLI. I prefer something like strategy='phoneme_based_jpn' or strategy=phoneme_based('jpn'), so that the interface doesn't change.

I see arguments for both approaches. The language argument makes the code a bit simpler (e.g., a simple dict lookup to get the subsegment function) and allows better CLI help text and CLI value checking (not sure if we have checks for this implemented). On the other hand, encoding the language into the strategy makes it more obvious that only the strategy uses the language.

Since we now have an expert in this chat, maybe one other question first: Samuele, do you know whether splitting transcripts at whitespace is the correct implementation for languages like Japanese? For Japanese often the character error rate is reported, but I don't know if usually the tools are language aware or people prepare the transcript to use WER calculators. Depending on your answer, the language argument could be used at multiple positions.

Jun 24 '25 13:06 boeddeker

I agree: if the language argument is useful in other places, like word splitting or normalization, it may be worth adding it to the interface.

Jun 24 '25 13:06 thequilo

> For Japanese often the character error rate is reported, but I don't know if usually the tools are language aware or people prepare the transcript to use WER calculators. Depending on your answer, the language argument could be used at multiple positions.

Yeah, actually I was unsure about that too. I was assuming that one would split the reference before feeding it to meeteval, but maybe the best way to handle this is to make it dependent on the language or to have an additional argument. I am not sure we should handle it depending on the language, though, because the reference and/or the system output may or may not contain whitespace.

Are you guys OK with another argument, like has_whitespaces: Optional[bool] = True?

Jun 24 '25 15:06 popcornell

@popcornell Do you have examples of the output of a Japanese ASR system? The guys from NTT said that CER is usually used instead of WER, which completely ignores whitespace and splits the text into individual characters. In that case, we may want to add a time-constrained CER.

Jun 25 '25 13:06 thequilo

For documentation:

Until now, we have no clear answer on the best way to support Japanese and Chinese (e.g., what typical system outputs look like).
- Supporting CER is probably the best option (e.g., removing whitespace and converting the string into a list of characters instead of the split call). The current python api already supports this implicitly, as the user can do the split manually.
- We discussed CER, and as of now, we tend toward introducing unit='word' and unit='char' in the python api signature, and adding meeteval.cer as a CLI entry point.
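
A minimal sketch of what such a unit switch could look like at the tokenization step. The helper name `tokenize` is hypothetical and not part of the meeteval API; only the values unit='word' and unit='char' are taken from the discussion above.

```python
from typing import List


def tokenize(transcript: str, unit: str = "word") -> List[str]:
    """Turn a transcript into the tokens that the error rate is computed on."""
    if unit == "word":
        # Current behavior: split at whitespace.
        return transcript.split()
    if unit == "char":
        # Drop all whitespace, then treat every remaining character as a token.
        return list("".join(transcript.split()))
    raise ValueError(f"Unknown unit: {unit!r}")


print(tokenize("今日は 晴れ", unit="word"))
print(tokenize("今日は 晴れ", unit="char"))
```

Because whitespace is removed before the character split, the result does not depend on whether the reference or the system output contains spaces.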

Jun 27 '25 14:06 boeddeker

We discussed the following:

The language should be encoded in the strategy name. We found no other use case for a language argument than this alignment strategy. The distinction between CER and WER should be explicit and independent of the language. So, we want the language to be encoded in the alignment strategy key, like transphone_phoneme_based_jpn. For this, the pseudo_word_level_strategies should become a class so that it can split off the language and pass it on to the transphone library, similar to the normalizer in https://github.com/fgnt/meeteval/blob/main/meeteval/wer/normalizer.py.
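
One way the key could be split into its parts, as a sketch only: the class name `PhonemeBasedStrategy`, the method `from_key`, and the assumption that keys follow the pattern <package>_phoneme_based_<language> are illustrative and not existing meeteval code.

```python
from dataclasses import dataclass

# Assumed key layout: <package>_phoneme_based_<language>
_MARKER = "_phoneme_based_"


@dataclass(frozen=True)
class PhonemeBasedStrategy:
    package: str   # e.g. "transphone"
    language: str  # e.g. "jpn"

    @classmethod
    def from_key(cls, key: str) -> "PhonemeBasedStrategy":
        """Parse a strategy key such as 'transphone_phoneme_based_jpn'."""
        package, marker, language = key.partition(_MARKER)
        if not (package and marker and language):
            raise ValueError(f"Not a phoneme-based strategy key: {key!r}")
        return cls(package=package, language=language)


strategy = PhonemeBasedStrategy.from_key("transphone_phoneme_based_jpn")
print(strategy.package, strategy.language)
```

Keeping both the package and the language in the key leaves the public function signatures unchanged, while the parsed object can forward the language to whichever phoneme library the key names.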

Since there are potentially multiple libraries for obtaining phoneme durations, the package name should also be encoded in the strategy name.

CER in a different PR: We'll do the split into characters in a different pull request. In that PR, we'll for now supply a short script that splits segments into single-character words as a preprocessing step, so that the WER functions yield a CER.

@popcornell Are you willing to adjust the PR with the required changes or should we do it?

Jul 16 '25 12:07 thequilo

Hey guys, yeah, I plan to adjust it. But I am currently busy at JSALT; I thought I would have more time. I can do it this weekend, though.

Jul 16 '25 14:07 popcornell

@popcornell ping! Do you still plan to work on this? If not, I'd have some time.

Sep 18 '25 12:09 thequilo

Hey Thilo, I'm currently still busy with ICLR...

Sep 21 '25 13:09 popcornell