zh TN is very slow and has poor accuracy

Open lifeiteng opened this issue 2 years ago • 10 comments

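Normalizing one simple zh-CN sentence takes 1.32 sec, and the result is wrong: the date in the output is left unnormalized. The English examples below take 0.02 sec each.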

>python normalize.py --text="123" --language=en
INFO:NeMo-text-processing:one hundred and twenty three
WARNING:NeMo-text-processing:Execution time: 0.02 sec

>python normalize.py --text="我出生于1998年7月22日" --language=zh
INFO:NeMo-text-processing:我出生于1998年7月22日
WARNING:NeMo-text-processing:Execution time: 1.32 sec

>python normalize.py --text="I'm born in 22/3/1990" --language=en
INFO:NeMo-text-processing:I'm born in the twenty second of march nineteen ninety
WARNING:NeMo-text-processing:Execution time: 0.02 sec

lifeiteng avatar Oct 20 '23 06:10 lifeiteng

@BuyuanCui could you please take a look?

ekmb avatar Oct 20 '23 17:10 ekmb

This seems to be related to the existing TN bug: it was not able to process a whole sentence. It will be fixed by the PR that I'm working on.

BuyuanCui avatar Oct 20 '23 17:10 BuyuanCui

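@lifeiteng here are a few options to speed this up:

  • use --cache_dir so that the compiled grammars are cached and reused instead of being rebuilt on every run
  • use normalize_list() to normalize many sentences with a single Normalizer instance: https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize.py#L75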

ekmb avatar Oct 20 '23 17:10 ekmb
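
For anyone trying the two suggestions above from Python rather than from the command line, here is a minimal sketch. It builds the normalizer once and times construction separately from normalization, since the one-shot `normalize.py` timings reported in this issue may include grammar loading. `timed_normalize` and `make_nemo_normalizer` are hypothetical helper names, `./tn_cache` is an assumed path, and the `Normalizer` constructor arguments are assumed to match your installed version of NeMo-text-processing.

```python
import time
from typing import Callable, List, Protocol, Tuple


class ListNormalizer(Protocol):
    """Anything exposing normalize_list(), e.g. NeMo's Normalizer."""

    def normalize_list(self, texts: List[str]) -> List[str]: ...


def timed_normalize(
    make_normalizer: Callable[[], ListNormalizer], texts: List[str]
) -> Tuple[List[str], float, float]:
    """Build the normalizer once, then normalize all texts in one call.

    Returns (outputs, seconds spent building, seconds spent normalizing).
    """
    start = time.perf_counter()
    normalizer = make_normalizer()
    build_s = time.perf_counter() - start

    start = time.perf_counter()
    outputs = normalizer.normalize_list(texts)
    run_s = time.perf_counter() - start
    return outputs, build_s, run_s


def make_nemo_normalizer(lang: str = "zh", cache_dir: str = "./tn_cache"):
    """Hypothetical factory; cache_dir is an assumed location for cached grammars."""
    # Imported lazily so timed_normalize() can be tried without NeMo installed.
    from nemo_text_processing.text_normalization.normalize import Normalizer

    return Normalizer(input_case="cased", lang=lang, cache_dir=cache_dir)
```

Usage would look like `outputs, build_s, run_s = timed_normalize(make_nemo_normalizer, sentences)`. If `build_s` dominates, the cache is the thing to fix; if `run_s` dominates, batching through `normalize_list()` is the relevant knob.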

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 20 '23 01:11 github-actions[bot]

@BuyuanCui wrote: "This seems to be related to the existing TN bug. It was not able to process a whole sentence. It will be fixed with the PR that I'm working on."

Has this problem been solved? It still occurs in version 0.3.0.

lsrami avatar Apr 28 '24 16:04 lsrami

@BuyuanCui https://github.com/NVIDIA/NeMo-text-processing/pull/112

ekmb avatar Apr 30 '24 18:04 ekmb

I've found that the TN FST is slow regardless of language (English too). It is not very practical with large data even using multiprocessing (normalize_list()). Any other ways to speed it up?

riqiang-dp avatar May 06 '24 22:05 riqiang-dp

@riqiang-dp we recommend Sparrowhawk for deployment https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/text_normalization/wfst/wfst_text_processing_deployment.html

ekmb avatar May 07 '24 01:05 ekmb