WeTextProcessing.deprecated
[RoadMap] development plan for Chinese inverse text normalization
Project Explanations:

- Following NeMo's two-stage method of (1) classification and (2) verbalization, we plan to adapt jiayu's ITN grammar to this two-stage pipeline (for more details, please see this paper).

- We split Chinese ITN into two stages, each with its own WFST, rather than transducing the input text with a single WFST, for two reasons:
  - A WFST can only process its input linearly, but the word order can change between the spoken and written forms (e.g. 三分之一 -> 1/3).
  - The English ITN grammars, which have been carefully designed in NeMo, can be seamlessly integrated into this project.
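
To make the word-order argument concrete, here is a minimal plain-Python sketch of the two stages for a single-digit fraction. It is not WFST code and not the project's grammar: the function names `tag_fraction` and `verbalize_fraction` and the token layout (loosely modelled on NeMo-style `class { field: "value" }` tokens) are assumptions made for illustration. The tagger emits fields in spoken order (denominator first), and the verbalizer is free to write them in written order.

```python
import re

# Single Chinese digits only; a real grammar would cover full numbers.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
DIGIT_CLASS = "[" + "".join(DIGITS) + "]"


def tag_fraction(text):
    """Stage 1 (classification): emit fields in the order they are spoken."""
    m = re.fullmatch(f"({DIGIT_CLASS})分之({DIGIT_CLASS})", text)
    if m is None:
        return None
    den, num = m.group(1), m.group(2)
    return f'fraction {{ denominator: "{DIGITS[den]}" numerator: "{DIGITS[num]}" }}'


def verbalize_fraction(token):
    """Stage 2 (verbalization): read the fields and write them in written order."""
    fields = dict(re.findall(r'(\w+): "([^"]*)"', token))
    return f'{fields["numerator"]}/{fields["denominator"]}'


token = tag_fraction("三分之一")
print(token)                      # fraction { denominator: "3" numerator: "1" }
print(verbalize_fraction(token))  # 1/3
```

Because the intermediate token carries named fields, the reordering happens on the structured token rather than inside a single left-to-right transduction.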
Road Map:
- [x] Design semiotic classes for Chinese
- [x] Update Chinese ITN grammars from single-stage to two-stage
- [x] Simplify the ITN-related code of Sparrowhawk (C++) and migrate it to the WeNet runtime
Seems great, I will learn the basic ideas first.
Semiotic classes:

| category | sub-category | example |
|---|---|---|
| number | int | 三十一 ==> 31 |
| | float | 三十一点五七一 ==> 31.571 |
| | serial | 一一一二二二三三三 ==> 111222333 |
| | telephone | 加八六一八五四四一三九一二一 ==> +86-18544139121 |
| electronic | IP | 二幺九点二二三点幺八四点二五二 ==> 219.223.184.252 |
| | email | xyx艾特gmail点com ==> xyx@gmail.com |
| | url | xyx点com ==> xyx.com |
| fraction | fraction | 三分之一点二 ==> 1.2/3 |
| percent | percent | 百分之二点五 ==> 2.5% |
| measure | measure | 五点五美元 ==> 5.5$ |
| date | date | 二零二一年三月四日 ==> 2021年3月4日 |
| time | time | 下午三点十五分 ==> 3:15 pm |
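
The mappings in the table can be illustrated with a small plain-Python sketch for a few of the classes (int, serial, float, percent). This is only an illustration of the expected input/output behaviour, not the WFST grammars themselves; the helper names (`cn_int`, `cn_serial`, `cn_float`, `cn_percent`) are hypothetical, and `cn_int` only handles integers below 10000.

```python
# Digit readings, including 幺 (spoken "one" in serial numbers) and 两 (two).
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9, "幺": 1}
UNITS = {"十": 10, "百": 100, "千": 1000}


def cn_int(text):
    """Convert an integer below 10000 written with 十/百/千, e.g. 三十一 -> 31."""
    total, digit = 0, 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare leading 十 means 10, so default the multiplier to 1.
            total += (digit or 1) * UNITS[ch]
            digit = 0
        else:
            raise ValueError(f"unexpected character: {ch}")
    return total + digit


def cn_serial(text):
    """Convert a digit-by-digit reading, e.g. 二幺九 -> "219"."""
    return "".join(str(DIGITS[ch]) for ch in text)


def cn_float(text):
    """Convert 三十一点五七一 -> "31.571": integer part, 点, then serial digits."""
    whole, _, frac = text.partition("点")
    return f"{cn_int(whole)}.{cn_serial(frac)}" if frac else str(cn_int(whole))


def cn_percent(text):
    """Convert 百分之二点五 -> "2.5%"."""
    prefix = "百分之"
    if not text.startswith(prefix):
        raise ValueError("not a percent expression")
    return cn_float(text[len(prefix):]) + "%"


print(cn_int("三十一"))               # 31
print(cn_serial("一一一二二二三三三"))  # 111222333
print(cn_float("三十一点五七一"))      # 31.571
print(cn_percent("百分之二点五"))      # 2.5%
```

The IP row follows the same serial reading applied to each 点-separated group, for example `".".join(cn_serial(p) for p in "二幺九点二二三点幺八四点二五二".split("点"))`.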