malaysian-dataset
preparing abstractive normalization
The pipeline is: rules-based Bahasa normalization -> noisy ms-en translation with a trained model -> standard English -> en-ms translation.
- Rules-based Bahasa normalization comes from `malaya.normalize`.
- For the noisy ms-en model, Google Translate is good enough.
- For en-ms translation, the en-ms translation from malaya was more acceptable to us, based on our test set.
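
The three steps above can be sketched as a chain of text-to-text stages that turns a noisy sentence into a (noisy, normalized) training pair. This is only a minimal illustration under stated assumptions: `rules_normalize`, `ms_to_en`, `en_to_ms`, `run_pipeline` and `make_pair` are hypothetical names, and the lookup tables are toy stand-ins for `malaya.normalize`, Google Translate and the malaya en-ms model. None of those real APIs are called here.

```python
from typing import Callable, Dict, List, Tuple

# A stage is any function that maps one string to another.
Stage = Callable[[str], str]

# Toy lookup tables, purely illustrative stand-ins for the real models.
SHORTHAND: Dict[str, str] = {"x": "tidak", "nk": "nak", "mkn": "makan"}
MS_EN: Dict[str, str] = {"saya tidak nak makan": "i do not want to eat"}
EN_MS: Dict[str, str] = {"i do not want to eat": "saya tidak mahu makan"}


def rules_normalize(text: str) -> str:
    """Hypothetical placeholder for the rules-based normalization step."""
    return " ".join(SHORTHAND.get(tok, tok) for tok in text.lower().split())


def ms_to_en(text: str) -> str:
    """Hypothetical placeholder for the noisy ms-en translation step."""
    return MS_EN.get(text, text)


def en_to_ms(text: str) -> str:
    """Hypothetical placeholder for the en-ms translation step."""
    return EN_MS.get(text, text)


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        text = stage(text)
    return text


def make_pair(noisy: str) -> Tuple[str, str]:
    """Return (noisy input, normalized output) as one training example."""
    stages = [rules_normalize, ms_to_en, en_to_ms]
    return noisy, run_pipeline(noisy, stages)


if __name__ == "__main__":
    print(make_pair("saya x nk mkn"))
```

Keeping each stage as a plain callable means any one of them can be swapped for a real model without touching the rest of the chain.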