pythainlp icon indicating copy to clipboard operation
pythainlp copied to clipboard

Using adversarial texts for training text normalization algorithms

Open p16i opened this issue 5 years ago • 6 comments

Consider the following example:

Current PythaiNLP's text normalization relies heavily on rules, which are sufficient in some circumstances. However, consider the following example, it considerably fails

image.

Input: อัตราดอกเบ้ียเงนิฝากและเงนิกู้เพ่ิมขึน้ Expect: อัตราดอกเบี้ยเงินฝากและเงินกู้เพิ่มขึ้น

Would it be possible if we can train a ML model for text normalization? I think the approach is similar to what we did for thai2rom, which is a seq2seq model.

Speaking about training data, we might develop a probabilistic model that perturbed a given word according to to some rules, e.g. สระลอย. So, we can use it to generate the training data for our seq2seq normalization model.

From what I can see, consider that many Thai official documents are in PDF, this model will be very useful for preprocessing results from PDF parsing, which is typically not robust for cases such as สระลอย.

@c4n : do you think we can leverage what you've developed for https://github.com/c4n/Thai-Adversarial-Evaluation here?

p16i avatar Apr 02 '20 15:04 p16i

Hi guys, You can try using this code to generate adversarial texts.

I think the rules I've used are quite simple, and there are rooms for more rules/ adversarial texts generation techniques. Also, it would be great if we can collect real spelling errors. (for more info on Thai spelling errors, you can take a look at this article: "A Cognitive and Linguistic Analysis of Search Queries of an Online Dictionary:A Case Study of LEXiTRON" ). As of now, there aren't that many works done on Thai spelling errors. But I hope that there will be more papers on this subject soon.

You can use either an E2E model or a 2-stage (detection + correction) model to do the spelling correction task. You guys can consult the first-half of this lecture to begin with. Spelling Correction was a take-home midterm exam (Kaggle competition) at Chula this semester. Let me ask them, if I can share the link to the competition.

c4n avatar Apr 02 '20 16:04 c4n

Very interesting, will be very useful.

bact avatar Apr 04 '20 19:04 bact

@wannaphong what do you think we advertise this issue in the Thai NLP group? There might be students who're interested in working on the idea, as a course project.

p16i avatar Apr 05 '20 09:04 p16i

@wannaphong what do you think we advertise this issue in the Thai NLP group? There might be students who're interested in working on the idea, as a course project.

I think it is interesting. We may be able to present based on the problems encountered.

wannaphong avatar Apr 05 '20 14:04 wannaphong

@cstorm125 because I actually took the example from you, can you tell us a bit about your use cases?

p16i avatar Apr 05 '20 15:04 p16i

@heytitle I used to to clean up texts before putting them through embeddings. This might be a good research topic. For practical use, since we are leaning more towards subwords, it might not be as useful.

cstorm125 avatar Jun 13 '20 13:06 cstorm125