Guide for NER Augmentation

Open DecentMakeover opened this issue 5 years ago • 8 comments

Thanks for sharing your work; I could not find any NLP augmentation library other than this one.

Will this library help with augmenting NER data?

My data looks like this:

Ryan B-PER
Dsouza B-PER
/DOB O
11/11/1997 B-DOB
/MALE O
22 B-NUM
56565 B-NUM

Thanks in advance

DecentMakeover avatar Aug 08 '19 10:08 DecentMakeover

This library does not yet support generating augmented data for NER problems.

I can add it if there are research papers related to this problem.

makcedward avatar Aug 09 '19 16:08 makcedward

Maybe I can help. I have a custom dataset that I need to augment; maybe you could include that in your library?

DecentMakeover avatar Aug 09 '19 16:08 DecentMakeover

Thanks for your contribution.

Please share the corresponding papers with me so I can check whether this can be supported.

makcedward avatar Aug 10 '19 16:08 makcedward

I'm really interested in this as well, as I am trying to do NER with a limited dataset. I'm not aware of any papers looking at this specifically, but I think it might be interesting to combine it with a data-generating DSL like Chatette (I actually asked about the problems nlpaug tackles in this issue: https://github.com/SimGus/Chatette/issues/25).

I think a useful first step might be to just make the substitutions tag-aware, so that a substitution never changes a word's tag. Potentially you might also want a flag that prevents substitutions on tagged (i.e. not 'O') words altogether.
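A minimal sketch of the tag-aware idea above, in plain Python rather than nlpaug itself. The function name `augment_o_tokens` and the `synonyms` lookup table are hypothetical, introduced only for illustration; a real setup would plug in whatever substitution source is available. Only 'O' tokens are ever replaced, and each replacement is one token for one token, so the tag sequence stays aligned.

```python
import random

def augment_o_tokens(tokens, tags, synonyms, p=0.5, seed=None):
    """Replace only 'O'-tagged tokens; entity tokens and all tags stay unchanged.

    `synonyms` is a hypothetical dict mapping a lowercased word to a list of
    single-token replacements.
    """
    rng = random.Random(seed)
    out = []
    for token, tag in zip(tokens, tags):
        candidates = synonyms.get(token.lower(), [])
        # Skip anything that carries an entity tag, even if a synonym exists.
        if tag == "O" and candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return out, list(tags)

tokens = ["Peter", "likes", "dogs"]
tags = ["B-PER", "O", "O"]
synonyms = {"likes": ["loves", "enjoys"], "dogs": ["puppies"], "peter": ["Paul"]}

new_tokens, new_tags = augment_o_tokens(tokens, tags, synonyms, p=1.0, seed=0)
print(list(zip(new_tokens, new_tags)))
```

Even though the table contains an entry for "peter", the tagged token is left alone, which is the behavior the proposed flag would give.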

This of course presumes the existence of a labelled, if small, dataset, which I think is totally reasonable. I think combining context-aware vector substitutions with a DSL, and maybe some gazetteer pipelines to streamline external inputs, could be really powerful, and a cool project to work on if anyone is interested!

Zylatis avatar Nov 10 '19 03:11 Zylatis

@Zylatis Thank you for your input. A DSL could be one solution. I will look into how nlpaug can support a DSL.

Before that, you may consider using the "stopwords" attribute to simulate tag-aware behavior. You can change the list of stopwords per augmentation.

import nlpaug.augmenter.word as naw
text = "Peter likes dogs"
aug = naw.ContextualWordEmbsAug()
# Stopwords are skipped by the augmenter, so "Peter" is left unchanged
aug.stopwords = ['Peter']
aug.augment(text)
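Building on the snippet above, the stopword list could be derived from the tags instead of being typed by hand. This is only a sketch: `entity_stopwords` is a hypothetical helper, not part of nlpaug, and it assumes one tag per token with 'O' marking non-entities, as in the sample data at the top of this issue.

```python
def entity_stopwords(tokens, tags):
    """Collect the distinct tokens whose tag is not 'O', in first-seen order."""
    seen = []
    for token, tag in zip(tokens, tags):
        if tag != "O" and token not in seen:
            seen.append(token)
    return seen

tokens = ["Ryan", "Dsouza", "/DOB", "11/11/1997", "/MALE", "22", "56565"]
tags = ["B-PER", "B-PER", "O", "B-DOB", "O", "B-NUM", "B-NUM"]

stopwords = entity_stopwords(tokens, tags)
print(stopwords)
# The result could then be assigned per sentence, e.g. aug.stopwords = stopwords
```

One caveat: stopwords are matched by word, not by position, so a word that is an entity in one place and 'O' in another would be protected everywhere in the sentence. The augmenter's own tokenizer may also split tokens such as dates differently from the labelled data.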

makcedward avatar Nov 10 '19 07:11 makcedward

Hi,

I was looking for this too. The above code snippet is certainly helpful.

But there is another use case, in which we might want to replace a tagged entity with another word.

Is there any example of this?

manishiitg avatar Jan 29 '20 05:01 manishiitg

This is a simple custom NER augmenter which might help

https://gist.github.com/manishiitg/8fd4209fcb3c6cb08ed34705c1f32c86

manishiitg avatar Jan 29 '20 07:01 manishiitg

Hi @makcedward @manishiitg, have there been any recent improvements for creating synthetic NER data?

Original_text = 'My name is Pratik. I live in India'

Augmented can be:

  1. 'My name is Jon. I live in U.S.A'
  2. 'My name is Manish. I live in China'
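For reference, this kind of entity swap can be sketched in plain Python without any library support. Everything here is an assumption for illustration: the `swap_entities` helper, the BIO tagging of the example sentence, the `PER`/`LOC` label names, and the small `gazetteer` of replacement values are not part of nlpaug. Each entity span is replaced by a random gazetteer entry of the same type, and the tags are regenerated so they stay aligned even when the replacement has a different number of tokens.

```python
import random

def swap_entities(tokens, tags, gazetteer, seed=None):
    """Replace each BIO entity span with a random same-type gazetteer entry."""
    rng = random.Random(seed)
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        label = tag[2:]  # strip the "B-" / "I-" prefix
        # Copy 'O' tokens and entity types without replacement candidates.
        if tag == "O" or not gazetteer.get(label):
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
            continue
        # Find the end of the current span (following I-<label> tokens).
        j = i + 1
        while j < len(tokens) and tags[j] == "I-" + label:
            j += 1
        replacement = rng.choice(gazetteer[label]).split()
        out_tokens.extend(replacement)
        out_tags.extend(["B-" + label] + ["I-" + label] * (len(replacement) - 1))
        i = j
    return out_tokens, out_tags

tokens = ["My", "name", "is", "Pratik", ".", "I", "live", "in", "India"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]
gazetteer = {"PER": ["Jon", "Manish"], "LOC": ["U.S.A", "China", "New Zealand"]}

new_tokens, new_tags = swap_entities(tokens, tags, gazetteer, seed=1)
print(" ".join(new_tokens))
print(new_tags)
```

Because the surrounding 'O' words are untouched, this could be combined with a word-level augmenter that treats the swapped-in entities as stopwords.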

pratikchhapolika avatar Mar 03 '23 05:03 pratikchhapolika