International firstname/lastname parser

Open DhruvSoni opened this issue 8 years ago • 10 comments

ex --> "Jon doe is a geek" <---- if I use people() or topics() on this sentence, it returns "null" instead of "Jon doe"

DhruvSoni avatar May 30 '16 09:05 DhruvSoni

hey there, this is the result I get for this, on v5.1.0:

console.log(nlp.sentence("Jon doe is a geek").people());
// [ Person {
//     text: 'Jon',
// ...
// }]

it also works on nlp.text(). Can you double-check this? cheers

spencermountain avatar May 31 '16 20:05 spencermountain

hey there, thanks for the quick reply. I have checked, and it's my mistake actually: I'm using Indian names, since I am Indian, and that's why it didn't work for me.

DhruvSoni avatar Jun 01 '16 04:06 DhruvSoni

ah, yeah sorry, I used an American name-frequency list, I think - just being lazy ;). we should begin augmenting it with something like this data: https://en.wikipedia.org/wiki/List_of_most_popular_given_names.

My feeling is that family names in India in particular are easily recognizable, or high-frequency, though I'm not sure why. Maybe that's something that would help as well. I'll keep this open while I check it out. thanks!

spencermountain avatar Jun 01 '16 14:06 spencermountain

Welcome, and can you just reply to me when you solve it?

DhruvSoni avatar Jun 01 '16 18:06 DhruvSoni

I haven't looked at the license, but I believe it's the GNU Lesser General Public License (LGPL), so it should be kosher. The latest version of the ANNIE plugin in GATE Developer has a list of male and female first names, along with a list of ambiguous male and female first names.

That might be a good place to start.

https://gate.ac.uk/gate/plugins/ANNIE/resources/gazetteer/

  • person_male.lst
  • person_male_ambig.lst
  • person_male_ambig_cap.lst
  • person_male_cap.lst
  • person_male_lower.lst
  • person_male_lower_ambig.lst
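
As a rough illustration of how lists like these could feed a name lookup, here is a minimal sketch. It assumes each file is plain text with one name per line, which has not been checked against the actual files. `parseNameList`, `findFirstNames` and the sample names are hypothetical and are not part of nlp_compromise or GATE.

```javascript
// Hypothetical sample data: NOT the real contents of any GATE list.
const sampleList = `
Jon
David
Spencer
`;

// Assumption: one name per line; blank lines are skipped.
// Names are lowercased so that lookups are case-insensitive.
function parseNameList(text) {
  const names = new Set();
  for (const line of text.split(/\r?\n/)) {
    const name = line.trim().toLowerCase();
    if (name) names.add(name);
  }
  return names;
}

// Return the words of a sentence that appear in the name lexicon.
function findFirstNames(sentence, lexicon) {
  return sentence
    .split(/\s+/)
    // strip punctuation, keeping letters, apostrophes and hyphens
    .map((word) => word.replace(/[^\p{L}'-]/gu, ''))
    .filter((word) => word && lexicon.has(word.toLowerCase()));
}

const lexicon = parseNameList(sampleList);

// A name missing from the list is simply not found...
const before = findFirstNames('Dhruv is a geek', lexicon);

// ...until the list is augmented with it.
lexicon.add('dhruv');
const after = findFirstNames('Dhruv is a geek', lexicon);

console.log(before, after);
```

A plain lookup like this says nothing about ambiguous words, which is presumably why the lists above come in separate "ambig" variants.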

davidbuhler-zz avatar Oct 06 '16 14:10 davidbuhler-zz

LOVE IT! hey david, wanna work on them in v7? I've made a bunch of changes o'er there.

so i'm really excited about getting many of these in. We need to be careful about filesize, and ambiguous terms - more than gate does - but this data is excellent.

what do you wanna do? wanna split-up the work somehow?

spencermountain avatar Oct 06 '16 14:10 spencermountain

--oooh, let's tag sports teams as organizations. I can add an extra pos-tag for Team or something

spencermountain avatar Oct 06 '16 14:10 spencermountain

yeah, let me take a crack at Organization. I'm on it.

spencermountain avatar Oct 06 '16 14:10 spencermountain

I'll see if I have time. I'm going to take a crack at some documentation clean-up, next.

davidbuhler-zz avatar Oct 07 '16 03:10 davidbuhler-zz

have gotten around to doing some of this today; it's a pretty massive project. mostly some compression and redundancy work on the femaleName list, and an early lastname list. @kahwee has mentioned firstname/lastname order reversal in China/Japan/Korea - apparently in Hungary too.

name-parsing is becoming a pretty big part of this project, and we may need to do a 'nationality-tagging' of names in order to parse them properly - bit scared of this: http://crookedtimber.org/2004/03/01/whats-in-the-order-of-a-name/
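
To make the order problem concrete, here is a minimal sketch of splitting a full name once a nationality tag is already known. The `splitName` helper, the tag strings and the example names are all hypothetical. Working out the tag from the name itself is the hard part and is not attempted here.

```javascript
// Assumption: these tags mark conventions where the family name comes first,
// following the countries mentioned in this thread.
const FAMILY_NAME_FIRST = new Set(['china', 'japan', 'korea', 'hungary']);

// Split a full name into first/last parts, given a known nationality tag.
function splitName(fullName, nationality) {
  const parts = fullName.trim().split(/\s+/);
  if (parts.length < 2) {
    // a single word: treat it as a first name only
    return { firstName: parts[0] || '', lastName: '' };
  }
  if (FAMILY_NAME_FIRST.has(nationality)) {
    // family name leads; everything after it is the given name
    return { firstName: parts.slice(1).join(' '), lastName: parts[0] };
  }
  // default: family name trails; everything before it is the given name
  return {
    firstName: parts.slice(0, -1).join(' '),
    lastName: parts[parts.length - 1],
  };
}

console.log(splitName('Jon Doe', 'usa'));
console.log(splitName('Kovács Anna', 'hungary'));
```

Even this simple version glosses over names written in the other order for a foreign audience, so a tag alone would not settle every case.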

spencermountain avatar Nov 28 '16 16:11 spencermountain