compromise
International firstname/lastname parser
Example: "Jon doe is a geek". If I use people() or topics() on this sentence, it returns null instead of "Jon doe".
hey there, this is the result I get for this, on v5.1.0:
console.log(nlp.sentence("Jon doe is a geek").people());
// [ Person {
//     text: 'Jon',
//     ...
// }]
it also works on nlp.text(). Can you double-check this? cheers
hey there, thanks for the quick reply. I've checked, and it's actually my mistake: I was using Indian names, since I'm Indian, and that's why it didn't work for me.
ah, yeah sorry, I used an american name-frequency list, i think - just being lazy ;). we should begin augmenting it with something like this data: https://en.wikipedia.org/wiki/List_of_most_popular_given_names
My feeling is that family names in India in particular are easily recognizable, or high-frequency, though I'm not sure why. Maybe that's something that would help as well. i'll keep this open while I check it out. thanks!
Welcome, and can you reply to me when you solve it?
I haven't looked at the license closely, but I believe it's the GNU Lesser General Public License (LGPL), so it should be fine to use. The latest version of the ANNIE plugin in GATE Developer has a list of male and female first names, along with a list of ambiguous male and female first names.
That might be a good place to start.
https://gate.ac.uk/gate/plugins/ANNIE/resources/gazetteer/
- person_male.lst
- person_male_ambig.lst
- person_male_ambig_cap.lst
- person_male_cap.lst
- person_male_lower.lst
- person_male_lower_ambig.lst
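As a rough sketch of how lists like these could feed a name lookup while keeping the ambiguous entries apart: the helper names, the `FirstName`/`MaybeFirstName` labels, and the sample names below are all hypothetical, and the code assumes one plain name per line, which may not match the real GATE file format.

```javascript
// Hypothetical helper: turn the raw text of a name list into a Set.
// Assumes one name per line; blank lines are skipped and case is ignored.
function parseNameList(text) {
  const names = new Set();
  for (const line of text.split(/\r?\n/)) {
    const name = line.trim().toLowerCase();
    if (name) names.add(name);
  }
  return names;
}

// Classify a word against the unambiguous and ambiguous lists.
// The returned labels are made up for this sketch, not real compromise tags.
function classifyFirstName(word, certain, ambiguous) {
  const key = word.trim().toLowerCase();
  if (certain.has(key)) return 'FirstName';
  if (ambiguous.has(key)) return 'MaybeFirstName';
  return null;
}

// Tiny inline samples standing in for the real list files (made-up contents).
const certain = parseNameList('Rahul\nDavid\n\nSpencer\n');
const ambiguous = parseNameList('Will\nMark\n');

console.log(classifyFirstName('Rahul', certain, ambiguous));
console.log(classifyFirstName('Mark', certain, ambiguous));
console.log(classifyFirstName('geek', certain, ambiguous));
```

Keeping the ambiguous names in their own set lets a tagger demand extra evidence (capitalization, a following surname) before treating words like "Will" or "Mark" as people.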
LOVE IT! hey david, wanna work on them in v7? I've made a bunch of changes o'er there.
so i'm really excited about getting many of these in. We need to be careful about filesize and ambiguous terms - more careful than GATE is - but this data is excellent.
what do you wanna do? wanna split-up the work somehow?
--oooh, let's tag sports teams as organizations. I can add an extra pos-tag for Team or something.
yeah, let me take a crack at Organization. I'm on it.
I'll see if I have time. I'm going to take a crack at some documentation clean-up, next.
I've gotten around to doing some of this today; it's a pretty massive project. So far it's mostly some compression and redundancy clean-up in the femaleName list, and an early lastname list. @kahwee has mentioned that firstname/lastname order is reversed in China, Japan, and Korea, and apparently in Hungary too.
name-parsing is becoming a pretty big part of this project, and we may need to do a 'nationality-tagging' of names in order to parse them properly - a bit scared of this: http://crookedtimber.org/2004/03/01/whats-in-the-order-of-a-name/
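To make the ordering problem concrete, here is a minimal sketch of splitting a full name once some nationality or locale tag is already known. Everything in it is an assumption for illustration: the `splitName` function, the locale codes, and the rule that the first token is the family name in the listed locales. It is not how compromise parses names, and real names are far messier than this.

```javascript
// Hypothetical locale tags where the family name is written first,
// based on the countries mentioned above (China, Japan, Korea, Hungary).
const FAMILY_NAME_FIRST = new Set(['zh', 'ja', 'ko', 'hu']);

// Split a full name into firstName/lastName given a locale tag.
// Naive on purpose: the family name is a single token, the rest is the given name.
function splitName(fullName, locale) {
  const parts = fullName.trim().split(/\s+/);
  if (parts.length < 2) {
    return { firstName: parts[0] || null, lastName: null };
  }
  if (FAMILY_NAME_FIRST.has(locale)) {
    return { firstName: parts.slice(1).join(' '), lastName: parts[0] };
  }
  return {
    firstName: parts.slice(0, -1).join(' '),
    lastName: parts[parts.length - 1],
  };
}

console.log(splitName('Jon Doe', 'en'));
console.log(splitName('Kovács Anna', 'hu'));
```

The hard part is not the split itself but deciding the tag in the first place, which is why a nationality guess from the name lists would have to come before this step.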