incorporate cross-lingual "40,000 names" library

Open mpadge opened this issue 5 years ago • 6 comments

Hi @lmullen, got a random Q for you. I recently needed a non-English name-to-gender categorizer and was surprised how few there were even for English. So I built a wrapper around some really impressive C code based on 40,000 names (that site is in German, but names and languages span most of the global north, including Asia). I've currently bundled the code into this repo, but suspect it would be useful as, or in, a package in its own right. And so: What would you think about me PR-ing the lot into this package?

You can check out the current readme for a demo, and easily take it for a spin yourself. (Note that it might not work because I have to figure out how to get a compiled object to read a statically-packaged external file. If it just returns "INTERNAL_ERROR", then change this line to "./inst/dict/nam_dict.txt", and you should be good.)

The obvious problem

Your package is currently quite specifically aimed at accessing and utilising a single source of historical data. What I'm proposing is quite a different kind of functionality, and one which would change how the overall aim of this package is positioned. It would then become more like genderizer, but I am definitely not going to PR my code into there, because that's a simple wrapper around a highly commercial API that only allows 1,000 free requests before requiring outrageous costs.

The primary benefits

  • This code is truly multi-lingual, covering at least all languages listed here; and
  • It's blindingly fast, as demo'd on current readme, at around 100,000 names per second.

Let me know what you think, and I'll completely understand if you'd rather not. I guess in that case I'd just spin up yet another R gender package, but hope not to have to do that.

mpadge, Jun 18 '19 14:06

@mpadge I'd be glad to take a PR that incorporates this source. It would have to fit with the existing user API, though. If you poke around you'll see that the gender() function is a wrapper around more specific functions such as gender_ssa() that incorporate various data sources. (Including one for genderizer.) Do you think you could send a PR along those lines?

If it is entirely impossible to do so, e.g., if there is no date information and no obvious way to incorporate it into the existing API, then I think a separate function in this package could work. But ideally that separate function would mimic the existing API to the extent that it makes sense to do so.

Does all that make sense? What do you think?

lmullen, Jun 18 '19 16:06

Sounds good, let's go for it. It should be pretty straightforward to just add a new entry to gender(... method = <new_method>) and go from there. That would minimally just need an extended note to the effect that the "years" parameter has no effect for this method, and also that "countries" includes a heap more, but only for this method. Other than that, I envisage 3 initial problems:

  1. No idea what to call the method, since the library itself is helpfully called "0717-182", and the code itself just "gender". So we'll have to think of a better name, for which I'd suggest something like "40k", but open to any suggestions, particularly those that more clearly indicate that this adds non-English-speaking abilities to the package.
  2. ~~I have to solve the issue of linking a compiled source object in an R package with an external (non-compiled) object located elsewhere in the package. I'll start by asking on slack once I've played a bit more.~~ :ballot_box_with_check:
  3. There may be a killer stumbling block in this: The files contain a heap of characters rarely encountered in the English language, including each and every accented character from every European language. They were encoded in ISO-8859-1, which I converted to UTF-8, but ... most modern compilers still flag warnings because these characters are not supposed to appear within code itself. These warnings may prevent this from ever being CRAN-acceptable, although of course I hope not. (And the characters can't be put anywhere else without necessitating some kind of conversion to native.enc, which could potentially obliterate functionality, so they have to stay in the source files).
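The re-encoding step mentioned in item 3 can be sketched in Python. This is only an illustration of converting ISO-8859-1 bytes to UTF-8, not the conversion actually applied to the library's files; the helper name `latin1_to_utf8` and the sample names are made up for the example.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper)."""
    # Every byte value is a valid ISO-8859-1 character, so decoding cannot fail.
    return data.decode("iso-8859-1").encode("utf-8")


# Made-up sample names containing accented characters.
sample = "Jörg François Åsa".encode("iso-8859-1")
converted = latin1_to_utf8(sample)

# Accented characters take one byte in ISO-8859-1 but two in UTF-8,
# so the converted data is longer than the original.
print(len(sample), len(converted))
```

The bytes change, but the characters are still non-ASCII, which is why a conversion like this would not by itself silence compiler warnings about such characters in source files.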

I'll report back soon ... :smile: :+1:

mpadge, Jun 18 '19 18:06

Lincoln, just checking one further issue with you: The gender library returns only the identified gender, so most columns of the current return object from your gender() function will simply be empty - no proportions, and no years. Please let me know whether you'd be okay with that. The library nevertheless contains additional data quantifying the relative frequencies of names in each cultural domain, on a log-2 13-point scale where 10 equates to 2% of the population. I'm not sure how (or whether at all) that could or should be incorporated in output formats at present, but if you could see a use and/or place for such additional data, we can work out a way.
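As a rough reading of that scale, here is a Python sketch that turns a score into an approximate population share. It assumes that each point on the scale corresponds to a factor of two, anchored at a score of 10 meaning 2%; the thread states only the anchor and that the scale is log-2, so the exact mapping and the function name `population_share` are assumptions.

```python
def population_share(score: int) -> float:
    """Approximate population share for a frequency score.

    Assumption: one point on the scale is a factor of two, anchored at
    score 10 = 2% of the population.
    """
    return 0.02 * 2.0 ** (score - 10)


# Under this assumption, one point up doubles the share and one point down halves it.
for score in (9, 10, 11):
    print(score, population_share(score))
```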

Additional important point for you to note before we proceed: The internal name dictionary is 4.1MB, bloating your current package out to about 4.4MB. That's still under the standard CRAN limit (which can be circumvented with justification anyway), but this nevertheless has the primary consequence that future developments of the package will have to ensure that it does not grow too much more in total size.

mpadge, Jun 21 '19 10:06

Seems like it would be a square peg in a round hole to try to fit this new dataset into the existing gender(). What do you think about creating it as a separate function that imitates the existing API where appropriate but does its own thing as necessary?

In terms of space: can the heavy components be put into the ropensci/genderdata package? That package is hosted by rOpenSci and is installed the first time the user needs it. That makes sure we can stay under the CRAN limitations. You can see how that works here.

lmullen, Jun 24 '19 15:06

2nd Q first: Yes, that should be possible, and would be a much better idea. It'll still have to go in inst, because it has to be packaged precisely as is so it can be called directly from C with no attempts to transform or translate (encodings). But that's all straightforward.

First Q: Yes, implementing a separate function also sounds easier. Note also that this code has the ability to allocate singular genders to arbitrarily long sequences of text. This obviously only works when a single unambiguously gendered name is present, but is really useful for the case I needed it for of throwing thousands of street names at it to determine which are gendered, and what the genders are. So in that sense, the input data are also different, and that indeed extends flexibility beyond your gender() function alone.
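The idea of assigning a gender to a longer piece of text only when exactly one unambiguously gendered name is present can be sketched in Python as below. This is not how the C library does it; the toy dictionary `TOY_NAMES`, the function `gender_of_text`, and the street names are all hypothetical.

```python
import re

# Hypothetical toy dictionary; the real library ships its own name file.
TOY_NAMES = {
    "marie": "female",
    "anna": "female",
    "karl": "male",
    "otto": "male",
    "kim": "unisex",
}


def gender_of_text(text: str) -> str:
    """Return a gender only if exactly one known name occurs and it is gendered."""
    # Split the text into runs of letters (Unicode-aware), ignoring case.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = [TOY_NAMES[t] for t in tokens if t in TOY_NAMES]
    if len(hits) == 1 and hits[0] != "unisex":
        return hits[0]
    return "unknown"


for street in ("Marie-Curie-Straße", "Karl-Marx-Allee", "Anna-und-Otto-Weg", "Hauptstraße"):
    print(street, gender_of_text(street))
```

Text with no known name, with several known names, or with only an ambiguous name comes back as "unknown" in this sketch.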

Note also that I've been playing around with returning probabilities, but it's rather fiddly as the main C code is entirely built on bit-wise operations (which is why it is so fast!), and I have to modify the bit sequences to embed the extra info. I'm not sure this is really worth it - do you think the probabilities are actually used a lot? Or do you think that at least as a first cut it would suffice to just insert the direct functionality to allow genderizing arbitrary sequences of text in pretty much any arbitrary global language?
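To illustrate what embedding extra information in a bit sequence can look like, here is a Python sketch that packs a gender code and a frequency score into one integer and reads them back with masks and shifts. The layout (2 bits for gender, 4 bits for frequency) and all names are hypothetical and are not taken from the library's C code.

```python
from typing import Tuple

# Hypothetical layout: the low 2 bits hold the gender, the next 4 bits a frequency score.
GENDER_BITS = 2
GENDER_MASK = (1 << GENDER_BITS) - 1  # 0b11
FREQ_MASK = 0b1111                    # room for scores 0..15

GENDER_CODES = {"unknown": 0, "female": 1, "male": 2, "unisex": 3}
GENDER_NAMES = {code: name for name, code in GENDER_CODES.items()}


def pack(gender: str, freq: int) -> int:
    """Combine a gender and a frequency score into a single integer."""
    if not 0 <= freq <= FREQ_MASK:
        raise ValueError("frequency score does not fit in 4 bits")
    return (freq << GENDER_BITS) | GENDER_CODES[gender]


def unpack(packed: int) -> Tuple[str, int]:
    """Recover the gender and frequency score from a packed integer."""
    gender = GENDER_NAMES[packed & GENDER_MASK]
    freq = (packed >> GENDER_BITS) & FREQ_MASK
    return gender, freq


packed = pack("female", 10)
print(bin(packed), unpack(packed))
```

Code that only needs the gender can keep masking the low bits and ignore the rest, which is one reason a scheme like this can add information without changing the existing lookups.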

mpadge, Jun 24 '19 17:06

As a first cut just predictions would be good, I think. But I do think the probabilities are useful.

lmullen, Jun 24 '19 21:06