text2gender
Predict the author's gender from their text.
Author gender classification from text.
Use at your own risk: this project is not well supported or documented.
Trained on Reddit posts from r/AskMen and r/AskWomen. If I may say so myself, this is a clever, albeit lazy, way to get labelled data. Training was done on posts taken directly from those two subreddits, which introduces its own set of biases. For example, women who post on r/AskWomen may write in a distinctive style inside the subreddit but not outside of it. To rectify this, you could instead find "women" users from r/AskWomen but use their posts from outside r/AskWomen, ideally from a subreddit that both men and women visit, such as r/AskReddit.
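The relabelling idea above could be sketched roughly as follows. This is not part of the project: the record shape (dicts with `author`, `subreddit` and `text` keys) and the function names `label_authors` and `out_of_subreddit_examples` are assumptions made for illustration, and authors seen in both labelling subreddits are simply dropped.

```python
# Hypothetical mapping from labelling subreddit to label (assumption, not project code).
LABEL_SUBREDDITS = {"AskMen": "male", "AskWomen": "female"}


def label_authors(posts):
    """Return {author: label} for authors seen in exactly one labelling subreddit."""
    labels = {}
    ambiguous = set()
    for post in posts:
        label = LABEL_SUBREDDITS.get(post["subreddit"])
        if label is None:
            continue  # post is not from a labelling subreddit
        author = post["author"]
        # Remember the first label seen; flag the author if a later one disagrees.
        if labels.setdefault(author, label) != label:
            ambiguous.add(author)
    return {a: lab for a, lab in labels.items() if a not in ambiguous}


def out_of_subreddit_examples(posts, target="AskReddit"):
    """Collect (text, label) pairs from `target`, labelled by where the author posts elsewhere."""
    labels = label_authors(posts)
    return [
        (post["text"], labels[post["author"]])
        for post in posts
        if post["subreddit"] == target and post["author"] in labels
    ]
```

Posting in r/AskWomen does not guarantee that the author is a woman, so labels obtained this way remain noisy.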
The accuracy on real-world data still needs further investigation.
length | accuracy | examples |
---|---|---|
< 250 | 67.56% | 48481 |
200 to 500 | 66.02% | 30715 |
500 to 1000 | 69.22% | 13600 |
1000 to 2000 | 72.99% | 3654 |
> 2000 | 76.96% | 599 |
- | - | - |
male below 250 | 65.98% | 23527 |
male 200 to 500 | 65.2% | 15275 |
male 500 to 1000 | 66.51% | 6346 |
male 1000 to 2000 | 69.99% | 1656 |
male above 2000 | 73.08% | 286 |
- | - | - |
female below 250 | 69.06% | 24954 |
female 200 to 500 | 66.83% | 15440 |
female 500 to 1000 | 71.59% | 7254 |
female 1000 to 2000 | 75.48% | 1998 |
female above 2000 | 80.51% | 313 |
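A breakdown like the one in the table can be produced by grouping predictions by text length. The following is a minimal sketch, not the project's evaluation code: the bucket boundaries, the unit of length, and the names `bucket_label` and `accuracy_by_length` are all assumptions.

```python
from collections import defaultdict

# Hypothetical (exclusive upper bound, label) pairs; the project's real buckets may differ.
BUCKETS = [
    (250, "< 250"),
    (500, "250 to 500"),
    (1000, "500 to 1000"),
    (2000, "1000 to 2000"),
]


def bucket_label(length):
    """Return the label of the first bucket whose upper bound exceeds `length`."""
    for upper, label in BUCKETS:
        if length < upper:
            return label
    return ">= 2000"


def accuracy_by_length(records):
    """Compute accuracy per bucket from (text_length, predicted, actual) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for length, predicted, actual in records:
        label = bucket_label(length)
        total[label] += 1
        if predicted == actual:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}
```

Filtering the records by the actual label before calling `accuracy_by_length` would give per-gender rows like those in the lower part of the table.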
Use
- Install pipenv and learn how to use it.
- Download the required dependencies: `pipenv install`
- Install the required NLTK data: `pipenv run python3 -m textblob.download_corpora lite`
- Predict gender by piping in a text file: `cat some_text.txt | pipenv run python3 predict.py`
  This should print a value between 0 and 1: male if above 0.5, otherwise female.
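The threshold rule for reading the prediction can be written as a small helper. This is only a sketch: `label_from_score` is a hypothetical name that does not exist in the project, and it assumes the printed value has already been parsed into a float.

```python
def label_from_score(score: float) -> str:
    """Map a score in [0, 1] to a label: above 0.5 is male, otherwise female."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"expected a score between 0 and 1, got {score!r}")
    return "male" if score > 0.5 else "female"
```

Scores close to 0.5 are best read as uncertain rather than as a firm label, given the accuracy figures reported above.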
Train your own model (not required).
- Install the required developer dependencies (also ensure you have sqlite3 installed): `pipenv install --dev`
- Install the required NLTK data: `pipenv run python3 -m textblob.download_corpora lite`
- Run `pipenv run python3 download.py` to download Reddit posts using the PushShift API. This runs until you interrupt the process. I recommend around 200k posts. The posts are saved to `data.db` using sqlite3 under a `posts` table.
- Run `pipenv run python3 transform.py` to transform the posts into training data. The output is stored in `data.db` under the `examples` table.
- Run `pipenv run python3 generate_model.py` to train and test the model. The model weights are saved to `data/model_weights.json` and `data/model_biases.json`.
- Predict gender by piping in a text file: `cat some_text.txt | pipenv run python3 predict.py`
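Because the download step runs until it is interrupted, it can help to check how many rows have been collected so far. The sketch below uses only Python's standard `sqlite3` module and the table names mentioned above; `count_rows` is a hypothetical helper, and nothing is assumed about the columns of either table.

```python
import sqlite3
from contextlib import closing

# Table names taken from the steps above; anything else is rejected so the
# name can be interpolated into the query safely.
ALLOWED_TABLES = {"posts", "examples"}


def count_rows(db_path: str, table: str) -> int:
    """Return the number of rows in `table` of the SQLite database at `db_path`."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    # closing() makes sure the connection is closed when the block exits.
    with closing(sqlite3.connect(db_path)) as conn:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count
```

For example, `count_rows("data.db", "posts")` could be called from a second terminal while the download is running, and `count_rows("data.db", "examples")` after the transform step.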