context_semantic_axes
Discovering Differences in the Representation of People using Contextualized Semantic Axes
Spreadsheets containing vocabulary and subreddits.
Code
Meta
- `helpers.py`: helper functions
Dataset
- `scrape_pushshift.py`: for downloading all of Reddit
- `filter_reddit.py`: creating the Reddit datasets
- `forum_helpers.py`: organize forum data
- `gram_counting.py`: count all unigrams and bigrams in the dataset (see the sketch below)
- `count_viz.ipynb`: verifying that our dataset matches patterns from Ribeiro et al.
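As a rough illustration of the unigram/bigram counting step, here is a minimal sketch; the function name and input format are hypothetical, and the real `gram_counting.py` also handles dataset-specific tokenization and file formats.

```python
from collections import Counter

def count_grams(tokenized_posts):
    """Count unigrams and bigrams over an iterable of token lists (illustrative only)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_posts:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return unigrams, bigrams

# toy usage
uni, bi = count_grams([["the", "red", "pill"], ["the", "pill"]])
print(uni.most_common(2), bi.most_common(1))
```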
Vocabulary
- `data_sampler.py`: reservoir sampling of examples for NER evaluation and for context-level manosphere analyses (see the sketch below)
- `evaluate_ner.py`: evaluate NER based on human-annotated data
- Some scripts from booknlp multilingual for running the NER model on the entire dataset
- `find_people.py`: to read in NER output, inspect glossary words, and create a spreadsheet for manual annotation
- `people_viz.ipynb`: for examining vocab
- `lexical_change.py`: for creating time series of words
- `k_spectral_centroid.py`: for visualizing how words relate to waves of different communities
- `time_series_plots.ipynb`: for examining time series for vocab
- `coref_forums.py`, `coref_reddit_control.py`, `coref_reddit.py`, `coref_dating.py`: running coref on different forum/Reddit datasets
- `coref_job_files.py`: creates job files for coref
- `coref_helper.py`: analyzes coref output
- `coref_viz.ipynb`: figuring out gender inference steps
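The reservoir sampling used by `data_sampler.py` can be illustrated with a generic Algorithm R sketch; this is not the script's actual code, and the stream and seed handling here are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # item replaces a reservoir slot with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g., sample 100 contexts from a large line-per-example file:
# sample = reservoir_sample(open('contexts.txt'), 100)
```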
Building and validating semantic axes
- `setup_semantics.py`: finds occupation pages and creates WordNet axes (see the sketch below)
- `wikipedia_embeddings.py`: getting adjective and occupation embeddings from Wikipedia
- `axis_substitutes.py`: getting "good" contexts for adjectives in Wikipedia sentences
- `validate_semantics.py`: functions for applying axes to the occupation dataset (this contains the functions for loading axes)
- `axes_occupation_viz.ipynb`: evaluate axes on occupation data
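The WordNet axes start from adjective antonym pairs; a minimal NLTK sketch of collecting such pole pairs is below. The pole expansion and filtering that `setup_semantics.py` actually performs are more involved.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def antonym_pole_pairs():
    """Yield unique (adjective, antonym) lemma pairs from WordNet (illustrative only)."""
    seen = set()
    for synset in wn.all_synsets('a'):  # head adjective synsets
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                pair = tuple(sorted((lemma.name(), ant.name())))
                if pair not in seen:
                    seen.add(pair)
                    yield pair

# e.g., ('bad', 'good'), ('cold', 'hot'), ...
```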
`wikipedia/substitutes/bert-default` can be found here.
`wikipedia/substitutes/bert-base-prob` can be found here. You will need both this and `bert-default`, since we back off to `bert-default` in cases where words are split into wordpieces.
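A minimal sketch of that back-off logic, assuming both downloads have been loaded into `{word: vector}` dictionaries (the dictionary structure and tokenizer choice are assumptions, not the repository's actual interface):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def pick_vector(word, prob_vecs, default_vecs):
    """Prefer the bert-base-prob vector for single-wordpiece words,
    otherwise back off to the bert-default vector."""
    if len(tokenizer.tokenize(word)) == 1 and word in prob_vecs:
        return prob_vecs[word]
    return default_vecs[word]
```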
The z-scored versions of these vectors are much better than their original versions:
```python
from tqdm import tqdm
from validate_semantics import load_wordnet_axes, get_poles_bert, get_good_axes

axes, axes_vocab = load_wordnet_axes()
adj_poles = get_poles_bert(axes, 'bert-base-prob-zscore')
good_axes = get_good_axes()  # get axes that are self-consistent

for pole in tqdm(adj_poles):
    if pole not in good_axes:
        continue
    left_vecs, right_vecs = adj_poles[pole]
    # each pole is the mean of its adjectives' contextualized vectors
    left_pole = left_vecs.mean(axis=0)
    right_pole = right_vecs.mean(axis=0)
    # the axis ("microframe") is the difference between the two poles
    microframe = right_pole - left_pole
```
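Each microframe can then be used to score a contextualized embedding by cosine similarity, e.g. (a sketch with hypothetical names, not a function from this repository):

```python
import numpy as np

def axis_score(embedding, microframe):
    """Cosine similarity with the microframe; higher values lean toward the right pole."""
    return np.dot(embedding, microframe) / (
        np.linalg.norm(embedding) * np.linalg.norm(microframe))
```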
Semantic differences and change
- `prep_embedding_data.py`: prep data for getting embeddings
- `reddit_forum_embeddings.py`: get term-level embeddings for Reddit/forums
- `apply_semantics.py`: apply axes to Reddit and forum embeddings (see the sketch below)
- `semantics_viz.ipynb`: visualizing the semantic axes' output
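As a rough sketch of what applying the axes amounts to, every term embedding is scored against every retained microframe by cosine similarity; the array shapes and names below are assumptions, not `apply_semantics.py`'s actual interface.

```python
import numpy as np

def score_terms(term_embeddings, microframes):
    """term_embeddings: (n_terms, d) array; microframes: {axis_name: (d,) vector}.
    Returns {axis_name: (n_terms,) array of cosine scores}."""
    X = term_embeddings / np.linalg.norm(term_embeddings, axis=1, keepdims=True)
    return {name: X @ (mf / np.linalg.norm(mf)) for name, mf in microframes.items()}
```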