context_semantic_axes
Discovering Differences in the Representation of People using Contextualized Semantic Axes

Spreadsheets containing the vocabulary and subreddits.

Code

Meta

  • helpers.py: shared helper functions used across the other scripts

Dataset

  • scrape_pushshift.py: for downloading all of Reddit
  • filter_reddit.py: for creating the Reddit datasets
  • forum_helpers.py: for organizing forum data
  • gram_counting.py: for counting all unigrams and bigrams in the dataset (see the sketch after this list)
  • count_viz.ipynb: for verifying that our dataset matches patterns from Ribeiro et al.
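
The core of the gram counting step can be sketched as follows; this is a minimal, hedged example assuming one post per line in a plain-text file (the filename and whitespace tokenization are placeholders, not necessarily what gram_counting.py uses):

from collections import Counter

def count_grams(path):
    # Count unigrams and bigrams in a file with one post per line.
    unigrams, bigrams = Counter(), Counter()
    with open(path) as f:
        for line in f:
            tokens = line.lower().split()  # placeholder tokenizer
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

unigrams, bigrams = count_grams('reddit_posts.txt')  # hypothetical input file
print(unigrams.most_common(10))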

Vocabulary

  • data_sampler.py: for reservoir sampling of examples, both for NER evaluation and for context-level manosphere analyses (see the sketch after this list)
  • evaluate_ner.py: for evaluating NER based on human-annotated data
  • Some scripts from booknlp multilingual for running the NER model on the entire dataset
  • find_people.py: for reading in NER output, inspecting glossary words, and creating a spreadsheet for manual annotation
  • people_viz.ipynb: for examining the vocabulary
  • lexical_change.py: for creating time series of words
  • k_spectral_centroid.py: for visualizing how words relate to waves of different communities
  • time_series_plots.ipynb: for examining time series for the vocabulary
  • coref_forums.py, coref_reddit_control.py, coref_reddit.py, coref_dating.py: for running coreference resolution on the different forum/Reddit datasets
  • coref_job_files.py: for creating job files for coref runs
  • coref_helper.py: for analyzing coref output
  • coref_viz.ipynb: for working out the gender inference steps
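
Reservoir sampling draws a fixed-size uniform sample from a stream in a single pass, without knowing the stream's length in advance, which makes it suitable for sampling contexts from very large dumps. A minimal standalone sketch (data_sampler.py's actual interface differs):

import random

def reservoir_sample(stream, k, seed=0):
    # Return k items drawn uniformly at random from an iterable of unknown length.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), which leaves every
            # item seen so far with an equal chance of being in the sample.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=100)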

Building and validating semantic axes

  • setup_semantics.py: for finding occupation pages and creating WordNet axes (see the sketch after this list)
  • wikipedia_embeddings.py: for getting adjective and occupation embeddings from Wikipedia
  • axis_substitutes.py: for getting "good" contexts for adjectives in Wikipedia sentences
  • validate_semantics.py: functions for applying axes to the occupation dataset (this also contains the functions for loading axes)
  • axes_occupation_viz.ipynb: for evaluating axes on the occupation data
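
The WordNet axes pair antonymous adjectives as opposing poles. Here is a hedged sketch of how such pole pairs can be pulled from WordNet with NLTK; expanding each pole with its satellite adjectives via similar_tos() follows the SemAxis-style construction and may differ from setup_semantics.py's exact logic:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def wordnet_axes():
    # Yield (left_pole, right_pole) word sets built from adjective antonym pairs.
    seen = set()
    for synset in wn.all_synsets(pos='a'):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                pair = tuple(sorted([lemma.name(), ant.name()]))
                if pair in seen:
                    continue
                seen.add(pair)
                # Expand each pole with similar ("satellite") adjectives.
                left = {lemma.name()} | {l.name() for s in synset.similar_tos()
                                         for l in s.lemmas()}
                right = {ant.name()} | {l.name() for s in ant.synset().similar_tos()
                                        for l in s.lemmas()}
                yield left, right

axes = list(wordnet_axes())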

wikipedia/substitutes/bert-default can be found here.

wikipedia/substitutes/bert-base-prob can be found here. You will need both this and bert-default, since we back off to bert-default for words that are split into wordpieces.
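
The backoff can be implemented by checking whether BERT's tokenizer splits a word into multiple wordpieces. A hedged sketch, where prob_vecs and default_vecs are hypothetical dictionaries loaded from the two downloads above:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def get_vector(word, prob_vecs, default_vecs):
    # Use the bert-base-prob vector when the word is a single wordpiece;
    # otherwise back off to the bert-default vector.
    if len(tokenizer.tokenize(word)) == 1 and word in prob_vecs:
        return prob_vecs[word]
    return default_vecs[word]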

The z-scored versions of these vectors work much better than the originals. For example, to build an axis vector (a "microframe") from each pair of z-scored poles:

from tqdm import tqdm
from validate_semantics import load_wordnet_axes, get_poles_bert, get_good_axes

axes, axes_vocab = load_wordnet_axes()
adj_poles = get_poles_bert(axes, 'bert-base-prob-zscore')
good_axes = get_good_axes()  # get axes that are self-consistent

for pole in tqdm(adj_poles):
    if pole not in good_axes:
        continue
    left_vecs, right_vecs = adj_poles[pole]
    # Average each pole's contextualized vectors, then take the difference:
    # the resulting "microframe" points from the left pole to the right pole.
    left_pole = left_vecs.mean(axis=0)
    right_pole = right_vecs.mean(axis=0)
    microframe = right_pole - left_pole
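
To place an embedding on one of these axes, a common approach is cosine similarity with the microframe. This scoring step is a sketch; word_vec is a placeholder for any contextualized embedding in the same space as the poles:

import numpy as np

def axis_score(word_vec, microframe):
    # Cosine similarity between an embedding and an axis vector:
    # positive scores lean toward the right pole, negative toward the left.
    return np.dot(word_vec, microframe) / (
        np.linalg.norm(word_vec) * np.linalg.norm(microframe))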

Semantic differences and change

  • prep_embedding_data.py: for prepping the data used to get embeddings
  • reddit_forum_embeddings.py: for getting term-level embeddings for Reddit/forums (see the sketch after this list)
  • apply_semantics.py: for applying axes to Reddit and forum embeddings
  • semantics_viz.ipynb: for visualizing the semantic axes' output
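
A minimal sketch of extracting one term's contextualized vector with Hugging Face transformers; the model, the layer choice (last hidden layer), and mean-pooling over wordpieces are assumptions, not necessarily what reddit_forum_embeddings.py does:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def term_embedding(sentence, term):
    # Mean of the last-layer vectors for the wordpieces of `term` in `sentence`.
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    term_ids = tokenizer(term, add_special_tokens=False)['input_ids']
    ids = inputs['input_ids'][0].tolist()
    # Locate the term's wordpiece span inside the tokenized sentence.
    for i in range(len(ids) - len(term_ids) + 1):
        if ids[i:i + len(term_ids)] == term_ids:
            return hidden[i:i + len(term_ids)].mean(dim=0)
    return None  # term not found in the sentence

vec = term_embedding('The moderator removed the post.', 'moderator')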