LatentDirichletAllocation
Topic Detection on Steam Video Games Using Latent Dirichlet Allocation
Overview:
The goal of this project was to use the Latent Dirichlet Allocation model to perform topic detection on video games in the Steam library.
Steam is a video game marketplace with 125 million active users that hosts thousands of popular games for PC. Each game on Steam has a corresponding web page that contains a brief description of the title, some keywords, some technical specifications, and several user-written reviews. Each page also links to twelve games deemed similar by Steam's own recommendation system.
This inherent connectedness can be interpreted as a graph with games as nodes and links as edges. We can then scrape this graph using a breadth-first search. Having obtained the HTML from each page, we do some simple filtering: we strip all tags, extract all alphabetic tokens, and convert them to lowercase. Finally, we use a stoplist to eliminate commonly used words.
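The crawl and the filtering described above can be sketched roughly as follows. This is a minimal illustration and not the code in grab_info_steam.py: the names `tokenize`, `crawl_bfs`, and `get_links` are hypothetical, the tag stripper is a crude regular expression, and a toy in-memory graph stands in for the real "similar games" links.

```python
import re
from collections import deque

TAG_RE = re.compile(r"<[^>]+>")   # crude tag stripper, good enough for a sketch
TOKEN_RE = re.compile(r"[a-z]+")  # runs of alphabetic characters


def tokenize(html, stoplist):
    """Strip tags, lowercase, and keep alphabetic tokens not in the stoplist."""
    text = TAG_RE.sub(" ", html).lower()
    return [tok for tok in TOKEN_RE.findall(text) if tok not in stoplist]


def crawl_bfs(start, get_links, max_pages):
    """Visit pages breadth-first; get_links(page) returns the linked page ids."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for nxt in get_links(page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Toy link graph (hypothetical) standing in for Steam's recommendation links.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl_bfs("a", lambda p: toy_graph[p], max_pages=10))
# ['a', 'b', 'c', 'd']

print(tokenize("<p>Don't miss the <b>RPG</b> combat!</p>", {"the"}))
# ['don', 't', 'miss', 'rpg', 'combat']
```

Note how keeping only alphabetic runs splits contractions such as "don't" into fragments like "don" and "t".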
Once we have the tokens from the pages, we can do topic detection using the LDA model. In the LDA model, we take the approach that each word w in each document is related to some topic z, where there are K possible topics in total. For each document there is a distribution theta over the topics in that document. For each topic there is a distribution phi over the vocabulary, giving the probability of any word given that topic. We place Dirichlet priors on theta and phi with hyperparameters alpha and beta respectively.
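For reference, the generative process can be written compactly as below. This assumes the common convention that alpha is the symmetric prior on theta and beta the symmetric prior on phi. The document count D and the token index i are symbols introduced here for illustration.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,i} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,i} \mid z_{d,i}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,i}})
\end{align*}
```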
In order to learn this model, we can use Gibbs Sampling. Gibbs Sampling is an inference technique that works by repeatedly resampling each latent variable (z in this case) conditioned on the current values of all the others, and estimating the parameters from the resulting samples. In effect, it takes a random walk through the sample space, and after enough iterations the chain converges toward a stationary distribution, which is the posterior distribution over the latent variables given the data.
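As a rough illustration, a collapsed Gibbs sampler for LDA can be sketched in pure Python as below. This is an assumption about the general technique, not a copy of this repository's implementation: the function name `gibbs_lda` and its signature are hypothetical, documents are assumed to be lists of integer word ids, and the estimates are read off the final sample only rather than averaged over post-burn-in samples.

```python
import random


def gibbs_lda(docs, K, V, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling for LDA; docs holds lists of word ids in [0, V)."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens assigned to each topic
    z = []                              # topic assignment of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional p(z = j | all other assignments).
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of theta and phi from the final counts.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
        for k in range(K)
    ]
    return theta, phi


# Tiny toy corpus over a 4-word vocabulary.
toy_docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 0, 1, 1]]
theta, phi = gibbs_lda(toy_docs, K=2, V=4, alpha=0.1, beta=0.01, iters=50)
```

Each row of `theta` and `phi` is a probability distribution, so it sums to one by construction.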
Results:
I obtained the following results on a dataset of 1247 titles, using 25 topics, hyperparameters 0.1 and 0.01 for alpha and beta respectively, and 1100 iterations of which 1000 were spent burning in the sampling chain to a stable distribution.
What you see below is a list for each topic of the words that are most likely to be generated by that topic, and another list for each topic of the documents (game pages on Steam) that had the highest proportion of that topic.
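Lists of this kind can be derived from the learned parameters by sorting. The sketch below assumes phi is stored as one row of word probabilities per topic and theta as one row of topic proportions per document. The helper names and the toy values are hypothetical, and analyze.py may do this differently.

```python
def top_words(phi, vocab, n=15):
    """For each topic, the n words with the highest probability under phi."""
    return [
        [vocab[w] for w in sorted(range(len(row)), key=lambda w: -row[w])[:n]]
        for row in phi
    ]


def top_documents(theta, names, n=15):
    """For each topic, the n documents with the largest share of that topic."""
    num_topics = len(theta[0])
    return [
        [names[d] for d in sorted(range(len(theta)), key=lambda d: -theta[d][k])[:n]]
        for k in range(num_topics)
    ]


# Toy parameters: 2 topics, 3 words, 3 documents (illustrative values only).
toy_phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_vocab = ["rpg", "quest", "zombie"]
toy_theta = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
toy_names = ["Game A", "Game B", "Game C"]

print(top_words(toy_phi, toy_vocab, n=2))
# [['rpg', 'quest'], ['zombie', 'quest']]
print(top_documents(toy_theta, toy_names, n=2))
# [['Game A', 'Game C'], ['Game B', 'Game C']]
```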
Common Words in Topics
Topic 0:
rpg
combat
character
world
quests
system
skills
quest
level
characters
magic
items
heroes
skill
party
Topic 1:
weapons
zombies
dead
guns
zombie
enemies
fps
survival
gun
weapon
shooter
kill
ammo
combat
map
Topic 2:
life
half
needed
wood
coo
counter
strike
source
need
valve
episode
radio
fire
backup
portal
Topic 3:
gb
hd
nvidia
dlc
intel
pack
geforce
radeon
core
amd
edition
ati
windows
series
ghz
Topic 4:
turn
based
strategy
tactical
combat
enemy
units
tactics
missions
campaign
space
games
battle
mission
board
Topic 5:
game
fixed
added
steam
update
release
issues
version
support
bugs
updates
games
fix
mode
comments
Topic 6:
review
helpful
funny
found
reviews
game
people
read
recommended
account
yes
record
products
posted
hrs
Topic 7:
game
time
don
make
ve
re
play
want
need
back
bad
start
ll
things
review
Topic 8:
war
units
strategy
total
battle
campaign
ai
ii
unit
battles
pack
empire
multiplayer
army
men
Topic 9:
levels
level
music
platformer
puzzle
soundtrack
platforming
jump
controls
world
hard
simple
challenging
indie
controller
Topic 10:
enemies
mode
game
action
fun
boss
characters
controller
fighting
attack
combat
character
arcade
enemy
level
Topic 11:
story
characters
adventure
point
character
games
click
voice
series
acting
art
dialogue
plot
game
original
Topic 12:
stealth
metro
funny
prison
shadow
cell
starve
clean
night
la
day
november
people
guards
blood
Topic 13:
sonic
tower
defense
civilization
strategy
games
civ
game
defenders
map
endless
towers
units
research
ai
Topic 14:
space
ship
ships
planet
add
combat
star
galaxy
planets
simulator
train
universe
system
build
fleet
Topic 15:
early
access
review
game
build
april
development
building
world
crafting
community
features
alpha
feedback
version
Topic 16:
hat
cards
card
play
deck
magic
free
pay
pvp
mmo
money
players
online
player
win
Topic 17:
enemy
spotted
city
cities
people
building
simulator
build
simulation
world
tropico
bridge
money
buildings
motion
Topic 18:
game
games
gameplay
good
great
pretty
graphics
feel
experience
find
nice
bit
interesting
player
reviews
Topic 19:
game
players
play
team
player
review
free
friends
multiplayer
people
fun
online
community
match
maps
Topic 20:
racing
cars
lego
car
batman
race
physics
driving
tracks
simulator
track
series
truck
drive
city
Topic 21:
dungeon
items
rogue
enemies
die
roguelike
fun
monsters
generated
run
loot
randomly
random
isaac
dungeons
Topic 22:
nope
horror
atmosphere
story
myst
experience
puzzles
dark
world
arcade
exploration
played
scary
explore
anarchy
Topic 23:
puzzles
puzzle
hidden
object
ve
games
story
solve
adventure
click
objects
post
gb
point
command
Topic 24:
original
wars
star
classic
doom
ii
jedi
ys
knight
force
years
great
played
edition
version
Documents with High Proportion of Topics
Topic 0:
Breath of Death VII
The Book of Legends
Cthulhu Saves the World
Divine Divinity
Sacred 2 Gold
The Incredible Adventures of Van Helsing
Sacred Gold
Divinity II: Developer's Cut
Asguaard
Titan Quest - Immortal Throne
Gothic II: Gold Edition
Knights of Pen and Paper +1 Edition
WAKFU
Torchlight II
The Incredible Adventures of Van Helsing II
Topic 1:
Rising Storm Game of the Year Edition
S.T.A.L.K.E.R.: Call of Pripyat
Dying Light
theHunter: Primal
Red Orchestra: Ostfront 41-45
Serious Sam 3: BFE
No More Room in Hell
National Zombie Park
This War of Mine
Enemy Front
Serious Sam Classic: The First Encounter
Contagion
Dead Pixels
Serious Sam Classic: The Second Encounter
Receiver
Topic 2:
Stronghold Crusader HD
Counter-Strike: Source
Half-Life: Blue Shift
Hatoful Boyfriend
Half-Life 2: Episode Two
Half-Life 2
Half-Life 2: Lost Coast
Half-Life 2: Episode One
Half-Life: Source
Half-Life
Half-Life: Opposing Force
Counter-Strike: Condition Zero
Counter-Strike: Global Offensive
Counter-Strike
Team Fortress Classic
Topic 3:
Street Fighter X Tekken
Call of Duty: Black Ops III
Toren
Ultra Street Fighter IV
Worms Revolution
Worms Reloaded
Legend of Kay Anniversary
The Witcher 3: Wild Hunt
F1 2013
Painkiller Hell & Damnation
Call of Duty: Modern Warfare 2
Worms Ultimate Mayhem
Might & Magic: Heroes VI
Transformers: Fall of Cybertron
Worms Clan Wars
Topic 4:
Mordheim: City of the Damned
Space Hulk
Space Hulk Ascension
WARMACHINE: Tactics
Warhammer 40,000: Regicide
Warhammer 40,000: Armageddon
Talisman: Prologue
Warhammer 40,000: Dawn Of War Winter Assault
Blood Bowl Legendary Edition
Frozen Synapse
Shadowrun: Dragonfall - Director's Cut
Bionic Dues
Chainsaw Warrior
Jagged Alliance 2 Gold
Hell
Topic 5:
Gang Beasts
GameGuru
Pool Nation
Castle Story
BeamNG.drive
Cities XXL
Tabletop Simulator
Universe Sandbox
CDF Ghostship
Planetary Annihilation
Audiosurf 2
Cities: Skylines
Just Cause 3
Spacebase DF-9
Trine 3: The Artifacts of Power
Topic 6:
Plug & Play
Floating Point
The Basement Collection
Super Crate Box
The Impossible Game
Jagged Alliance 2 Gold
Toki Tori
Faerie Solitaire
AudioSurf
Counter-Strike
LEGO Star Wars - The Complete Saga
Half-Life 2: Lost Coast
Universe Sandbox
Sonic the Hedgehog
AdVenture Capitalist
Topic 7:
Vindictus
Streets of Chaos
Always Sometimes Monsters
This War of Mine
Football Manager 2015
Zafehouse: Diaries
Windforge
SUNLESS SEA
The Age of Decadence
Fiesta Online NA
Wasteland 1 - The Original Classic
Broken Age
Elite: Dangerous
Firefall
MapleStory
Topic 8:
Empire: Total War
Crusader Kings II
Men of War: Assault Squad
Hearts of Iron III
Medieval II: Total War
Total War: Shogun 2 - Fall of the Samurai
Napoleon: Total War
Rome: Total War
Panzer Corps
Men of War
Europa Universalis III Complete
To End All Wars
Total War: ROME II - Emperor Edition
Commander: The Great War
Europa Universalis IV
Topic 9:
Giana Sisters: Twisted Dreams
BIT.TRIP Presents... Runner2: Future Legend of Rhythm Alien
Giana Sisters: Twisted Dreams - Rise of the Owlverlord
BIT.TRIP RUNNER
Fly'N
Electronic Super Joy
Fermi's Path
Super Puzzle Platformer Deluxe
Offspring Fling!
Soundodger+
BIT.TRIP BEAT
1001 Spikes
Dustforce DX
Beatbuddy: Tale of the Guardians
JumpJet Rex
Topic 10:
Divekick
Mitsurugi Kamui Hikae
Devil May Cry 3 Special Edition
Kung Fu Strike - The Warrior's Rise
GundeadliGne
Aqua Kitty - Milk Mine Defender
GIGANTIC ARMY
Megabyte Punch
Waves
One Finger Death Punch
Vanguard Princess
Devil May Cry 4
Phantom Breaker: Battle Grounds
Super Galaxy Squadron
Ultratron
Topic 11:
Blackwell Unbound
Blackwell Convergence
The Blackwell Legacy
Edna & Harvey: Harvey's New Eyes
Goodbye Deponia
Blackwell Deception
Dreamfall Chapters
Randal's Monday
Broken Sword 5 - the Serpent's Curse
Broken Sword: Director's Cut
Chaos on Deponia
The Next BIG Thing
Deponia
Monkey Island 2 Special Edition: LeChucks Revenge
The Cat Lady
Topic 12:
Viscera Cleanup Detail: Santa's Rampage
Surgeon Simulator 2013
Prison Architect
Viscera Cleanup Detail: Shadow Warrior
Five Nights at Freddy's 2
Viscera Cleanup Detail
Five Nights at Freddy's
TrackMania Nations Forever
Cook, Serve, Delicious!
100% Orange Juice
The Escapists
I am Bread
Tom Clancys Splinter Cell Blacklist
Octodad: Dadliest Catch
Little Inferno
Topic 13:
Sonic the Hedgehog 2
Sonic 3 and Knuckles
Sonic the Hedgehog
Sid Meier's Civilization: Beyond Earth
Civilization IV: Beyond the Sword
DG2: Defense Grid 2
Sid Meier's Civilization IV: Colonization
Dungeon Defenders
Endless Legend
Revenge of the Titans
Sonic CD
Sonic Generations
Sonic the Hedgehog 4 - Episode II
Sid Meier's Civilization V
Galactic Civilizations II: Ultimate Edition
Topic 14:
Train Simulator 2015
X2: The Threat
X Rebirth
Evochron Mercenary
Distant Worlds: Universe
X3: Reunion
Elite: Dangerous
Space Pirates and Zombies 2
X3: Terran Conflict
Galaxy on Fire 2 Full HD
Out There: Edition
X: Beyond the Frontier
Strike Suit Zero
Horizon
Kerbal Space Program
Topic 15:
Medieval Engineers
Rising World
Oort Online
Stranded Deep
Subnautica
FortressCraft Evolved!
Eden Star :: Destroy - Build - Protect
Savage Lands
GRAV
Predestination
Salt
Landmark
Blockscape
Xsyon - Prelude
Beasts of Prey
Topic 16:
Team Fortress 2
Magic: The Gathering - Duels of the Planeswalkers 2013
Magic 2014 Duels of the Planeswalkers
Kingdoms CCG
Nightbanes
Magic: The Gathering - Duels of the Planeswalkers 2012
SolForge
Magic 2015 - Duels of the Planeswalkers
Might & Magic: Duel of Champions
Pox Nora
Infinity Wars 2014: Animated Trading Card Game
Ragnarok Online 2
BloodRealm: Battlegrounds
Royal Quest
Battlegrounds of Eldhelm
Topic 17:
Battlefield 2: Complete Collection
Moonbase Alpha
Cities in Motion
Train Simulator: South London Network Route Add-On
Tropico 4: Steam Special Edition
The Race for the White House
Train Fever
Cities in Motion 2
Riding Star
Tropico 3 - Steam Special Edition
Masters of the World - Geopolitical Simulator 3
Democracy 3
Banished
SimCity 4 Deluxe Edition
Tropico 5
Topic 18:
Aquaria
Claire
Full Bore
Forward to the Sky
The Old City: Leviathan
The Swapper
NightSky
Brothers - A Tale of Two Sons
A Story About My Uncle
Mind: Path to Thalamus
FRACT OSC
Styx: Master of Shadows
Closure
Skyborn
NaissanceE
Topic 19:
Awesomenauts
Strife
Solstice Arena
HAWKEN
AirMech
Quake Live
Super MNC
Warface
Freestyle2: Street Basketball
AERENA - Masters Edition
Infinite Crisis
GunZ 2: The Second Duel
Blacklight: Retribution
Block N Load
Sins of a Dark Age
Topic 20:
Assetto Corsa
GRID Autosport
RaceRoom Racing Experience
Need For Speed: Hot Pursuit
Euro Truck Simulator 2
GRID 2
Euro Truck Simulator
DiRT Showdown
The Crew
Project CARS
LEGO Batman3: Beyond Gotham
Driver San Francisco
Copa Petrobras de Marcas
LEGO Batman 2 DC Super Heroes
Gotham City Impostors Free to Play
Topic 21:
Ziggurat
Rogue Legacy
Diehard Dungeon
The Binding of Isaac: Rebirth
Our Darker Purpose
Sword of the Stars: The Pit
Full Mojo Rampage
Legend of Dungeon
Dungeons of Dredmor
Enter the Gungeon
Hack, Slash, Loot
Saints Row: Gat out of Hell
A Wizard's Lizard
Dungeonmans
Overture
Topic 22:
Cry of Fear
Anarchy Arcade
Outlast
The Fall
Risk of Rain
Cylne
Among the Sleep
Slender: The Arrival
Amnesia: A Machine for Pigs
Neverending Nightmares
Claire
Botanicula
NaissanceE
Amnesia: The Dark Descent
Passing Pineview Forest
Topic 23:
STAR WARS Battlefront II
Enigmatis: The Ghosts of Maple Creek
Grim Legends: The Forsaken Bride
Time Mysteries 2: The Ancient Spectres
Nightmares from the Deep 2: The Siren`s Call
Nightmares from the Deep: The Cursed Heart
Time Mysteries: Inheritance - Remastered
9 Clues: The Secret of Serpent Creek
Nightmares from the Deep 3: Davy Jones
Demon Hunter: Chronicles from Beyond
Left in the Dark: No One on Board
Clockwork Tales: Of Glass and Ink
Grim Legends 2: Song of the Dark Swan
Enigmatis 2: The Mists of Ravenwood
Abyss: The Wraiths of Eden
Topic 24:
STAR WARS Jedi Knight - Mysteries of the Sith
STAR WARS Jedi Knight - Dark Forces II
STAR WARS Jedi Knight II - Jedi Outcast
STAR WARS Jedi Knight - Jedi Academy
STAR WARS - Dark Forces
STAR WARS - The Force Unleashed II
STAR WARS - The Force Unleashed Ultimate Sith Edition
STAR WARS Knights of the Old Republic II - The Sith Lords
Wolfenstein 3D
Deus Ex: Mankind Divided
Tomb Raider: Anniversary
Oddworld: Abe's Exoddus
STAR WARS Empire at War - Gold Pack
Baldur's Gate II: Enhanced Edition
Final DOOM
Log Likelihood
See the output files for specific learned parameter values.
Simulated Data
I also created a set of simulated data (named test) using some artificial hyperparameters 0.5 and 0.5, 5 topics, and a vocabulary consisting of the letters a through j. On a corpus of 1000 documents, each with 500 words, the model was able to successfully recover the true parameters of the model with 2-3 decimal places of accuracy after running for 1100 iterations with 1000 to burn in. See the sim_data.py file for the details of how the artificial data was generated, and the files true and test_out for the actual and recovered parameters.
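For readers who want to reproduce a similar experiment, synthetic data of this kind can be drawn directly from the LDA generative process. The sketch below is an assumption about how such data could be generated, not the contents of sim_data.py. The function names are hypothetical, and the demo call uses a much smaller corpus than the 1000 documents of 500 words described above so that it runs quickly.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def simulate_corpus(n_docs, doc_len, K, vocab, alpha, beta, seed=0):
    """Generate documents from LDA and return them with the true parameters."""
    rng = random.Random(seed)
    phi = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(K)]
    theta, docs = [], []
    for _ in range(n_docs):
        theta_d = sample_dirichlet(rng, alpha, K)
        # One topic per token, then one word per sampled topic.
        topics = rng.choices(range(K), weights=theta_d, k=doc_len)
        docs.append([rng.choices(vocab, weights=phi[k])[0] for k in topics])
        theta.append(theta_d)
    return docs, theta, phi


letters = list("abcdefghij")  # vocabulary of the letters a through j
sim_docs, true_theta, true_phi = simulate_corpus(
    n_docs=20, doc_len=50, K=5, vocab=letters, alpha=0.5, beta=0.5
)
```

Comparing recovered parameters against `true_theta` and `true_phi` requires matching topics up first, since the sampler may learn them in a different order.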
Discussion:
We can see from the results that the Gibbs Sampler has done a pretty good job of uncovering some latent topics present in the Steam web pages. Note that not all of the topics necessarily correspond to genre; some appear to relate to elements of the reviews, and others are related to technical specifications. I'll now give a more detailed analysis of what I consider to be some of the most interesting and well formulated topics.
Topic 0:
This topic is largely about RPG games, and we can see that clearly from the words associated with it. Interestingly, most of the games in this topic seem to have a divine and/or gothic twist to them.
Topic 1:
This topic has grouped all of the zombie games. We can see strongly correlated words such as zombie, fps, ammo, etc. which we might expect. The games themselves fit this description for the most part, except for Rising Storm, which is a WWII game. Perhaps it's not all that strange, as discussion of Nazi Zombies is much more prevalent on the internet than one might expect, and Rising Storm does in fact have several zombie themed mods.
Topic 2:
This topic consists mostly of games produced by Valve, which is incidentally the company that created Steam. Valve games tend to have strong cult followings, so it makes sense that we would see them appear together here. Also, since Valve created Steam, it is understandable that the games made by Valve would be among the more popular ones on Steam.
Topic 3:
Here we can see all of the words that are related to the system specifications required to play a game. Almost all of the terms are ones that you would expect to see in any discussion of PC specs. In this case, because the topic is not related to any particular genre, the games don't look very similar, but their pages probably all list unusually detailed system requirements.
Topic 5:
This topic describes any notes on the page related to updates, bugs, and patches. Much like topic 3, this has nothing to do with genre, but it's possible that a game with a high proportion of this topic is particularly buggy.
Topic 6:
I like this topic quite a lot as it has nothing to do with the games. Instead, this topic captures all of the content about the gaming community, the reviews that people leave, their accounts, etc. Many of these terms are found in the template for all pages, so there's not much meaning that can be gleaned from the games correlated with it.
Topic 7:
This topic is peculiar as it consists entirely of short meaningless words. In fact, it looks as though this topic is a catchall for any generic tokens that don't really fit into any of the other topics. Effectively, these are words that should be in our stopword list, but aren't. Clearly terms like game and play will show up on almost any page, but they're not common enough in English in general that they were included in our stoplist.
Topic 11:
This topic is another one not directly related to the genre, but rather to a specific component of the review. It specifically discusses the storyline of the game, the characters involved, and even the dialogue and voice acting. One could theoretically use information from topics like this to automatically segment reviews into chunks about story, gameplay, graphics, etc. and use that to perform summarization across multiple reviews.
Topic 19:
Games in this topic are primarily free online multiplayers. We can see from the relevant words that there's a strong emphasis on teams, people, and community. Interestingly, this is one of only three topics in which the word "fun" is among the top listed.
Topic 22:
These are the straight horror games, featuring dark, atmospheric worlds for the player to explore. Humorously, the word "nope" makes the top of the list. The games themselves are all classic PC horror titles including Outlast, Amnesia, and Slender.
Conclusions and Future Work
In summary, we can see that the topic detector did a fantastic job at uncovering many of the basic genres, as well as identifying specific elements of the reviews and descriptions common to all games. It might be interesting to explore splitting the single hidden topic variable into two: one for the genre of the game, and one for the component of the page/review.
As stated above, I think it would also be worthwhile to improve the stoplist to better reflect the particular nomenclature for describing video games. This is especially important as the LDA model does not have a notion of weighting based on frequency. More generally, it might be interesting to see how we could incorporate a weighting scheme into the model, as we already know from the HTML which words are keywords, which are from the official description, and other such categories.
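One simple way to find candidate additions to the stoplist is to look for tokens that appear on most pages regardless of genre. The sketch below uses document frequency for this. It is only a suggestion and not part of the existing scripts: the function name, the threshold, and the toy pages are all hypothetical.

```python
from collections import Counter


def stoplist_candidates(docs, min_fraction=0.9):
    """Tokens that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    cutoff = min_fraction * len(docs)
    return sorted(tok for tok, df in doc_freq.items() if df >= cutoff)


# Toy tokenized pages (illustrative only).
toy_pages = [
    ["game", "rpg", "quest", "play"],
    ["game", "zombie", "gun", "play"],
    ["game", "racing", "car"],
]
print(stoplist_candidates(toy_pages, min_fraction=0.6))
# ['game', 'play']
```

The threshold would need tuning by hand, since genre words that are common on Steam could also pass a low cutoff.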
Usage:
First run grab_info_steam.py to scrape the Steam library for data. You can terminate that process whenever you like, but I found that good results were obtained after scraping ~1200 titles, which may take close to half an hour depending on connectivity.
Then run run-LDA.sh to do inference on the LDA model using the collected data. In the results I have described, I ran it using K=25, alpha=0.1, beta=0.01, T=1100, and burnin=1000. The names of the input and output files are games and steam respectively.
You can then run analyze.py to print out a brief summary of the parameters that the model has learned. It takes as arguments the names file and the output file, which are called games-names and steam respectively by default.