
I created this repository to provide the DH Community a compilation of free, open-source tools for creating and developing digital humanities projects, along with relevant tutorials and examples of pr...

Welcome to the Guide and Introduction to Digital Humanities Toolkit

Contents:

General Guides & Resources | Orientation on the Field | Getting Started | Creating and Maintaining an Online Scholarly Presence (Academic Blogging | Academic Twitter) | DH Project-Making | Digitization | Citing Digital Resources | Teaching Guides | The Job Market for DH Graduate Students | Evaluating DH Scholarship for Academic Promotion and Tenure

General Guides & Resources

Orientation on the Field (selected)

Getting Started with Digital Humanities

Creating and Maintaining an Online Scholarly Presence

DH Project-Making

Digitization

Tutorials for DH Tools and Methods

Contents:

General or Omnibus Technical Tutorials | Audio Editing | Code Versioning (Github) | Command Line (Bash / Powershell) | Content Management Systems | Curation | Data Science | Digitization | Linked Data | Mapping | Network/Social Network Analysis | Programming Languages Tools (Python * R) | Simulation | Text Analysis | Text Collation | Text Encoding | Text Preparation | Topic Modeling | Visualization

General or Omnibus Technical Tutorials

  • The Programming Historian 2 ("tutorial-based open access textbook designed to teach humanists practical computer programming skills that are immediately useful to real research needs"; includes lessons on "Getting Started with Online Sources," "Working with Files and Web Pages," "From HTML to a List of Words," "Computing Frequencies," "Wrapping Output in HTML," "Keywords in Context (KWIC)," "Downloading Multiple Records Using Query Strings," "Automated Downloading with Wget," "Getting Started with Topic Modeling and MALLET")
  • DH Tools for Beginners ("collection of tutorials written for digital humanities novices")
  • M. H. Beals, "Introduction to Digital Analysis Techniques: A Workbook for History Modules" (A set of curricular modules, including how-to's and assignments, for text analysis and related work using the Library of Congress’s Chronicling America on-line newspaper database. Specific topics include: "using the Chronicling America database," "geo-visualisation," "topic modeling," "reprint analysis," "discourse analysis") (direct link to the modules; PDF)
  • Shawn Graham, Ian Milligan, Scott Weingart, The Historian's Macroscope - working title. Under contract with Imperial College Press. Open Draft Version (Autumn 2013)
  • Michelle Moravec, "Five Steps To a Successful Digital History Project" (2014)
  • Lincoln Mullen, "How to Make Prudent Choices About Your Tools" (2013)
  • Miriam Posner, "How Did They Make That?" (2013) ("Many students tell me that in order to get started with digital humanities, they’d like to have some idea of what they might do and what technical skills they might need in order to do it. Here’s a set of digital humanities projects that might help you to get a handle on the kinds of tools and technologies available for you to use")
  • JISC Digital Media Guides ("Need help with using still images, sound and video for educational purposes? Explore our free digital media guides. They will take you through the process of finding, creating, managing, delivering and using digital media")
  • Tech Tips and Tools for Historians of Science (tips and tools for historians of science that are also useful for other humanities scholars)
  • UCLA Center for Digital Humanities, Intro to Digital Humanities (expansive resources based on Johanna Drucker and David Kim's DH 101 course at UCLA; includes concepts & readings, tutorials, exercises, student projects, and advanced topics)
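
Several of the Programming Historian lesson topics listed above ("Computing Frequencies," "Keywords in Context (KWIC)") can be tried out in a few lines of Python. The following is a minimal sketch and not the lessons' own code; the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(words):
    """Count how often each word occurs."""
    return Counter(words)


def kwic(words, keyword, window=2):
    """Return each occurrence of keyword with `window` words of context on each side."""
    lines = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


# Hypothetical sample text, for illustration only.
sample = "The old archive held the letters, and the letters held the story."
words = tokenize(sample)
freqs = word_frequencies(words)
contexts = kwic(words, "letters")

print(freqs.most_common(3))
for line in contexts:
    print(line)
```

A real project would likely need a more careful tokenizer (for hyphenation, accented characters, and so on); the tutorials above cover those choices in detail.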

Audio Editing Tutorials (see Audio Tools)

Code Versioning Systems (Git + Github) Tutorials (for project development) (see Code Versioning Tools)

Command Line Tutorials (see Command Line Tools)

Content Management Systems Tutorials (see Content Management Systems Tools)

Curation Tutorials (and Guides to Citing Datasets)

  • DH Curation Guide ("a community resource guide to data curation in the digital humanities" offering "concise, expert introductions to key topics, including annotated links to important standards, articles, projects, and other resources")
  • Joan Fragaszy Troyano, Guide to Curating Scholarship from the Open Web (2014), Parts I, II, III, IV
  • Alex Ball and Monica Duke, "How to Cite Datasets and Link to Publications" (2011)

Data Science Tutorials

Digitization Tutorials

Linked Data Tutorials

Mapping Tutorials (see Mapping Tools)

Network Analysis Tutorials (see Network Analysis/Social Network Analysis Tools) (see also Visualization: Network Visualization Tutorials)

Programming Languages Tutorials (programming/scripting languages used to facilitate text and data analysis, collection, preparation, etc.) (see Programming Languages Tools & Resources)

Simulation Tutorials (see Simulation Tools)

Text Analysis Tutorials (complemented by Text Preparation for Digital Work Tutorials below) (see Text Analysis Tools)

Text Collation Tutorials (see Text Collation Tools)

Text Encoding Tutorials (see Text Encoding Tools)

Text Preparation for Digital Work Tutorials (Text & Data "Wrangling": text harvesting, scraping, cleaning, classifying) (see Text Preparation Tools)

Topic Modeling Tutorials (see Topic Modeling Tools)

Visualization Tutorials (see Visualization Tools)

Digital Humanities Tools

Online or downloadable tools that are free, free to students, or have generous trial periods without tight usage constraints, watermarks, or other spoilers. Bias toward tools that can be run online or installed on a personal computer without needing an institutional server. (Also see Other Tool Lists)

Note about organization: At present, these tools are organized in an improvised scheme of categories. For the most deliberate and comprehensive taxonomy of digital-humanities activities, objects, and techniques currently available, see TaDiRAH. (See also about TaDiRAH)

Other Tool Lists:

Students may also be interested in online hosting services for their own domains or sites. Some providers offer suites of content management systems like WordPress, Omeka, etc. Providers include:

Animation & Storyboarding Tools

  • Bonsai (tool for programmatic creation of simple animated graphics in a Web browser using "graphics library which includes an intuitive graphics API and an SVG renderer")
  • FrameByFrame (stop-motion animation tool for Mac; creates stop-motion animation videos using any webcam/video camera connected to your Mac, including iSight)
  • Pencil (2D animation software suitable for beginners at animation)
  • Popcorn Maker (creates interactive videos; "helps you easily remix web video, audio and images into cool mashups that you can embed on other websites. Drag and drop content from the web, then add your own comments and links . . . ; videos are dynamic, full of links and unique with every view") | Tutorial by Miriam Posner
  • Scratch (visual programming platform developed by the MIT Media Lab to teach children about programming by allowing them to use a visual interface to create interactive programs, games, etc.; useful for allowing advanced humanities scholars without programming skills to program dynamic, interactive visual scenes and learn about programming logic) | Scratch 2.0 Offline Editor
  • Storyteller ("application from Amazon Studios that lets you turn a movie script into a storyboard. You choose the backgrounds, characters, and props to visually tell a story")

Audio Tools (see Audio Editing Tutorials)

  • Audiotool (free, web-based application for electronic music production; meant to serve as a fully functioning virtual studio. Users drop and drag synthesizers, drum machines, sequencers, filters, samples, and note sequences into the workspace from a toolbar)
  • Augmented Notes ("integrates scores and audio files to produce interactive multimedia websites in which measures of the score are highlighted in time with music")
  • MusicAlgorithms (tools and resources for the creation and analysis of algorithmically-generated music)
  • Paperphone ("interactive audio app that processes vocal performance in realtime, designed for presentations & sound essays")
  • Praat (free software package for phonetic analysis; designed to analyse, synthesize, manipulate, and visualize speech)
  • Sonic Visualiser (program to facilitate study of musical recordings; "of particular interest to musicologists, archivists, signal-processing researchers and anyone else looking for a friendly way to take a look at what lies inside the audio file")

Authoring/Annotation/Editing/Publishing Platforms & Tools (including collaborative platforms) (see also Content Management Systems and Exhibition/Collection/Edition Platforms & Tools)

  • Annotation Studio ("suite of tools for collaborative web-based annotation.... Currently supporting the multimedia annotation of texts... will ultimately allow students to annotate video, image, and audio sources")
  • Brat Rapid Annotation Tool ("online environment for collaborative text annotation"; focused on structured annotation of text, e.g., tagging named entities such as persons, organizations, etc., and their relationships)
  • [CommentPress](http://www.futureofthebook.org/commentpress/) ("open source theme and plugin for the WordPress blogging engine that allows readers to comment paragraph-by-paragraph, line-by-line or block-by-block in the margins of a text. Annotate, gloss, workshop, debate: ... do all of these things on a finer-grained level, turning a document into a conversation")
  • Fold ("a context creation platform for journalists and storytellers, allowing them to structure and craft complex stories"; created at MIT Media Lab)
  • INKE Tools and Prototypes (tools and platforms developed by the INKE project)
  • Interactive Fiction Writing tools and platforms
    • Fungus (open-source "Unity 3D library for creating illustrated interactive fiction games")
    • Inklewriter (free online tool "designed to allow anyone to write and publish interactive stories. It’s perfect for writers who want to try out interactivity, but also for teachers and students looking to mix computer skills and creative writing"; "keeps your branching story organised, so you can concentrate on what’s important – the writing." Also allows export of stories to Kindle with hyperlinks for the interactive features of a story.)
    • Undum ("a game framework for building a sophisticated form of hypertext interactive fiction"; "consists of a HTML file [with CSS stylesheets] and three Javascript files... To create your own game, you edit the HTML file a little..., edit one of the Javascript files [and upload to a web server]") (sample story created in Undum: "Mrs. Wobbles and the Tangerine House," by the Marino Family)
  • NewRadial (visualization interface from the INKE project designed to facilitate studying, commenting on, and social editing of texts)
  • Odyssey (online tool that provides "a simple way for journalists, designers, and creators to weave interactive stories" based on a mapping paradigm; allows for mixing "written narrative, multimedia, and map based interaction")
  • Oppia (Google's tool for making "embeddable interactive educational 'explorations' that let people learn by doing"; "Oppia aims to simulate the one-on-one interaction that a student has with a teacher by capturing and generalizing 'interaction dialogues'"; explorations can contain maps, images, text)
  • Prism ("a tool for 'crowdsourcing interpretation.' Users are invited to provide an interpretation of a text by highlighting words according to different categories, or 'facets.' Each individual interpretation then contributes to the generation of a visualization which demonstrates the combined interpretation of all the users. We envision Prism as a tool for both pedagogical use and scholarly exploration, revealing patterns that exist in the subjective experience of reading a text.")
  • Pullquotes (tool for tweeting full quotations or images on Twitter, for tweeting a stream of quotations, and for collecting Twitter quotations)
  • Scalar (multi-modal authoring platform: "free, open source authoring and publishing platform that’s designed to make it easy for authors to write long-form, born-digital scholarship online. Scalar enables users to assemble media from multiple sources and juxtapose them with their own writing in a variety of ways, with minimal technical expertise required")
  • Scroll Kit (drag-and-drop online platform for creating scrollable multimedia narratives that also scale for mobile device screens; "Make stories people will want to touch. Scroll Kit is a powerful visual content editor . . . typography, images, motion")
  • StoryMapJS ("free tool to help you tell stories on the web that highlight the locations of a series of events; ... you can use StoryMapJS to tell a story with photographs, works of art, historic maps, and other image files. Because it works best with very large images, we call these 'gigapixel' StoryMaps")
  • Twine ("You don't need to write any code to create a simple story with Twine, but you can extend your stories with variables, conditional logic, images, CSS, and JavaScript when you're ready. Twine publishes directly to HTML, so you can post your work nearly anywhere.")

Code Versioning Systems (see Code Versioning Tutorials)

  • GitHub ("collaboration, code review, and code management for open source and private projects"; also used by scholars for non-code projects, e.g., creating documents, syllabi, or any project that benefits from tracking, forking, or roll-back of modular parts contributed by one or more participants)

Command Line Tools (see Command Line Tutorials)

  • The Sourcecaster (set of instructions for using the command line to perform common text preparation tasks--e.g., conversion of text or media formats, wrangling and cleaning text, batch filename editing, etc.)
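
Batch filename editing, one of the tasks mentioned above, can also be scripted outside the shell. The following is a minimal Python sketch of the idea and is not taken from The Sourcecaster, which documents command-line recipes; the `normalize_filenames` function, the renaming rule (lowercase, spaces to underscores), and the sample file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def normalize_filenames(folder):
    """Rename every file in folder: lowercase the name, replace spaces with underscores.

    Note: this sketch does not check whether the new name already exists.
    """
    renamed = []
    # sorted() builds the full listing first, so renaming during the loop is safe.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        new_name = path.name.lower().replace(" ", "_")
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append((path.name, new_name))
    return renamed


# Demonstrate on a throwaway directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Letter 01.TXT", "Letter 02.TXT", "notes.txt"]:
        (Path(tmp) / name).write_text("placeholder", encoding="utf-8")
    changes = normalize_filenames(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(changes)
print(remaining)
```

Before running anything like this on real project files, it is worth working on a copy of the folder first.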

Content Management Systems (see Content Management Systems Tutorials) (see also Authoring/Annotation/Editing Tools)

  • PBWorks (content management system hosted online with strong educational user base; particularly robust as a wiki platform for project or course sites; free education-user licenses)
  • WordPress (content management system based originally on blog paradigm; hosted online or downloadable for installation on local server)

Crowdsourcing Tools

  • AllOurIdeas ("social data collection" wiki platform that solicits information online by survey "while still allowing for new information to 'bubble up' from respondents as happens in interviews, participant observation, and focus groups")

Exhibition/Collection/Edition Platforms & Tools (see also tools for Infographics and Timelines; and selected tools in Mapping)

  • CollectiveAccess ("cataloguing tool and web-based application for museums, archives and digital collections")
  • DH Press (WordPress-based "flexible, repurposable, extensible digital humanities toolkit designed for non-technical users. It enables administrative users to mashup and visualize a variety of digitized humanities-related material, including historical maps, images, manuscripts, and multimedia content. DH Press can be used to create a range of digital projects, from virtual walking tours and interactive exhibits, to classroom teaching tools and community repositories")
  • Exhibit (downloadable software for creating "web pages with advanced text search and filtering functionalities, with interactive maps, timelines, and other visualizations"; part of the Simile Widgets suite)
  • Google Open Gallery (Users must request an invite; "Powerful free tools for artists, museums, archives and galleries ... Easily upload images, videos and audio to create online exhibitions and tell your stories ... Enhance your existing website, or create a brand new one for free ... Very powerful zoom for your beautiful images... Help visitors discover your content using search and filtering options")
  • oldweb.today (online emulator platform from Rhizome that allows users to see what past or present web sites look like in historical browsers going back to the NCSA Mosaic browser)
  • Omeka ("create complex narratives and share rich collections, adhering to Dublin Core standards with Omeka on your server, designed for scholars, museums, libraries, archives, and enthusiasts"; hosted online or downloadable for installation on server) | Getting Started
  • Open Exhibits ("free multitouch & multiuser software initiative for museums, education, nonprofits, and students")
  • Neatline ("allows scholars, students, and curators to tell stories with maps and timelines. As a suite of add-on tools for Omeka, it opens new possibilities for hand-crafted, interactive spatial and temporal interpretation"; downloadable for installation on server)
  • Prezi (alternative to PowerPoint; uses an infinite canvas metaphor rather than a slide metaphor; free online production and viewing version; offline production version by subscription)
  • Silk (online data visualization and exhibition platform; takes datasets input as spreadsheets and allows users to create collections, maps, graphs, etc.)
  • Simile Widgets (embeddable code for visualizing time-based data, including Timeline, Timeplot, Runway, and Exhibition)
  • TextGrid ("a virtual research environment (VRE) for humanities scholars in which various tools and services are available for the creation, analysis, editing, and publication of texts and images"; provides "a variety of tested tools, services, and resources, allowing for the complete workflow of, for example, generating a critical textual edition"; "also supports the storage and re-use of research data through the integration of the TextGrid Repository")
  • ViewShare ("free platform for generating and customizing views--interactive maps, timelines, facets, tag clouds--that allow users to experience your digital collections"; upload spreadsheets or other collection data formats with information about a collection of materials; then configure how and what to show. Visualizations of collections are embeddable on Web pages. Users must request an account)

Internet Research Tools (tools for studying the Internet or parts of the Internet) (this section is heavily indebted to the Digital Methods Initiative at the University of Amsterdam and its collection of tools)

  • Censorship Explorer ("Check whether a URL is censored in a particular country by using proxies located around the world"; a Digital Methods Initiative tool)
  • Compare Lists ("Compare two lists of URLs for their commonalities and differences"; a Digital Methods Initiative tool)
  • Facebook (tools for studying Facebook)
    • Like Scraper ("For each URL entered, this script queries the Facebook api and retrieves the number of likes, shares, comments and clicks for given URLs. The output is a table with the URLs queried and the numbers retrieved"; a Digital Methods Initiative tool)
    • Netvizz ("Extracts various datasets from Facebook"; a Digital Methods Initiative tool)
    • NetvizzToSentiStrength ("uses Sentistrength to analyze the sentiment of short texts. Three types of data can be uploaded: Netvizz, DMI-TCAT, or a regular CSV file"; a Digital Methods Initiative tool)
  • Google (tools for studying Google)
    • Google AutoComplete (retrieves Google autocomplete suggestions according to language and country; a Digital Methods Initiative tool)
    • Google Blog Search Scraper (allows for batch queries of Google Blog Search; "query the resonance of a particular term, or a series of terms, in a set of blogs"; a Digital Methods Initiative tool)
    • Google Image Scraper ("query images.google.com with one or more keywords, and/or use images.google.com to query specific sites for images"; a Digital Methods Initiative tool)
    • Google News Scraper ("The scraper batch queries news.google.com, outputting a table of returns including URL, title, source, city/country, date and teaser text"; a Digital Methods Initiative tool)
    • GoogleScraper ("The Googlescraper ... queries Google and makes the results available for further analysis ... Google will be asked if each keyword occurs in each URL. Results are displayed as a tag cloud and an html table. They also are written to a text file which you can access at the bottom or through previous results ... The most common use of the tool is researching the presence as well as the ranking of particular sources within Google engine results"; a Digital Methods Initiative tool)
  • Harvester ("Extract URLs from text, source code or search engine results. Produces a clean list of URLs"; a Digital Methods Initiative tool)
  • Image Scraper ("scrape images from a single page"; a Digital Methods Initiative tool)
  • Internet Archive Wayback Machine Link Ripper ("Enter a host or URL to retrieve the links to the URL's archived versions at wayback.archive.org. A text file is produced which lists the archive URLs"; a Digital Methods Initiative tool)
  • IssueCrawler ("Enter URLs and the Issue Crawler performs co-link analysis in one, two or three iterations, and outputs a cluster graph....; also has modules for snowball crawling [up to 3 degrees of separation] as well as inter-actor crawling [finding links between seeds only]"; a Digital Methods Initiative tool) (Instructions) (Auto-request a login)
    • Compare Networks Over Time ("Compares IssueCrawler networks over time, and displays ranked actor lists")
    • Extract URLs ("Extracts URLs from an Issuecrawler result file [.xml]; useful for retrieving starting points as well as a clean list of the actors in the network")
    • Issue Geographer ("Geo-locates the organizations on an IssueCrawler map, using whois information, and visualizes the organizations' registered locations on a geographical map")
    • Ranked Deep Pages from Core Issue Crawler Network ("Enter an IssueCrawler XML file and this script will get out all pages from the core network and rank those pages by inlink count")
  • Issue Discovery Tool ("Enter URLs, and discover the most relevant words and phrases contained in them. One also may enter text, or an Issuecrawler result file (.xml)"; a Digital Methods Initiative tool)
  • iTunes Store Research Tool ("This tool queries http://itunes.apple.com/linkmaker/, retrieves all available results and outputs a csv file, as well as a gexf file [for visualization in Gephi] containing the relations between items in the iTunes stores and their categories"; a Digital Methods Initiative tool)
  • Language Detection ("Detects language for given URLs. The first 500 characters on the Web page(s) are extracted, and the language of each page is detected"; a Digital Methods Initiative tool)
  • LinkRipper ("Capture all internal links and/or outlinks from a page"; a Digital Methods Initiative tool)
  • Lippmannian Device ("device ... named after Walter Lippmann [that] provides a coarse means of showing actor partisanship"; a Digital Methods Initiative tool)
  • Lippmannian Device to Gephi ("visualize the output of the Lippmannian device as a network with Gephi"; a Digital Methods Initiative tool)
  • Open Calais ("Discovers the most relevant words and phrases among a set of websites, within a text, or within an issue network"; a Digital Methods Initiative tool)
  • Rip Sentences ("Enter a URL, and this script will split the text of the html page into sentences"; a Digital Methods Initiative tool)
  • ProfileWords (creates word clouds visualizing frequent words in the profile bios of Twitter users and the last 25 tweets of their followers and those they follow)
  • SentiStrength ("Automatic sentiment analysis of up to 16,000 social web texts per second with up to human level accuracy and 14 languages available - others easily added. SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language")
  • Source Code Search (loads a URL and searches for keywords + optional number of trailing characters in the page's source code [e.g., "cool" or, with 5 trailing characters, "cool cats"]; a Digital Methods Initiative tool)
  • StoryTracker ("tools for tracking stories on news homepages"; includes ability to identify and track changing locations of stories on news site pages)
  • Table to Net ("Extract a network from a table. Set a column for nodes and a column for edges. It deals with multiple items per cell"; by Médialab Sciences-Po)
  • Text Ripper ("Rip all non-html (i.e. text) from a specified page"; a Digital Methods Initiative tool)
  • Timestamp Ripper ("Rips and displays a web page's last modification date (using the page's HTML header). Beware of dynamically generated pages, where the date stamps will be the time of retrieval"; a Digital Methods Initiative tool)
  • TLD (Top Level Domain) Counts ("Enter URL's, and count the top level domains"; a Digital Methods Initiative tool)
  • Tracker Tracker ("The tool Tracker Tracker can be used to make (some parts of) the 'cloud' visible. The tool allows for the characterization of a set of websites or pages by detecting a set of 900+ predefined 'fingerprints' of cloud devices, including those that fall under the category of analytics, ad programs, widgets or social plugins, trackers, and privacy. Tracker Tracker may thus be used to gain an overall picture of detectable trackers or for a number of specified analytical purposes, such as social plugin detection, mapping 'power concentrations of the cloud' - mapping the political economy of the cloud, by looking at 'cloud technology'"; a Digital Methods Initiative tool)
  • Triangulation ("Enter two or more lists of URLs or other items to discover commonalities among them. Possible visualizations include a Venn Diagram"; a Digital Methods Initiative tool)
  • Twitter (tools for studying Twitter)
    • Twitter Advanced Search (Twitter's search interface for their complete index of historical tweets)
    • Hashalyzer (creates reports of Twitter hashtag participants and tweets)
    • MentionMap (online tool that shows an interactive social network graph of a Twitter user's mentions of other users; clicking on another user shows the mention map of that user)
    • Mohio Social (shows in node-and-link style a user and the user's sphere of tweets, mentioned links, etc., each of which is clickable to go to the original tweet or linked document)
    • Twarc (Python "command line utility to archive Twitter search results as line-oriented-json")
    • Twiangulate (search for people followed in common by two people--i.e., the intersection of the sets of two people's follow lists)
    • Twitter Capture and Analysis Toolset (DMI-TCAT) (downloadable source code for tool that "captures tweets and allows for multiple analyses [hashtags, mentions, users, search, ...]." Due to Twitter's terms of service, the online version of this tool from the Digital Methods Initiative cannot be used by users unaffiliated with their program; but users may install the source code for themselves)
    • Twitonomy (online analytics platform for detailed study of users' Twitter activity, followers, mentions, retweets, hashtags, links, etc.; access to tracking statistics requires paid subscription)
    • Ernesto Priego, "Some Thoughts on Why You Would Like to Archive and Share [Small] Twitter Data Sets" (2014)
  • Wikipedia (tools for studying Wikipedia)
    • Wikipedia Cross-Lingual Image Analysis ("Insert a full Wikipedia URL ... and the tool will retrieve all language versions for the article. The tool will then scrape all the images of each language version and show them side by side in a table for comparison. The images retain the order in which they appear in the HTML"; a Digital Methods Initiative tool)
    • Wikipedia Edits Scraper and IP Localizer ("The tool scrapes the complete edit history for a specific Wikipedia page. When the tool finds an IP address instead of a user name it will use Maxmind's GeoCity Lite database to resolve the IP address to a geo-location"; a Digital Methods Initiative tool)
    • Wikipedia History Flow Companion ("The script chops Wikipedia edit histories in chronological chunks of 100 edits. It will display links which can be used to export those chunks from Wikipedia into IBM's History Flow Visualization"; a Digital Methods Initiative tool)
    • Wikipedia TOC Scraper ("Scrape Table of Contents for revisions of a Wikipedia page and explore the results by moving a slider to browse across chronologically ordered TOCs"; a Digital Methods Initiative tool)
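
Two of the simpler operations described in the list above, comparing URL lists for commonalities and differences (Compare Lists, Triangulation) and counting top-level domains (TLD Counts), can be sketched in Python. This is an illustration of the general idea only, not the Digital Methods Initiative's code; the function names and the sample URLs are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def compare_lists(first, second):
    """Return the URLs common to both lists and those unique to each."""
    a, b = set(first), set(second)
    return {
        "common": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


def tld_counts(urls):
    """Count the last dot-separated label of each URL's hostname.

    Naive by design: a host ending in "co.uk" is counted under "uk".
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host and "." in host:
            counts[host.rsplit(".", 1)[-1]] += 1
    return counts


# Hypothetical URL lists, for illustration only.
list_a = ["http://example.org/a", "http://example.com/b", "http://example.net/c"]
list_b = ["http://example.com/b", "http://example.org/d"]

comparison = compare_lists(list_a, list_b)
counts = tld_counts(list_a + list_b)

print(comparison)
print(counts)
```

The comparison treats URLs as exact strings, so two addresses that differ only by a trailing slash or by `http` versus `https` count as different entries unless they are normalized first.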

Mapping Tools & Platforms (see Mapping Tutorials)

  • BatchGeo ("create Google maps from your data [in spreadsheet format] ... accepts addresses, intersections, cities, states, and postal codes")
  • CartoDB (online tools for visualizing and analyzing geospatial data; free plan includes up to 5 tables and 5Mb of data)
    • Torque for CartoDB ("efficient, fast, and styleable rendering method" to animate data on an interactive map; "see how your data has grown, moved, or changed over time and space")
      • "Create Your First Torque Visualization in Under a Minute" (video tutorial for Torque)
  • ChartsBin (creates interactive maps)
  • Clio (online tool that shows locations of historical interest in user's proximity; "Clio is an educational website and mobile application that guides the public to thousands of historical and cultural sites throughout the United States. Built by scholars for public benefit, each entry includes a concise summary and useful information about a historical site, museum, monument, landmark, or other site of cultural or historical significance. In addition, “time capsule” entries allow users to learn about historical events that occurred around them. Each entry offers turn-by-turn directions as well as links to relevant books, articles, videos, primary sources, and credible websites")
  • Esri Story Maps: Storytelling with Maps ("Story maps combine intelligent Web maps with Web applications and templates that incorporate text, multimedia, and interactive functions")
  • Flow Mapping with Graph Partitioning and Regionalization ("an integrated software tool to explore flow patterns in large spatial interaction data. It involves two packages: (1) GraphRECAP, which uses spatially constrained graph partitioning to find a hierarchy of natural regions defined by spatial interactions; and (2) FlowMap, which visualizes flows based on the discovered regions and related attributes")
  • GeoExtraction (extracts geographical location from text; a Digital Methods Initiative tool)
  • Geo IP ("Translates URLs or IP addresses into geographical locations"; a Digital Methods Initiative tool)
  • Google Fusion Tables (create a fusion table and use the map chart type to map data with geographical information) | Instructions
  • Google Earth
    • Google Lit Trips (site unaffiliated with Google that provides "free downloadable files that mark the journeys of characters from famous literature on the surface of Google Earth. At each location along the journey there are placemarks with pop-up windows containing a variety of resources including relevant media, thought provoking discussion starters, and links to supplementary information about 'real world' references made in that particular portion of the story. The focus is on creating engaging and relevant literary experiences for students." Includes documentation about how to make lit trips.)
  • Google Maps "My Maps" ("create and share maps of your world, marked with the locations, routes and regions of interest that matter to you")
  • Map Stack ("Assemble a selection of different map layers like backgrounds, satellite imagery, terrain, roads or labels! Tweak Photoshop-like controls like colors, masks, opacity and brightness to make a map your own! Share your map with a link or Pinterest or Tumblr")
  • MapStory (online platform for creating animated maps with "storylayers" of data "highlighting changes over time whether they be social, cultural or economic in nature")
  • Neatline ("allows scholars, students, and curators to tell stories with maps and timelines. As a suite of add-on tools for Omeka, it opens new possibilities for hand-crafted, interactive spatial and temporal interpretation"; downloadable for installation on server)
    • Bethany Nowviskie, et al. "Geo-Temporal Interpretation of Archival Collections with Neatline" [PDF] (2013)
  • Odyssey (online tool that provides "a simple way for journalists, designers, and creators to weave interactive stories" based on a mapping paradigm; allows for mixing "written narrative, multimedia, and map based interaction")
  • Power Map Preview for Excel (download) (tool from Microsoft Research Labs that allows users to generate map visualizations from Excel spreadsheets, with geolocation, 2D and 3D data mapping, and interactive "video tours")
  • QGIS (downloadable open source GIS system positioned as alternative to the industry-standard, institutionally-priced ArcGIS tools; "Create, edit, visualise, analyse and publish geospatial information on Windows, Mac, Linux, BSD")
    • See the Geospatial Historian tutorial lessons on using QGIS for historical and other GIS mapping work.
  • StoryMapJS ("free tool to help you tell stories on the web that highlight the locations of a series of events; ... you can use StoryMapJS to tell a story with photographs, works of art, historic maps, and other image files. Because it works best with very large images, we call these 'gigapixel' StoryMaps"; requires a Google account and uses Google Drive as repository for user-provided photos to be shown on maps)
  • Thematic Mapping Engine (TME) ("enables you to visualise global statistics on Google Earth. The primary data source is UNdata. The engine returns a KMZ file that you can open in Google Earth or download to your computer")
  • Timemap ("Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously")
  • TimeMapper ("Elegant timelines and maps created in seconds")
  • WorldMap (open source platform "to lower barriers for scholars who wish to explore, visualize, edit, collaborate with, and publish geospatial information. WorldMap is Open Source software.... provides researchers with the ability to: upload large datasets and overlay them up with thousands of other layers; create and edit maps and link map features to rich media content; share edit or view access with small or large groups; export data to standard formats; make use of powerful online cartographic tools; georeference paper maps online...; publish one’s data to the world or to just a few collaborators")
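
Several of the tools above (Google Earth, Thematic Mapping Engine, Timemap) exchange map data as KML. As a rough illustration of what such a file contains, the following minimal sketch writes a few placemarks to KML using only Python's standard library. The place list, its approximate coordinates, and the function name `places_to_kml` are assumptions made for this example and do not come from any of the listed tools.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical sample data with approximate coordinates: (name, latitude, longitude).
PLACES = [
    ("Globe Theatre", 51.5081, -0.0972),
    ("Stratford-upon-Avon", 52.1917, -1.7083),
]


def places_to_kml(places):
    """Build a KML document string with one Placemark per place."""
    # Serialize the KML namespace as the default (unprefixed) namespace.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in places:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


kml_text = places_to_kml(PLACES)
print(kml_text)
```

Saving `kml_text` to a file with a `.kml` extension should give a file that KML-aware tools can open, though real projects will usually want descriptions, styles, and dates as well.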

Mind-Mapping Tools (Conceptualization Tools)

  • DebateGraph (collaborative mindmapping platform that allows individuals or groups to: facilitate group dialogue, make shared decisions, report on conferences, make and share posters, tell non-linear stories, explore the connections between subjects, etc.)

Network Analysis / Social Network Analysis Tools (see Network Analysis Tutorials) (see also Internet Research Tools, Tools for Studying Twitter, and Network Visualization Tools)

  • Gephi ("interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs")
    • (see Gephi Tutorials)
      • Convert Excel and CSV Files into Networks ("This plugin helps transform your Excel files or csv files into a network, directly imported into Gephi. You can choose which entities, and which relations, form the network")
  • Google Fusion Tables
    • Create a Google Fusion Table from a spreadsheet or csv file holding social network data, and select a chart type to visualize it as a network graph: instructions.
      • Other tutorials:
        • Timothy A. Lepczyk, "How to Create Network Graphs with Google Fusion Tables"
        • Iman Salehian (UCLA) & David Kim (UCLA), "Tutorial for Google Fusion Tables Network Graph" [PDF]
  • Jigsaw (downloadable platform for importing a variety of unstructured or structured documents; identifying entities (people, organizations, locations, dates, etc.); and visualizing relations between entities. Designed for investigating networks and clusters of relations implicit in large numbers of documents. Video tutorials | Instruction manual)
  • Local Wikipedia Map (online tool for visualizing networks of Wikipedia articles by choosing topics, filtering the resulting nodes [articles], and downloading and sharing the visualization; allows live access to the articles at nodes) (detailed instructions)
  • Netlytic ("cloud-based text and social networks analyzer that can automatically summarize large volumes of text and discover social networks from online conversations on social media sites such as Twitter, Youtube, blogs, online forums and chats")
  • Personae - A Character Visualization Tool for Dramatic Texts ("The aim of these visualisations is to use the XML files from the New Variorum Shakespeare edition of The Comedy of Errors to create a resource for exploring patterns of speeches by and mentions of characters in Shakespeare's work. Visualising the frequency, extent, and position of dialogue relating to a particular character presents users with a simple and immediate measure of that character’s prominence within the play. The tool enables users to select and visualise individual characters’ involvement, producing a novel means of exploring large-scale structural, narrative, or character-focused patterns within the text") (Github repository for the tool's code)
  • ProfileWords (creates word clouds visualizing frequent words in the profile bios of Twitter users and the last 25 tweets of their followers and those they follow)
  • TAGS v5.0 (Twitter Archiving Google Spreadsheet)
    • Stacy Blasiola, "Instructions for TAGS v5.0" (2013)
    • Example of TAGS v5.0 use by Lisa Marie Rhody: Tweet archive for #DH2013 (Digital Humanities 2013 conference) (2013)
  • Twitter Analysis (see Tools for Studying Twitter in Internet Research Tools section of this page)
  • UCINet for Windows ("software package for the analysis of social network data"; free trial for 90 days; discounted pricing for students & faculty)
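
The idea behind converting a spreadsheet or CSV file into a network, as the Gephi plugin and the Fusion Tables instructions above describe, can be sketched in a few lines of Python. This is an illustration only, not how either tool works internally: the column names `source` and `target`, the sample rows, and the function name `build_network` are all assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge list: each row names two people who exchanged letters.
CSV_DATA = """source,target
Austen,Cassandra
Austen,Murray
Murray,Byron
Byron,Shelley
"""


def build_network(csv_text):
    """Return an undirected adjacency map {node: set(neighbours)}."""
    graph = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        graph[row["source"]].add(row["target"])
        graph[row["target"]].add(row["source"])
    return dict(graph)


network = build_network(CSV_DATA)
# Degree (number of neighbours) is one of the simplest centrality measures.
degrees = {node: len(neighbours) for node, neighbours in network.items()}
for node, degree in sorted(degrees.items(), key=lambda item: (-item[1], item[0])):
    print(node, degree)
```

Each row becomes an edge and each distinct name becomes a node; dedicated tools add layout, filtering, and richer metrics on top of this basic structure.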

Programming Languages Tools & Resources (programming/scripting languages and major program toolkits/packages used to facilitate text and data analysis, collection, preparation, etc.) (see Programming Languages Tutorials)

  • Python Tools & Resources (see Python Tutorials)
    • Python ("a clear and powerful object-oriented programming language" often used for text and data wrangling)
    • Beautiful Soup ("Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work")
    • IPython Notebook ("web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document.... These notebooks are normal files that can be shared with colleagues, converted to other formats such as HTML or PDF, etc. You can share any publicly available notebook by using the IPython Notebook Viewer service which will render it as a static web page")
  • "R" Tools & Resources (see ["R" Tutorials](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244314/Tutorials%20for%20DH%20Tools%20and%20Methods#tutorials-r))
    • "R" (R Project for Statistical Computing) ("language and environment for statistical computing and graphics.... provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible")
    • rOpenSci (workflow environment based on R that is designed for scientists but may be useful for other scholars working with processing and narrating data. "Use our packages to acquire data (both your own and from various data sources), analyze it, add in your narrative, and generate a final publication in any one of widely used formats such as Word, PDF, or LaTeX"; "packages that allow access to data repositories through the R statistical programming environment [and] facilitate drawing data into an environment where it can readily be manipulated"; "analyses and methods can be easily shared, replicated, and extended by other researchers")
    • Stylo for R (computational stylistics methods implemented as R package; see how-to article [PDF]; warning: requires advanced knowledge)
  • FACTORIE ("toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating factor graphs, estimating parameters and performing inference")
    • Examples and instructions for using FACTORIE for Topic Modeling, Document Classification, and Natural Language Processing.
  • Web Browser Automation Tools (can be used for scripting web-scraping)
    • Google Chrome Scraper ("highlight a part of the webpage you'd like to scrape, right-click and choose 'Scrape similar....' Anything that's similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs")
      • Tutorial: Jens Finnäs, "Get started with screenscraping using Google Chrome’s Scraper extension" (2012)
    • Selenium (browser-automation tool that can be used to create scripts and other automation for web-scraping)
  • Mashup Tools
    • Yahoo Pipes ("composition tool to aggregate, manipulate, and mashup content from around the web.... Simple commands can be combined together to create output that meets your needs: combine many feeds into one, then sort, filter and translate it; geocode your favorite feeds and browse the items on an interactive map....")
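
To make the scraping task concrete, here is a minimal sketch of pulling data out of HTML, the job described for Beautiful Soup above. To stay dependency-free it uses Python's standard-library `html.parser` rather than Beautiful Soup itself, so it shows the general idea and not that library's API. The page fragment and the class name `LinkCollector` are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
PAGE = """
<ul>
  <li><a href="/letters/1">Letter to Cassandra</a></li>
  <li><a href="/letters/2">Letter to Murray</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```

A parser library such as Beautiful Soup replaces the hand-written event handlers with search and navigation methods, which is usually more convenient for messy real-world pages.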

Simulation Tools & Platforms (see Simulation Tutorials)

  • NetLogo (downloadable software for agent-based simulations: "NetLogo is a programmable modeling environment for simulating natural and social phenomena. . . . NetLogo is particularly well suited for modeling complex systems developing over time. Modelers can give instructions to hundreds or thousands of independent 'agents' all operating concurrently. This makes it possible to explore the connection between the micro-level behavior of individuals and the macro-level patterns that emerge from the interaction of many individuals. NetLogo lets students open simulations and 'play' with them, exploring their behavior under various conditions. It is also an authoring environment which enables students, teachers and curriculum developers to create their own models. NetLogo is simple enough that students and teachers can easily run simulations or even build their own. And, it is advanced enough to serve as a powerful tool for researchers in many fields. NetLogo has extensive documentation and tutorials. It also comes with a Models Library, which is a large collection of pre-written simulations that can be used and modified. These simulations address many content areas in the natural and social sciences, including biology and medicine, physics and chemistry, mathematics and computer science, and economics and social psychology")
  • Second Life (general-purpose, Internet-based, immersive, 3D, and highly scalable (massively multi-user) "virtual world" where users can create an avatar, create richly rendered spaces and objects, and interact with each other as well as with various media sources)
  • SET (Simulated Environment for Theatre) ("3D environment for reading, exploring, and directing plays. Designed and developed by a multidisciplinary team of researchers, SET uses the Unity game engine to allow users to both author and playback digital theatrical productions")
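
The link between micro-level rules and macro-level patterns that the NetLogo description mentions can be shown with a very small agent-based model. The sketch below is written in Python, not NetLogo, and the model itself (agents on a ring adopting the majority opinion among themselves and their two neighbours) is a hypothetical toy chosen for brevity.

```python
def step(opinions):
    """One synchronous update: each agent adopts the majority opinion
    among itself and its two neighbours on a ring."""
    n = len(opinions)
    return [
        1 if opinions[i - 1] + opinions[i] + opinions[(i + 1) % n] >= 2 else 0
        for i in range(n)
    ]


def run(opinions, max_steps=50):
    """Apply step() until the population stops changing or max_steps is hit."""
    history = [list(opinions)]
    for _ in range(max_steps):
        nxt = step(history[-1])
        if nxt == history[-1]:
            break
        history.append(nxt)
    return history


# Hypothetical starting population: 1 and 0 are two competing opinions.
start = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
for generation in run(start):
    print("".join("#" if o else "." for o in generation))
```

No agent is told to form a cluster, yet isolated opinions disappear and contiguous blocks remain, which is the kind of emergent pattern agent-based platforms are designed to explore at much larger scale.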

Text Analysis Tools

(complemented by tools for Text Preparation for Digital Work, Topic Modeling Tools, and Text Visualization Tools below; see also the TAPoR 2 portal of text-analysis tools for an omnibus listing with reviews, ratings, difficulty levels, etc.) (see Text Analysis Tutorials). For some text-analysis tools, stop word lists are useful (lists of common words to ignore). Two common English-language stop lists are: Fox 1992 stop word list (429 words) | SMART 1971 stop word list (571 words)
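
As a minimal sketch of how a stop word list is applied, the following Python example counts word frequencies while ignoring common words. The short `STOP_WORDS` set is a hypothetical stand-in; a real project would load a published list such as the Fox or SMART lists mentioned above.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens in text, skipping any in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


sample = "The whale is the terror of the sea, and the sea is the home of the whale."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Without the filter, "the" would dominate the counts; with it, the content words come to the top.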

  • AntConc ("concordance program developed by Prof. Laurence Anthony," with versions for Windows, Mac & Linux; site includes video tutorials)

    • Rudimentary instructions for using Antconc (Alan Liu)
    • Heather Froehlich, "Getting Started with AntConc" (tutorial)
    • Other tools to complement or extend AntConc also available on the AntConc site.
  • Bookworm ("Search for trends in 4.6M public domain texts from HathiTrust Digital Library")

    • Related Bookworm sites and interfaces:
      • Various Bookworms: (interface for Google Ngram visualization of trends in a select number of corpora: Open Library books; arXiv science publications; Chronicling America historical newspapers; US Congress bills, amendments, and resolutions; Social Science Research Network research paper abstracts)
      • Ben Schmidt,
        • In-browser Text Classification Using Bookworm ("This page automatically classifies a snippet of text (pasted into the text area below) against a bookworm database, so you can see how any given snippet lines up with the metadata you've defined for a collection.... If you have a Bookworm installation of your own, you can easily modify the code here to classify by whatever text variables you might have on hand")
        • Bookworm: Movies ("Search for trends in the dialogue of thousands of movie and TV shows, based on subtitles from Open Subtitles")
        • Bookworm: Simpsons ("Search across every word from 25 years of the Simpsons (at least, the ones that made it into closed captions) by episode, season, or even time within in the episode") (How the Simpsons bookworm was made)
      • FAQ and Guide to Making Bookworms (by Ben Schmidt)
      • Also see the following on the nature and limitations of the underlying HathiTrust corpus for Bookworm: David Mimno, "Word Counting, Squared" (2014).
  • CLAWS ("grammatical tagger that analyzes words in a text by part of speech. Based on the approximately 100 million words of the British National Corpus")

  • Concordance Programs (see Concordance Program Tutorials)

    • M-N. Lamy and H. J. Klarskov Mortensen, "Using Concordance Programs in the Modern Foreign Languages Classroom" (2012) (includes links to concordance programs)
    • AntConc ("concordance program developed by Prof. Laurence Anthony," with versions for Windows, Mac & Linux; site includes video tutorials)
    • Heather Froehlich, "Getting Started with AntConc" (tutorial)

  • Corpus Linguistics Programs/Resources (see Corpus Linguistics Tutorials) (see also Corpora sets in Data Collections & Datasets)

    • U. Portsmouth, Online Corpus Linguistics Resources (including tools)
    • Wmatrix (corpus analysis and comparison tool providing "a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains"; 1-month free trial)
    • WordSimilarity (also known as Word 2 Word) (downloadable, java-based "open-source tool to plot and visualize semantic spaces, allowing researchers to rapidly explore patterns in visual data representative of statistical relations between words. Words are visualized as nodes and word similarities as directed edges of varying strengths or thicknesses.... system contains a large library of ready to use, modern, statistical relationship models along with an interface to teach them from various language sources")
  • DataBasic ("a suite of easy-to-use web tools for beginners that introduce concepts of working with data . . . WordCounter analyzes your text and tells you the most common words and phrases. . . . WTFcsv tells you WTF is going on with your .csv file. . . . SameDiff compares two or more text files and tells you how similar or different they are")

  • DPLA (Digital Public Library of America) Visual Search Prototype ("prototype visual search interface that explores content from the Digital Public Library of America. It is designed to provide an 'at-a-glance' visual overview of search results, and an intuitive means of narrowing the scope of the search")

  • Google Ngram Viewer (search for and visualize trends of words and phrases in the Google Books corpus; includes ability to focus on parts of the corpus [e.g., "American English," "English Fiction"] and to use a variety of Boolean and other search operators); see the related article: Jean-Baptiste Michel, Erez Lieberman Aiden, et al., "Quantitative Analysis of Culture Using Millions of Digitized Books" (2011)

    • See also: Bookworm (interface for Google Ngram visualization of trends in a select number of corpora: Open Library books; arXiv science publications; Chronicling America historical newspapers; US Congress bills, amendments, and resolutions; Social Science Research Network research paper abstracts)
  • HathiTrust Research Center (HTRC) Portal (allows registered users to search the HathiTrust's ~3 million public domain works, create collections, upload worksets of data in CSV format, and perform algorithmic analysis -- e.g., word clouds, semantic analysis, topic modeling) (Sign-up for login to HTRC portal; parts of the search and analysis platform requiring institutional membership also require a userid for the user's university)

    • Features Extracted From the HTRC ("A great deal of fruitful research can be performed using non-consumptive pre-extracted features. For this reason, HTRC has put together a select set of page-level features extracted from the HathiTrust's non-Google-digitized public domain volumes. The source texts for this set of feature files are primarily in English. Features are notable or informative characteristics of the text. We have processed a number of useful features, including part-of-speech tagged token counts, header and footer identification, and various line-level information. This is all provided per-page.... The primary pre-calculated feature that we are providing is the token (unigram) count, on a per-page basis"; data is returned in JSON format)
    • Tools and Tutorials related to using the HathiTrust Research Center:
      • Peter Organisciak and Boris Capitanu (in Programming Historian), "Text Mining in Python through the HTRC Feature Reader" (2016) ("We introduce a toolkit for working with the 13.6 million volume Extracted Features Dataset from the HathiTrust Research Center. You will learn how to peer at the words and trends of any book in the collection, while developing broadly useful Python data analysis skills")
  • IBM Watson User Modeling service ("uses linguistic analytics to extract cognitive and social characteristics, including Big Five, Values, and Needs, from communications that the user makes available, such as email, text messages, tweets, forum posts, and more"; online demo site allows users to input text samples for analysis)

  • Lexos - Integrated Lexomics Workflow ("online tool ... to 'scrub' (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")

  • Macro-Etymological Analyzer (program by Jonathan Reeve that runs a frequency analysis of plain-text documents, looking up each word using the Etymological Wordnet, and tallying the words according to origin language family)

  • Named Entity Recognition (NER) Tools (see NER Tutorials)

    • NEX - Named Entity eXtraction (Web tool from dataTXT to identify names, concepts, etc. in short texts; also allows API access)
    • Stanford Named Entity Recognizer (NER) ("a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances")
  • New York Times "Chronicle" (use an interface similar to Google Books Ngram Viewer to explore the rise and fall in frequencies of words/phrases published in the New York Times. Instructions: first "clear graph"; then add one word or phrase to the graph at a time whose frequency you are interested in)

  • OpenCalais ("The OpenCalais Web Service allows you to automatically annotate your content with rich semantic metadata, including entities such as people and companies and events and facts such as acquisitions and management changes")

  • Overview (open-source web-based tool designed originally for journalists needing to sort large numbers of stories automatically and cluster them by subject/topic; includes visualization and reading interface; allows for import of documents in PDF and other formats. "Overview has been used to analyze emails, declassified document dumps, material from Wikileaks releases, social media posts, online comments, and more." Can also be installed on one's own server.)

  • Personal-Nouns (Python scripts by Cory A. Taylor for generating list of "personal nouns" found in a text--i.e., nouns applying to persons such as "conscript, consecrater, conservator, consignee, consigner," etc. The GitHub site includes a list of personal nouns generated from the 1890 Webster's Unabridged Dictionary)

  • Poem Viewer ("web-based tool for visualizing poems in support of close reading")

    • Video describing the poetry visualization tool
    • Research paper on principles of Poem Viewer: A. Abdul-Rahman, et al., "Rule-based Visual Mappings - with a Case Study on Poetry Visualization"
  • Prospero ([documentation in French] text-analysis suite designed for humanists working with historical and diachronic textual series; focused on exploring "complex cases")

  • Prosodic ("a python script which performs two main functions: 1. annotating English and Finnish text for their phonological properties; 2. evaluating the relative metricality of lines of English and Finnish text")

  • Prospect ("a sophisticated web-app implemented as a plugin for WordPress that enables users to collect and curate data and then enable the wider public to visualize and access that data. The graphical representation of data – whether it be geographical information shown on maps, temporal data shown on timelines, interpersonal relationships shown as connected graphs, etc. – can facilitate end-users in comprehending it quickly and analyzing it in domain-specific ways") (more detailed "About" page)

  • Robots Reading Vogue (online tools from Digital Humanities at Yale University Library for datamining the archives of Vogue magazine; includes covermetrics, n-gram search, topic-modeling, and statistics for advertisements, circulation, etc.)

    • Vogue N-gram Search (use an interface similar to Google Books Ngram Viewer to explore the rise and fall in frequencies of words/phrases published in the Vogue magazine)
  • Sentiment Analysis (Useful cautionary critique of sentiment analysis: Sarah Kessler, "The Problem With Sentiment Analysis" (2014))

    • Sentiment Analysis (interactive demo plus information and research paper for the analysis of degrees of positive/negative "sentiment" in text passages based on an extensive "sentiment bank"; site includes downloadable dataset and code)
    • Sentiment140 ("allows you to discover the sentiment of a brand, product, or topic on Twitter"; "Our approach is different from other sentiment analysis sites because: we use classifiers built from machine learning algorithms. Some other sites use a simpler keyword-based approach, which may have higher precision, but lower recall. We are transparent in how we classify individual tweets. Other sites do not show you the classification of individual tweets and only show aggregated numbers, which makes it difficult to assess how accurate their classifiers are.")
    • Umigon ("sentiment analysis for tweets, and more")
  • Signature Stylometric System ("program designed to facilitate 'stylometric' analysis and comparison of texts, with a particular emphasis on author identification")

  • Sketch Engine (subscription-based text analysis service with 1-month free trial; includes pre-loaded corpora in multiple languages and also allows users to create their own corpora from online sources; tools for building corpora, concordance search, thesaurus, word list, term extraction, parallel corpora; subscribers can access historical corpora such as EEBO/ECCO).

  • TAPoR (Text Analysis Portal) (collection of online text-analysis tools--ranging from the basic to sophisticated)

    • TAPoR 2.0 (current, redesigned TAPoR portal; includes tool descriptions and reviews; also includes documentation of some historical or legacy tools)
  • Statistical Natural Language Parsers (Probabilistic Grammar / Syntax Parsers) (about statistical parsing)

    • Christopher D. Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing" (links to many parser tools)
    • Bikel Parser
    • MSTParser
    • OpenCCG Parser
    • RASP System
    • The Stanford Parser
  • Stylo for R (computational stylistics methods implemented as R package; see how-to article [PDF]; warning: requires advanced knowledge)

  • Text Mechanic ("A suite of simple, single task, browser based, text manipulation tools"--e.g., working with lines, words, spaces, etc. in texts)

  • Textometrica (web-based text analysis tool designed to "analyse large amounts of text in several different ways. For example, you can examine the frequency of individual words, see how often one term is linked to another, and see which words together form ideas and concepts in the text. Users can also create different visualisations and graphs from their text in order to gain a better overview of the structure of the text")

  • Textal ("free smartphone app that allows you to analyze websites, tweet streams, and documents, as you explore the relationships between words in the text via an intuitive word cloud interface. You can generate graphs and statistics, as well as share the data and visualizations in any way you like")

  • TextPlot ("Textplot is a little program that turns a document into a network of terms that are connected to each other depending on the extent to which they appear in the same locations in the text")

  • twXplorer (online service that provides search tools for Twitter tweets, terms, links, and hashtags in relation to each other; provides a first-pass analytical view of a tweet or term, for example, in its relevant context)

  • TXM (Textométrie) ("The TXM platform combines powerful and original techniques for the analysis of large text corpora using modular components and open-source.... Helps users to build and analyze any type of digital textual corpus possibly labeled and structured in XML... Distributed as a Windows, Linux or Mac software application ... and as an online portal run by a web application")

  • VariAnt ("A freeware spelling variant analysis program for Windows" -- scroll down on this page for the download links)

  • Voyant Tools (Online text reading and analysis environment with multiple capabilities that presents statistics, concordance views, visualizations, and other analytical perspectives on texts in a dashboard-like interface. Works with plain text, HTML, XML, PDF, RTF, and MS Word files (multiple files best uploaded as a zip file). Also comes with two pre-loaded sets of texts to work on (Shakespeare's works and the Humanist List archives [click the “Open” button on the main page to see these sets]))

    • Voyant "Get Started" Guide (Also see: Geoffrey Rockwell, "Introduction to Voyant", 2016)
    • Helpful tutorial from Pedagogy Toolkit with tips and examples for classroom use of Voyant
    • Voyant tool list
    • Voyant Tools Documentation
    • VoyantServer (downloadable Java-based version of Voyant Tools that can be installed and run locally on desktop or laptop computer)
  • Word and Phrase.info (powerful tool that allows users to match texts they enter against the 450-million word Corpus of Contemporary American English [COCA] to analyze their text by word frequencies, word lists, collocates, concordance, and related phrases in COCA)

  • Word2Vec ("deep-learning" neural network analysis tool from Google that seeks out relationships (vectors) between words in texts)

    • Explanations and discussions of the tool:
      • Google Open Source Blog: "Learning the Meaning Behind Words" (Aug. 2013)
      • Derrick Harris, "We're On the Cusp of Deep Learning for the Masses" (16 Aug. 2013)
  • WordHoard (powerful text-analysis tool for a select group of "highly canonical literary texts"--currently, all of early Greek epic (in original and translation), all of Chaucer and Shakespeare, and Edmund Spenser's Faerie Queene and Shepheardes Calendar)

  • Word Map (enter a word and visualize on a map its relation to equivalent words in different languages and nations around the world; "this experiment brings together the power of Google Translate and the collective knowledge of Wikipedia to put into context the relationship between language and geographical space")

  • WordSeer ("web-based text analysis and sensemaking environment for humanists and social scientists") (for full discussion of the site, see Aditi Muralidharan and Marti A. Hearst, "Supporting Exploratory Text Analysis in Literature Study," 2013 [paywalled])

  • Word Tree (generate word trees like those originally created for the ManyEyes visualization site from pasted-in text or from URL; example)

  • WordWanderer ("We are experimenting with visual ways in which we can enhance people's engagement with language. By fusing the information we can obtain from corpus searches, concordance outputs and word clouds we are aiming to enable and encourage people to notice and wander through the words they read, write and speak")
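
Many of the tools above (AntConc, Voyant Tools, and the other concordance programs) centre on the keyword-in-context (KWIC) view. The following minimal sketch shows the basic idea in Python. It is not the algorithm of any listed tool, and the function name `kwic`, the `width` parameter, and the simple tokenizer are assumptions.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    width is the number of words of context kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)
for left, word, right in kwic(sample, "a"):
    print(f"{left:>30} | {word} | {right}")
```

Aligning every occurrence of the keyword in one column makes recurring neighbours (collocates) easy to spot, which is why concordancers present results this way.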

Text Collation Tools (see Text Collation Tutorials)

  • Juxta Commons ("a tool that allows you to compare and collate versions of the same textual work")
    • Juxta tutorials and resources: see Tutorials: Text Collation.
  • TRAViz (a JavaScript library that "generates visualizations for Text Variant Graphs that show the variations between different editions of texts. TRAViz supports the collation task by providing methods to: align various editions of a text; visualize the alignment; improve the readability for Text Variant Graphs compared to other approaches; interact with the graph to discover how individual editions disseminate")
  • Versioning Machine, version 4.0 ("a framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines")
  • Visualizing Variation ("code library of free, open-source, browser-based visualization prototypes that textual scholars can use in digital editions, online exhibitions, born-digital articles, and other projects. All of the visualization prototypes offered here deal with different aspects of the bibliographical phenomenon of textual variation: the tendency of words, lines, passages, images, prefatory material, and other aspects of texts to change from one edition to the next, and even between supposedly identical copies of the same edition. Variants are material reminders of the complex social lives of texts")
  • VVV (Version Variation Visualization) ("explore great works with their world-wide translations")

Text Encoding Tools (see Text Encoding Tutorials)

  • TEI Tools (tools page from the Text Encoding Initiative; includes tools for generating TEI schemas and converting to and from TEI documents, as well as stylesheets for converting TEI documents to HTML and other formats)
  • Music Encoding Initiative (MEI) (XML schemas and downloadable editing tool for text-encoding of music notation documents)
  • OpenCalais ("The OpenCalais Web Service allows you to automatically annotate your content with rich semantic metadata, including entities such as people and companies and events and facts such as acquisitions and management changes")
  • Oxygen XML Editor (free 30-day trial period)
  • XMLSpy (free 30-day trial period)

Text Preparation for Digital Work (Text & Data "Wrangling" Tools for Harvesting, Scraping, Cleaning, Classifying, etc.)

(see Text Preparation Tutorials) (see also Programming Languages Tools & Resources to facilitate wrangling)

  • The Sourcecaster (set of instructions for using the command line to perform common text preparation tasks--e.g., conversion of text or media formats, wrangling and cleaning text, batch filename editing, etc.)
  • AntFileConverter ("freeware tool to convert PDF files into plain text for use in corpus tools like AntConc" -- scroll down for the download links on this page)
  • BookNLP ("natural language processing pipeline that scales to books and other long documents in English, including: part-of-speech tagging ... dependency parsing ... named entity recognition ... character name clustering ... quotation speaker identification ... pronominal coreference resolution")
  • CSVkit ("suite of utilities for converting to and working with CSV, the king of tabular file formats")
    • See also the step-by-step tutorial "Eleven Awesome Things You Can Do With CSVKit"
  • Data Science Tool Kit (variety of tools for such purposes as converting/mapping street address to geographical coordinates, coordinates to political areas, coordinates to statistics, IP address to coordinates, text to sentences [i.e., removing boilerplate from text passages], text to sentiment, HTML to text, HTML to story, text to people, and files to text [e.g., PDF, Word docs, and Excel spreadsheets to text])
  • DataWrangler ("interactive tool for data cleaning and transformation"; suggests and facilitates restructuring, extraction, deletion, and other transformations of tabular and other structured data)
  • Import.io ("Turn any website into a table of data or an API in minutes without writing any code")
  • Jeroen Janssens, "7 Command-line Tools for Data Science" (2013) (tools for "obtaining, scrubbing, exploring, modeling, and interpreting data")
  • Lexos - Integrated Lexomics Workflow ("online tool ... to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")
  • NameChanger ("Rename a list of files quickly and easily. See how the names will change as you type")
  • OpenRefine ("tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase")
    • Tutorials: "Introduction to OpenRefine" [PDF] and "Cleaning Data with OpenRefine" (5 Aug. 2013)
  • OutWit Hub (standalone program or Firefox extension for data extraction using Firefox: "contents extracted from a Web page are presented in an easy and visual way, without requiring any programming skills or advanced technical knowledge. Users can easily extract links, images, email addresses, RSS news, data tables, etc. from series of pages without ever seeing the source code. Extracted data can be exported to CSV, HTML, Excel or SQL databases, while images and documents, are directly saved to your hard disk"; paid "pro" version has more capabilities and capacity)
  • Overview ("automatically sorts thousands of documents into topics and sub-topics, by reading the full text of each one")
  • Pandoc ("If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in markdown, reStructuredText, textile, HTML, DocBook, LaTeX, MediaWiki markup, OPML, Emacs Org-Mode, or Haddock markup to HTML formats: XHTML, HTML5, and HTML slide shows using Slidy, reveal.js, Slideous, S5, or DZSlides; Word processor formats: Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML; Ebooks: EPUB version 2 or 3, FictionBook2; Documentation formats: DocBook, GNU TexInfo, Groff man pages, Haddock markup; Page layout formats: InDesign ICML; Outline formats: OPML; TeX formats: LaTeX, ConTeXt, LaTeX Beamer slides; PDF via LaTeX; Lightweight markup formats: Markdown, reStructuredText, AsciiDoc, MediaWiki markup, Emacs Org-Mode, Textile...")
  • pdf2htmlEX ("renders PDF files in HTML, utilizing modern Web technologies. It aims to provide an accurate rendering, while keeping optimized for Web display")
  • PhoTransEdit (Text to Phonetics online transcriber for turning English text into phonetic transcription using IPA symbols; also has a free downloadable version)
  • Rip Sentences ("Enter a URL, and this script will split the text of the html page into sentences"; a Digital Methods Initiative tool)
  • Scan Tailor ("interactive tool for post-processing of scanned pages. It gives the ability to cut or crop pages, compensate for skew angle, and add / delete content fields and margins, among others. You begin with raw scans, and end up with tiff's that are ready for printing or assembly in PDF or DjVu file")
  • Scraper ("simple data mining extension for Google Chrome"; "to use it: highlight a part of the webpage you'd like to scrape, right-click and choose "Scrape similar...". Anything that's similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs")
  • ScraperWiki (free tools for scraping from Twitter and table extraction from PDFs)
  • ScraperWiki Classic (archive of user-created scraping tools for specific purposes and resources; includes resources and tutorials for creating your own scraper)
  • Scrapy (downloadable Python-based "fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing")
  • Source Code Search (loads a URL and searches for keywords + optional number of trailing characters in the page's source code [e.g., "cool" or, with 5 trailing characters, "cool cats"]; a Digital Methods Initiative tool)
  • Stanford Named Entity Recognizer (NER) ("a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances")
    • Michelle Moravec, "How to Use Stanford's NER and Extract Results" (2014)
  • TET Plugin (Plugin for Adobe Acrobat designed to extract text from PDFs)
  • Text Ripper ("Rip all non-html (i.e. text) from a specified page"; a Digital Methods Initiative tool)
  • VARD 2 ("software produced in Java designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended to be a pre-processor to other corpus linguistic tools such as keyword analysis, collocations," etc.)
  • Text Preparation "Recipes" for Topic Modeling Work:
    • Matthew Jockers
      • "'Secret' Recipe for Topic Modeling Themes" (guidance on creating stop lists, using parts-of-speech taggers to filter text, and "chunking" texts into suitable-length sections to optimize topic-modeling results)
      • "Expanded Stopwords List" ("Below is the list of stop words I used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.")
    • Andrew Goldstone & Ted Underwood, "Code Used ... in Analyzing Topic Models of Literary-studies Journals" (GitHub repository of stoplist, code, and resources for Goldstone and Underwood's topic modeling project)
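The "recipes" above mention stop lists and "chunking" texts into suitable-length sections. As a rough illustration only (not the code used by Jockers or by Goldstone and Underwood), a minimal Python sketch of those two steps might look like the following. The tiny stop list, the function names, and the chunk size are all hypothetical choices made for the example.

```python
import re

# Hypothetical mini stop list; a real project would load a much longer one.
STOPWORDS = {"the", "of", "an", "a", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stopwords]


def chunk(tokens, size=1000):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


sample = "The whale and the sea: an account of a voyage to the South Seas."
tokens = remove_stopwords(tokenize(sample))
chunks = chunk(tokens, size=3)  # tiny size so the result is easy to inspect
```

Each chunk can then be written to its own .txt file and treated as a separate "document" by a topic modeling tool. A suitable chunk size depends on the corpus and is a matter of experiment.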

Topic Modeling Tools (complemented by Text Preparation "Recipes" for Topic Modeling Work above) (see Topic Modeling Tutorials)

  • DFR-Browser (browser-based visualization interface created by Andrew Goldstone for exploring JSTOR articles through topic modeling [facilitated by the JSTOR "Data for Research" (DFR) site])
  • FACTORIE ("toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating factor graphs, estimating parameters and performing inference")
    • Examples and instructions for using FACTORIE for Topic Modeling, Document Classification, and Natural Language Processing.
  • Gensim ("free Python library: scalable statistical semantics, analyze plain-text documents for semantic structure, retrieve semantically similar documents")
  • Glimmer.rstudio.com Topic Modeling (LDA) visualization tool (allows users to upload their own data to generate scatterplots and bar charts)
  • In-Browser Topic Modeling ("Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations"; by David Mimno.) Note: the files for this tool can be downloaded and run locally; download from GitHub here.
  • LDAvis ("R package for interactive topic model visualization") (example of use)
  • MALLET
    • Mallet (MAchine Learning for LanguagE Toolkit)
      • GRMM (GRaphical Models in Mallet)
      • Programming Historian tutorial for installing and starting with MALLET
      • Latest version of Mallet in David Mimno's GitHub repository
  • MALLET-to-Gephi Data Stacker (online tool that takes "the '--output-doc-topics' output from MALLET and reorganize it into a format that Gephi understands")
  • The Networked Corpus ("a Python script that generates a collection of Web pages like the ones we have created for The Spectator.... designed to work with MALLET." The Networked Corpus project "provides a new way to navigate large collections of texts. Using a statistical method called topic modeling, it creates links between passages that share common vocabularies, while also showing in detail the way in which the topic modeling program has “read” the texts. ")
  • Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: Import and manipulate text from cells in Excel and other spreadsheets; Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; Select parameters (such as the number of topics) via a data-driven process; Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
  • TMVE ("basic implementation of a topic model visualization engine")
  • Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in "Downloads" tab on site]. Input files should be in .txt files saved in same directory; the input files are formatted with returns between each separate document) (Note: this latest implementation of the Topic Modeling Tool is by Scott Enderle).
    • See also Miriam Posner, "Very basic strategies for interpreting results from the Topic Modeling Tool"
  • "Two Topic Browsers" by Jonathan Goodwin

Video Tools

  • "A Short List of Video Editing Applications" (JISC Guides)

Video & Film Analysis Tools

  • Cinemetrics (Frederic Brodbeck's project for "measuring and visualizing movie data, in order to reveal the characteristics of films and to create a visual 'fingerprint' for them. Information such as the editing structure, color, speech or motion are extracted, analyzed and transformed into graphic representations so that movies can be seen as a whole and easily interpreted or compared side by side"; includes downloadable code for Python script tools used to create the metrics)
  • ClipNotes ("designed for use with any film or video in which you want to quickly and easily retrieve selected segments and display them along with your notes or annotations.... you must first prepare an XML file which contains the starting and stopping times of the segments you wish to access, together with a caption to appear on a list, and any description or annotation you want to display along with the clip. Preparing an XML file is a remarkably easy procedure"; currently available as app for iOS and Windows 8.1, with Android app coming)
  • Film Impact Rating tool (provides a ranking of a film's impact based on a number of factors, including numbers of screenings, venues, receipts, review ratings, awards, etc. While designed for Australian film, the tool can be used for other films.)
  • Kinomatics Project ("collects, explores, analyses and represents data about the creative industries.... Current focus is on the spatial and temporal dimensions of international film flow and the location of Australian live music gigs"; also includes visualizations and tools for film impact rating)
  • YouTube Tools
    • YouTube Data Tools ("collection of simple tools for extracting data from the YouTube platform via the YouTube API v3. For some context and a small introduction, please check out this blog post. . . there is a FAQ section with additional information, and an introductory video")
    • Thomas Padilla, "YouTube Data for Research" (includes tutorial, suggested command-line tools, and use case)

Visualization Tools (see Visualization Tutorials)

  • General or Multiple Purpose Viz Tools:

    • Better World Flux ("beautiful interactive visualization of information on what really matters in life. Indicators like happiness, life expectancy, and years of schooling are meaningfully displayed in a colourful flowing Flux.... visually communicates the world state in terms of standards of living and quality of life for many countries and how this has changed, and mostly improved, over a period of up to 50 years. This site is a tool for building a consensus, telling a story and sharing it, all whilst raising awareness for the UN Millennium Development Goals.")
    • Bonsai (tool for programmatic creation of simple animated graphics in a Web browser using "graphics library which includes an intuitive graphics API and an SVG renderer")
    • Chart and Image Gallery: 30+ Free Tools for Data Visualization and Analysis (gathering of tools by Sharon Machlis)
    • Circos ("software package for visualizing data and information ... in a circular layout ... ideal for exploring relationships between objects or positions")
    • D3.js ("a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS")
      • (see D3.js Tutorials)
    • GapMinder World (online or desktop data/statistics animation)
    • Gephi ("interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs")
      • (see Gephi Tutorials)
    • Google Fusion Tables (Google's "experimental data visualization web application to gather, visualize, and share larger data tables. Visualize bigger table data online; Filter and summarize across hundreds of thousands of rows. Then try a chart, map, network graph, or custom layout and embed or share it")
    • ImageJ (image processing program that can create composite or average images)
      • (see ImageJ Tutorials)
    • ImagePlot ("free software tool that visualizes collections of images and video of any size.... implemented as a macro which works with the open source image processing program ImageJ")
    • OpenHeatMap (creates "heat map" visualizations from spreadsheets)
    • Palladio (data visualization tool designed for humanities work; "web-based app that allows you to upload, visualize, and filter your data on-the-fly")
      • (see Palladio Tutorials)
    • Pixlr Editor (highly capable, free, Photoshop-like photo editor that runs entirely in Flash in a browser; allows users to import images from local or online sources, edit, resize, crop, adjust image features, apply filters, etc.; Pixlr also has versions for mobile devices)
    • Processing ("Processing is a simple programming environment that was created to make it easier to develop visually oriented applications with an emphasis on animation and providing users with instant feedback through interaction. The developers wanted a means to “sketch” ideas in code. As its capabilities have expanded over the past decade, Processing has come to be used for more advanced production-level work in addition to its sketching role. Originally built as a domain-specific extension to Java targeted towards artists and designers, Processing has evolved into a full-blown design and prototyping tool used for large-scale installation work, motion graphics, and complex data visualization")
    • Prospect ("a sophisticated web-app implemented as a plugin for WordPress that enables users to collect and curate data and then enable the wider public to visualize and access that data. The graphical representation of data – whether it be geographical information shown on maps, temporal data shown on timelines, interpersonal relationships shown as connected graphs, etc. – can facilitate end-users in comprehending it quickly and analyzing it in domain-specific ways") (more detailed "About" page)
    • "R" (R Project for Statistical Computing) ("language and environment for statistical computing and graphics.... provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible")
      • (see "R" Tutorials)
    • RAW (open web app that allows users to use a simple interface to upload data from a spreadsheet, choose and configure a vector graphics visualization, and export the results; built on top of the D3.js library)
    • Silk (online data visualization and exhibition platform; takes datasets input as spreadsheets and allows users to create collections, maps, graphs, etc.)
      • For a helpful quick-start guide (with sample dataset), see Miriam Posner, "Getting Started with Silk" (2016)
    • Tableau Public ("within minutes, our free data visualization tool can help you create an interactive viz and embed it in your website or share it")
      • Tableau Desktop (paid desktop version) | 1-year free license for students and instructors
    • Viewshare ("free platform for generating and customizing views (interactive maps, timelines, facets, tag clouds) that allow users to experience your digital collections")
    • Visualize Free (online visualization platform that allows uploading of datasets for multiple styles of graph-style visualization; "free visual analysis tool based on the advanced commercial dashboard and visualization software developed by InetSoft")
    • VisualSense ("interactive visualization and analysis tool ... developed for textual and numerical data extracted from image analysis of images from different cultures and influences")
    • WiGis ("visualization of large-scale, highly interactive graphs in a user's web browser. Our software is delivered natively in your web browser and does not require any plug-ins or add-ons. Our method produces clean, smooth animation in a browser through asynchronous data transfer (AJAX), and access to rich server side resources without the need for technologies such as Flash, Java Applets, Flex or Silverlight. We believe that our new techniques have broad reaching potential across the web")
    • yEd (downloadable and online diagramming tools. Functions include the automatic layout of networks and diagrams: "the yFiles library offers the user many advantages, one of which is its ability to automatically draw networks and diagrams. yFiles layout algorithms enable the clear presentation of flow charts, UML diagrams, organization charts, genealogies, business process diagrams, etc.")
  • Diagramming & Graphing Tools:

    • aiSee Graph Visualization ("graphing program for Windows, Mac OS X, and Linux")
    • Gliffy (online diagramming and flow-charting)
    • inzight (downloadable tool for Windows, Mac, Linux; "intelligently draws the appropriate graph depending on the variables you choose"; "automatically detects the variable type as either numeric or categorical, and draws a dot plot, scatter plot, or bar chart.")
    • yEd (downloadable and online diagramming tools. Functions include the automatic layout of networks and diagrams: "the yFiles library offers the user many advantages, one of which is its ability to automatically draw networks and diagrams. yFiles layout algorithms enable the clear presentation of flow charts, UML diagrams, organization charts, genealogies, business process diagrams, etc.")
  • Image Tools:

    • 123D Catch (Autodesk's free phone app for creating 3D scans from photos taken of objects. 3D scans are created by taking many photos of an object from multiple sides and angles, then uploading to Autodesk for processing. Download as a phone app for Android or iOS; complemented by additional online and Windows software for editing and for 3D printing)
    • GIMP (powerful, free software for photo and image editing; runs on Macs, Windows, Linux, and other platforms)
    • Pixlr (Autodesk's online or offline image and photo editing tool; Photoshop-like)

  • Infographics Tools:

    • Dorling Map Generator (creates Dorling maps--i.e., bubble maps of terms and values; requires user to auto-request an IssueCrawler login)
    • Infogr.am
    • Piktochart
    • PinWords ("instantly add beautiful text to your images")
    • ReciteThis (make infographics from quotes)
    • Sprites (tool for building animated infographics with HTML5 elements; freemium pricing)
    • Venngage
  • Network Visualization Tools (see also General or Multiple Purpose Viz Tools and Network Analysis/Social Network Analysis) (see Network Visualization Tutorials)

    • D3.js ("a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS")
    • Gephi ("interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs")
      • (see Gephi Tutorials)
    • MALLET-to-Gephi Data Stacker (online tool that takes "the '--output-doc-topics' output from MALLET and reorganize it into a format that Gephi understands")
    • NodeTrix ("a node-link diagram hybrid visualization with adjacency matrix"; see theory and research behind NodeTrix)
    • NodeXL ("free, open-source template for Microsoft® Excel® 2007, 2010 and (possibly) 2013 that makes it easy to explore network graphs. With NodeXL, you can enter a network edge list in a worksheet, click a button and see your graph, all in the familiar environment of the Excel window")
    • Textexture (online tool that allows users to "visualize any text as a network. The resulting graph can be used to get a quick visual summary of the text, read the most relevant excerpts (by clicking on the nodes), and find similar texts")
    • VOSviewer (Java-based software designed to "create maps based on network data," especially bibliometric networks, e.g., network maps of "publications, authors, or journals based on a co-citation network or to create maps of keywords based on a co-occurrence network")
    • yEd (downloadable and online diagramming tools. Functions include the automatic layout of networks and diagrams: "the yFiles library offers the user many advantages, one of which is its ability to automatically draw networks and diagrams. yFiles layout algorithms enable the clear presentation of flow charts, UML diagrams, organization charts, genealogies, business process diagrams, etc.")

  • Text Visualization Tools (specialized text visualization tools, including word clouds, text difference, text variation) (see also Text Analysis Tools above)

    • History Flow Visualization (tool for visualizing the evolution of documents created by multiple authors) (download site for tool)
    • Textexture (online tool that allows users to "visualize any text as a network. The resulting graph can be used to get a quick visual summary of the text, read the most relevant excerpts (by clicking on the nodes), and find similar texts")
    • Word Cloud Tools (& Related Tools for Visualizing Terms Sized by Value)
      • Bubble Lines (create proportionately sized circles in SVG format by manually entering terms and values, e.g., Wordsworth (16) Keats (4) Byron (68); a Digital Methods Initiative tool)
      • Deduplicate ("Insert a tag cloud, e.g. war (5) peace (6) and the tool will write output 'war' five times and 'peace' six times-- Can be used to input preformatted tag clouds into services like wordle"; allows for manual input of terms and values; a Digital Methods Initiative tool)
      • Tag Cloud Generator ("Input tags and values to produce a tag cloud. Output is in SVG."; allows for manual input of terms and values; a Digital Methods Initiative tool)
      • Tagxedo (word cloud from multiple sources)
      • Wordle ("toy for generating 'word clouds' from text that you provide")
    • Word Tree (tool for online, interactive word trees for texts submitted by users)

  • Time Line Tools:

    • ChronoZoom (open-source project that allows users to create zoomable, Prezi-like timeline-history exhibitions "of everything," on various scales of time-space)
      • About
      • Guide to creating ChronoZoom timelines
    • Histropedia ("Discover a new way to visualise Wikipedia. Choose from over 1.5 million events to create and share timelines in minutes")
    • Simile Widgets (embeddable code for visualizing time-based data, including Timeline, Timeplot, Runway, and Exhibit)
    • Tiki-Toki (web-based platform for creating timelines with multimedia; capable of "3D" timelines)
    • Timeline Builder (online tool for building interactive Flash-based timelines from the Roy Rosenzweig Center for History and New Media)
    • Timeline JS (the Knight Lab's "open-source tool that enables anyone to build visually rich, interactive timelines. Beginners can create a timeline using nothing more than a Google spreadsheet.... Experts can use their JSON skills to create custom installations, while keeping TimelineJS's core functionality")
    • Timemap ("Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously")
  • Twitter Visualization Tools:

    • TAGSExplorer (step-by-step instructions with tools for archiving Twitter event hashtags and creating interactive visualizations of the conversations)
    • TweetBeam (creates "Twitter Wall" to "visualize the conversation around your event")
    • TweetsMap (analyzes and maps geographical location of one's Twitter followers)
    • Visible Tweets ("Visible Tweets is a visualisation of Twitter messages designed for display in public space")
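The Deduplicate tool listed under Word Cloud Tools above takes a tag cloud such as "war (5) peace (6)" and writes out each term as many times as its count. For readers who would rather do the same thing locally, here is a minimal Python sketch of that expansion. The function name is hypothetical, and the sketch assumes single-word terms each followed by a count in parentheses; it is not the tool's own code.

```python
import re


def expand_tag_cloud(cloud):
    """Repeat each term as many times as its parenthesized count."""
    # Assumes each entry looks like: term (number)
    pairs = re.findall(r"(\S+)\s*\((\d+)\)", cloud)
    words = []
    for term, count in pairs:
        words.extend([term] * int(count))
    return " ".join(words)


expanded = expand_tag_cloud("war (5) peace (6)")
```

The resulting string can be pasted into a word cloud generator that sizes words by how often they occur in the input.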

"Deformance" Tools:

(While many tools can be used against-the-grain to "deform" materials for play or discovery, the following are tools expressly designed for this purpose. On "deformance" in the digital humanities, see for example Mark Sample, "Notes Towards a Deformed Humanities")

  • The Eater of Meaning ("tool for extracting the message from the medium. Format and presentation are unaffected, but words and letters are subjected to an elaborate nonsensification process that eliminates semantics root and branch")
  • GIFMelter (creates dynamic, flowing distortions of online images)
  • Glitch Images (interactive interface with sliders to "glitch" imported .jpg images)
  • Image Distortion Tools
    • Distort Images (online tool that places grid over an image; dragging grid points distorts the image)
    • IMGonline.com.ua (online image distortion tool)
    • Photo-kako.com (online tool for applying effects, filters, and distortion effects to images)
  • Ivanhoe Game -- WordPress Theme version | more info about this version (requires WordPress site installed on a local or institutional server) ("This tool is a vibrant reimagining of a game originally developed in the U. Virginia SpecLab. . . . The Ivanhoe Game can be played on any type of cultural object or topic. In Ivanhoe, players assume roles and generate criticism by pretending to be characters or voices relevant to their topic and making moves from those perspectives")
  • N + 7 Machine (English version only; "The N+7 procedure, invented by Jean Lescure of Oulipo, involves replacing each noun in a text with the seventh one following it in a dictionary")
  • Synonym Machine (set of Python scripts that download "famous works of literature and replaces specified parts of speech with random synonyms. The script is currently configured to do this with Moby Dick, in reaction to Robin Sloan's fascinating question: if you replaced every adjective with a close synonym, would it be fair to call this new text by the same title?")
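The N+7 procedure described above (replace each noun with the seventh noun following it in a dictionary) is simple enough to try by hand in a few lines of Python. The sketch below is a simplified illustration, not the N + 7 Machine's own code. It assumes a toy, hypothetical noun list in place of a real dictionary, treats a word as a noun only if it appears in that list (no part-of-speech tagging), and wraps around to the start of the list when it runs past the end.

```python
import re

# Hypothetical toy "dictionary" of nouns, kept in alphabetical order.
NOUNS = sorted([
    "apple", "boat", "cat", "door", "egg", "fish",
    "garden", "hat", "island", "jar", "kite", "lamp",
])


def n_plus(text, nouns=NOUNS, offset=7):
    """Replace each listed noun with the one `offset` entries later in the list."""
    index = {noun: i for i, noun in enumerate(nouns)}

    def swap(match):
        word = match.group(0)
        i = index.get(word.lower())
        if i is None:
            return word  # not in the noun list: leave the word unchanged
        # Wrapping around at the end of the list is an assumption of this sketch.
        return nouns[(i + offset) % len(nouns)]

    return re.sub(r"[A-Za-z]+", swap, text)


result = n_plus("The cat sat by the door.")
```

With this toy list, "cat" becomes "jar" and "door" becomes "kite". Swapping in a larger noun list, or changing `offset`, gives other variants of the same deformance.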

Digital Humanities Examples

(Selected projects and writings chosen to provide beginners in DH with a blend of "best" and "doable" (also: technically advanced and less-advanced) aim points for their own work. The selection is biased toward the work of individuals or small teams, though some more extensive projects are included. Note: this selection is eclectic. It is not intended to be a comprehensive or proportionally accurate sample of DH across the disciplines.)

Projects

King's College, London, Dept. of Digital Humanities

Census of digital humanities projects counted by historical period of the corpus of materials on which project is based (2014)

Binder, Jeff, and Collin Jennings

The Networked Corpus ("provides a new way to navigate large collections of texts. Using a statistical method called topic modeling, it creates links between passages that share common vocabularies, while also showing in detail the way in which the topic modeling program has “read” the texts. We are using the Networked Corpus to analyze earlier genres and concepts of topical knowledge from the development of commonplacing, anthologizing, and indexing in the early modern period through the nineteenth century"; provides Python script for others to use)

Brown, Vincent

Slave Revolt in Jamaica, 1760-1761 ("animated thematic map narrates the spatial history of the greatest slave insurrection in the eighteenth century British Empire") (2013)

Christie, Alex, et al.

Humanities on the Z-axis (from abstract of paper about the project: "Through a combination of techniques in three-dimensional (3D) fabrication, geospatial mapping, speculative computing, and pattern analysis, z-axis research expresses the geospatial narratives of modernist novels by geo-referencing them and then using that geo-data to transform base layers of maps from the modernist period. The output of the research includes warped, 3D maps of cities (e.g., Paris and Dublin) central to modernist literary production. These maps can be viewed as 3D models on a screen or as physical prototypes in hand, and they are currently being transformed using geo-data drawn from novels by Djuna Barnes, James Joyce, and Jean Rhys. Ultimately, they show how modernist authors wrote the city, and findings suggest they contradict existing research in modernist studies about how, exactly, cities are expressed in modernist novels.") See full paper (with illustrations of the project) by Alex Christie, et al. "Modeling How Modernists Wrote the City" (2014).

Des Jardin, Molly

"Geoparsing 19th-Century Travel Narratives" ("This project attempts to divide text narratives by location, and to find representative or key phrases within them. The corpus consists of 19th-century British travel narratives. The methodology uses a number of heuristics, most based on syntax, to identify when the narrator has arrived or is departing a location, and to identify that location. A simple methodology uses frequent nouns and sentences which contain both a frequent noun and an adjective to choose representative phrases, here used as a kind of sentiment analysis.")

Dorsey, Olivia

Franklin Memories: Preservation for a Lifetime (Omeka project built by an undergraduate to showcase the history of a small town in Macon County of North Carolina; "a website that takes the scanned photographs from the photo collections of many families in Franklin, NC. Feel free to click on the images above to start exploring exhibits that include the amazing photographs that I discovered this summer. The photos are organized first by collections, then by exhibits. The collections consist of every family who is included in the overall digital collection. The exhibits include time periods and other aspects of Franklin’s history that should be emphasized.")

Egan, Jim, and Jean Bauer

Mapping Colonial Americas Publishing Project

Emory U. Libraries Digital Scholarship Commons

"Lincoln Logarithms: Finding Meaning in the Sermons" ("We explored the power and possibility of four digital tools—MALLET, Voyant, Paper Machines, and Viewshare")

Fraas, Mitch

"Expanding the Republic of Letters: India and the Circulation of Ideas in the Late Eighteenth Century" (2013)

Ganahl, Simon, et al.

Campus Medius (see "About" essay: mapping project that "explores mediality as an experiential field by focusing on twenty-four hours in a metropolis. Following Mikhail Bakhtin, one might describe 'the day in the city' as a chronotope of the modernist novel—from Andrei Bely's Petersburg via James Joyce's Ulysses to Virginia Woolf's Mrs Dalloway. Our exemplary time-space on the weekend of May 13 and 14, 1933, in Vienna is marked by so-called Turks Deliverance Celebrations held by the paramilitary Home Guard and the Austrian National Socialists")

Goodwin, Jonathan

"Topics in Theory" (2012)

Healy, Kieran

"A Co-Citation Network for Philosophy" (2013) ("I took twenty years worth of articles from four major philosophy journals and generated a network from it based on the citations contained in those articles")

Kaufman, Micki

"Everything on Paper Will Be Used Against Me:" Quantifying Kissinger ("text analysis, visualization and historical interpretation of the DNSA Kissinger correspondence") (2014)

Kinomatics Team

Kinomatics: The Industrial Geometry of Culture ("collects, explores, analyses and represents data about the creative industries. Our research is collaborative and interdisciplinary. Our current focus is on the spatial and temporal dimensions of international film flow and the location of Australian live music gigs"; see also Deb Verhoeven, "Big Data at the Movies: The Kinomatics Project")

Moa, Belaid, and Jana Millar Usiskin

Making Models of Modernism (2014) (topic modeling Modernist literary works)

Mullen, Lincoln

Mapping the Spread of American Slavery

Pierazzo, Elena, and Julie André

"Around a Sequence and Some Notes of Notebook 46: Encoding Issues About Proust's Drafts"

Shayne, Liz

"Sefaria in Gephi: Seeing Links in Jewish Literature" (2014)

Wilhelm, Thomas, Manuel Burghardt, and Christian Wolff

To See or Not to See: An Interactive Tool for the Visualization and Analysis of Shakespeare's Plays (2013)

Essays and Books

Cordell, Ryan

"'Taken Possession of': The Reprinting and Reauthorship of Hawthorne's 'Celestial Railroad' in the Antebellum Religious Press" (2013)

Heuser, Ryan, and Long Le-Khac

"A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method" [PDF] (2012)

Underwood, Ted, Hoyt Long, and Richard Jean So

"Cents and Sensibility: Trust Thomas Piketty on Economic Inequality, Ignore What He Says About Literature" (2014)

Fagg, John, Matthew Pethers, and Robin Vandome

"Introduction: Networks and the Nineteenth-Century Periodical" [PDF] (2013) [paywalled]

Ross, Stephen, and Jentery Sayers

"Modernism Meets Digital Humanities" (2014) [paywalled]

Moretti, Franco

"Network Theory, Plot Analysis" [PDF] (2012)

Finn, Ed

"Revenge of the Nerd: Junot Díaz and the Networks of American Literary Imagination" (2013) ("My methodology in pursuing these claims is to define a framework for "the literary" in contemporary American fiction by asking how books are contextualized and discussed not just among critics and scholars but also among a general readership online. Digital traces of book culture (by which I mean user reviews, ratings and the algorithmic trails that our browsing and purchasing actions leave online) allow us to make claims about relatively large groups of readers and consumers of books, creating opportunities for the ‘distant reading’ of literary fame, but without losing the specificity of individual texts and authors.")

Klingenstein, Sara, Tim Hitchcock, and Simon DeDeo

"The Civilizing Process in London’s Old Bailey" [PDF] (2014)

Underwood, Ted, and Jordan Sellers

"The Emergence of Literary Diction" (2012)

Klein, Lauren F.

"The Image of Absence: Archival Silence, Data Visualization, and James Hemings" (2013) [paywalled]

Rhody, Lisa M.

"Topic Modeling and Figurative Language" (2012)

Rettberg, Jill Walker

"Visualising Networks of Electronic Literature: Dissertations and the Creative Works They Cite" (2014)
° Accompanied by Gephi dataset ("This is the 'clean,' unedited Gephi file I used to visualise the connections between 44 dissertations about electronic literature and the creative works they cite. This file is perfect if you want to download it, load it into Gephi and try visualising the data yourself. I provide a tutorial here: http://jilltxt.net/?p=3730")

Underwood, Ted, and Andrew Goldstone

"What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?" (2013)

Jockers, Matthew L.

Macroanalysis: Digital Methods and Literary History (Champaign, IL: University of Illinois Press, 2013) Print.

Data Collections and Datasets

Starter Kit:

Demo Corpora (Small to moderate-sized text collections for teaching, text-analysis workshops, etc.)
Quick Start (Common text analysis tools & resources to get instructors and students started quickly)

Demo Corpora (Text Collections Ready for Use)

Demo corpora are sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc. Ideal collections for this purpose are public domain or open access, plain-text, relatively modest in number of files, organized neatly in a folder(s), and downloadable as a zip file. (Contributions welcome: if you have demo collections you have created, please email [email protected].)
(Note: see separate section in the DH Toychest for linguistic corpora--i.e., corpora usually of various representative texts or excerpts designed for such purposes as corpus linguistics.)

Plain-text collections downloadable as zip files:

  • General Collections

  • Historical Materials

    • U.S. Presidents
      • U.S. Presidents' Inaugural Speeches (all 57 inaugural speeches from Washington through Obama collected from the American Presidency Project with the assistance of project co-director John T. Woolley; assembled as individual plain-text files by Alan Liu) (zip file)
      • Abraham Lincoln
        • Lincoln Speeches & Letters (84 works and excerpts from Project Gutenberg's version of Speeches and Letters of Abraham Lincoln, 1832-1865, ed. Merwin Roe, 1907; assembled by Alan Liu as separate files for each work) (zip file) (metadata)
    • DocSouth Data (selected collections from the Documenting The American South initiative at the University of North Carolina, Chapel Hill. Contains collections that have been packaged for text analysis. Each is a zip file in which a folder named "data" includes a "toc.csv" file of metadata and subfolders for both plain text and xml versions of the documents in the collection) (See additional literary materials in DocSouth Data)
      • The Church in the Southern Black Community
      • First-Person Narratives of the American South ("a collection of diaries, autobiographies, memoirs, travel accounts, and ex-slave narratives written by Southerners. The majority of materials in this collection are written by those Southerners whose voices were less prominent in their time, including African Americans, women, enlisted men, laborers, and Native Americans")
      • North American Slave Narratives ("The North American Slave Narratives collection at the University of North Carolina contains 344 items and is the most extensive collection of such documents in the world")
  • Michigan State University Libraries Text Collections

  • Literature

    • DocSouth Data (selected collections from the Documenting The American South initiative at the University of North Carolina, Chapel Hill. Contains collections that have been packaged for text analysis. Each is a zip file in which a folder named "data" includes a "toc.csv" file of metadata and subfolders for both plain text and xml versions of the documents in the collection) (see additional historical materials in DocSouth Data)
      • First-Person Narratives of the American South ("a collection of diaries, autobiographies, memoirs, travel accounts, and ex-slave narratives written by Southerners. The majority of materials in this collection are written by those Southerners whose voices were less prominent in their time, including African Americans, women, enlisted men, laborers, and Native Americans")
      • Library of Southern Literature
      • North American Slave Narratives ("The North American Slave Narratives collection at the University of North Carolina contains 344 items and is the most extensive collection of such documents in the world")
  • Fiction from the 1880s (sample corpora assembled from Project Gutenberg by students in Alan Liu's English 197 course, Fall 2014 at UC Santa Barbara) (zip files below)

  • Shakespeare plays (24 plays from Project Gutenberg assembled by David Bamman, UC Berkeley School of Information) (zip file)

  • txtLAB450 - A Multilingual Data Set of Novels for Teaching and Research (from the .txtLab at McGill U.: a collection of 450 novels "published in English, French, and German during the long nineteenth century (1770-1930). The novels are labeled according to language, year of publication, author, title, author gender, point of view, and word length. They have been labeled as well for use with the stylo package in R. They are drawn exclusively from full-text collections and thus should not have errors comparable to OCR’d texts." Download the plain-text novels as a zip file | Download the associated metadata as a .csv file.)

  • William Wordsworth

    • (with Samuel Taylor Coleridge) Lyrical Ballads, 1798 (assembled by Alan Liu from Project Gutenberg as separate files for each poem and the "Advertisement") (zip file) (metadata)
    • The Prelude, 1850 version (William Knight's 1896 edition assembled by Alan Liu from Project Gutenberg as separate plain-text files for each book and cleaned of line numbers, notes, and note numbers) (zip file) (metadata)
  • Miscellaneous

    • Demo text collections assembled by David Bamman, UC Berkeley School of Information:
      • Book summaries (2,000 book summaries from Wikipedia) (zip file)
      • Film summaries (2,000 movie summaries from Wikipedia) (zip file)
    • U.S. patents related to the humanities, 1976-2015 (patents mentioning "humanities" or "liberal arts" located through the U.S. Patent Office's searchable archive of fully-digital patent descriptions since 1976. Collected for the 4Humanities WhatEvery1Says project. Idea by Jeremy Douglass; files collected and scraped as plain text by Alan Liu)
      • Metadata (Excel spreadsheet)
      • Humanities Patents (76 patents related to humanities or liberal arts) (zip file)
      • Humanities Patents - Extended Set (336 additional patents that mention the humanities or liberal arts in a peripheral way--e.g., only in reference citations and institutional names of patent holders, as minor or arbitrary examples, etc.) (zip file)
      • Humanities Patents - Total Set (412 patents; combined set of above "Humanities Patents" and "Humanities Patents - Extended Set") (zip file)
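
The DocSouth Data packages listed above are described as zip files whose "data" folder holds a "toc.csv" metadata file. As a minimal sketch of how one might read that table of contents without unpacking the archive, the following uses only the Python standard library. The column names ("filename", "title") and the in-memory demo archive are hypothetical stand-ins, not the actual DocSouth layout; check a real download for its column headings.

```python
import csv
import io
import zipfile


def read_toc(zip_source):
    """Return the rows of data/toc.csv from a DocSouth-style zip as dicts."""
    with zipfile.ZipFile(zip_source) as archive:
        # Accept "data/toc.csv" at the top level or nested in a parent folder.
        toc_name = next(
            (name for name in archive.namelist()
             if name.endswith("data/toc.csv")),
            None,
        )
        if toc_name is None:
            raise FileNotFoundError("no data/toc.csv found in archive")
        with archive.open(toc_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Demo with a tiny in-memory archive (hypothetical columns and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr(
        "data/toc.csv",
        "filename,title\nexample1.txt,An Example Narrative\n",
    )
buffer.seek(0)
rows = read_toc(buffer)
print(rows)
```

With a real download, `read_toc` can be given the path to the zip file instead of the in-memory buffer.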

Sites containing large numbers of books, magazines, etc., in plain-text format (among others) that must be downloaded individually:

Quick Start (Common Text-Analysis Tools & Resources)

A minimal selection of common tools and resources to get instructors and students started working with text collections quickly. (These tools are suitable for use with moderate-scale collections of texts, and do not require setting up a Python, R, or other programming-language development environment, which is typical for advanced, large-scale text analysis.)

Text-Preparation Tools

(Start here to clean, section, and wrangle texts to optimize them for text analysis)

  • Lexos (Included in the Lexos online text-analysis workflow platform are tools for uploading, cleaning [scrubbing] texts, sectioning texts [cutting or chunking], and applying stopwords, lemmatization, and phrase consolidations)
  • [See also fuller list of Text-Preparation Tools indexed on the tools page of DH Toychest]

Text Analysis Tools

  • AntConc (Concordance program with multiple capabilities commonly used by the corpus linguistics research community; with versions for Windows, Mac & Linux; site includes video tutorials. The tool can be used to generate a word frequency list, view words in a KWIC concordance view, locate word clusters and n-grams, and compare word usage to a reference corpus such as a corpus linguistics corpus of typical language use in a nation and time period)
  • Lexos (Online integrated workflow platform for text analysis that allows users to upload texts, prepare and clean them in various ways, and then perform cluster analyses of various kinds and create visualizations from the results; can also be installed on one's own server) (Note: Lexos has recently added the capability to import Mallet 'topic-counts' output files to visualize topics as word clouds, and also to convert topic-counts files into so-called "topic documents" that facilitate cluster analyses of topics. See Lexos > Visualize > Multicloud)
  • Overview (open-source web-based tool designed originally for journalists needing to sort large numbers of stories automatically and cluster them by subject/topic; includes visualization and reading interface; allows for import of documents in PDF and other formats. "Overview has been used to analyze emails, declassified document dumps, material from Wikileaks releases, social media posts, online comments, and more." Can also be installed on one's own server)
  • Voyant Tools (Online text reading and analysis environment with multiple capabilities that presents statistics, concordance views, visualizations, and other analytical perspectives on texts in a dashboard-like interface. Works with plain text, HTML, XML, PDF, RTF, and MS Word files [multiple files best uploaded as a zip file]. Also comes with two pre-loaded sets of texts to work on (Shakespeare's works and the Humanist List archives [click the “Open” button on the main page to see these sets]))
  • Topic Modeling: (Currently, Mallet is the standard, off-the-shelf tool that scholars in the humanities use for topic modeling. [Specifically, Mallet is an "LDA" or Latent Dirichlet Allocation topic modeling tool. For good "simple" explanations of LDA topic modeling intended for humanist and other scholars, see Edwin Chen and Ted Underwood's posts.] It is a command-line tool that requires students to fuss with exact strings of commands and path names. But the few GUI-interface implementations that now exist such as the Topic Modeling Tool do not allow for enough customization of options for serious exploration.)
    • Mallet (Download source for Mallet, which must be installed in your local computer's root directory. See the Programming Historian's excellent tutorial for installing and starting with Mallet.)
  • [See also fuller list of Text-Analysis Tools indexed on the tools page of DH Toychest]
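
To make two of the operations mentioned above concrete (a word frequency list and a KWIC, or keyword-in-context, view), here is a minimal Python sketch. It is not how AntConc or Voyant implement these features; the tokenizer is a deliberately crude assumption (lowercase letters and apostrophes only), and the sample sentence is invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


sample = "The whale was white. The sea was dark, and the whale was gone."
tokens = tokenize(sample)
freqs = word_frequencies(tokens)

print(freqs.most_common(3))
for left, word, right in kwic(tokens, "whale", window=2):
    print(f"{left:>10} [{word}] {right}")
```

For real corpora, a dedicated tool or library will handle punctuation, hyphenation, and non-English text far better than this regular expression does.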

Stopwords

(Lists of frequent, functional, and other words with little independent semantic value that text-analysis tools can be instructed to ignore--e.g., by loading a stopword list into AntConc [instructions, at 8:15 in tutorial video] or Mallet [instructions]. Researchers conducting certain kinds of text analysis (such as topic modeling) where common words, spelled-out numbers, names of months/days, or proper nouns such as "John" indiscriminately connect unrelated themes often apply stopword lists. Usually they start with a standard stopword list such as the Buckley-Salton or Fox lists and add words tailored to the specific texts they are studying. For instance, typos, boilerplate words or titles, and other features specific to a body of materials can be stopped out. For other purposes in text analysis such as stylistic study of authors, nations, or periods, common words may be included rather than being stopped out because their frequency and participation in patterns offer meaningful clues.)

  • English:
    • Buckley-Salton Stoplist (571 words) (1971; information about list)
    • Fox Stoplist (421 words) (1989; information about list)
    • Mallet Stoplist (523 words) (default English stop list hard-coded into Mallet topic modeling tool; stoplists for German, French, Finnish, and Japanese also included in "stoplists" folder in local Mallet installations) (Note: Use the default English stoplist in Mallet by adding the option "--remove-stopwords" in the command string when inputting a folder of texts. To add stopwords of your own or another stopwords file, create a separate text file and additionally use the command "--extra-stopwords filename". See Mallet stopwords instructions)
    • Jockers Expanded Stoplist (5,621 words, including many proper names) ("the list of stop words I used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.")
    • Goldstone-Underwood Stoplist (6,032 words) (2013; stopword list used by Andrew Goldstone and Ted Underwood in their topic modeling work with English-language literary studies journals) (text file download link)
  • Other Languages:
    • Kevin Bougé Stopword Lists Page ("stop words for Arabic, Armenian, Brazilian, Bulgarian, Chinese, Czech, Danish, Dutch, English, Farsi, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish")
    • Ranks NL Stopword Lists Page (stop words for Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Latvian, Lithuanian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu)
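
The workflow described at the top of this section (start from a standard stoplist, then add words specific to the texts at hand) can be sketched in a few lines of Python. This is an illustration only: the five-word "standard" list is a hypothetical stand-in for a real list such as Buckley-Salton or Fox, and the custom words are invented examples of a proper name and a boilerplate term.

```python
import re


def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Tokenize crudely on letters and drop any token in the stopword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


# Hypothetical mini "standard" list; a real one would have hundreds of words.
standard = load_stopwords(["the", "of", "and", "a", "in"])
# Corpus-specific additions: a proper name and a boilerplate word.
custom = load_stopwords(["john", "chapter"])
stoplist = standard | custom

kept = remove_stopwords(
    "Chapter One: John walked in the garden of the abbey.", stoplist
)
print(kept)
```

Because `load_stopwords` accepts any iterable of lines, an open text file containing one stopword per line can be passed to it directly.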

Reference Linguistic Corpora

(Curated corpus-linguistics corpora of language samples from specific nations and time periods designed to be representative of prevalent linguistic usage. May be used online [or downloaded for use in AntConc] as reference corpora for the purpose of comparing linguistic usage in specific texts against broader usage)

  • Corpus.byu.edu (Mark Davies and Brigham Young University's excellent collection of linguistic corpora)

Document/Image Collections

API = has API for automated data access or export
Img = Image archives in public domain

  • ARTFL: Public Databases (American and French Research on the Treasury of the French Language) (expansive collection of French-language resources in the humanities and other fields from the 17th to 20th centuries)
  • Avant-garde and Modernist Magazines (Monoskop guide to modernist avant-garde magazines; includes extensive links to online archives and collections of the magazines)
  • British Library Images ("We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images to us, allowing us to release them back into the Public Domain")
  • BT (British Telecom) Digital Archives ("BT has teamed up with Coventry University and The National Archives to create a searchable digital resource of almost half a million photographs, reports and items of correspondence preserved by BT since 1846.... collection showcases Britain’s pioneering role in the development of telecommunications and the impact of the technology on society")
  • CELL (created by the Electronic Literature Organization, "Consortium on Electronic Literature (CELL) is an open access, non-commercial resource offering centralized access to literary databases, archives, and institutional programs in the literary arts and scholarship, with a focus on electronic literature")
  • Creative Commons Img
  • Digging Into Data Challenge - List of Data Repositories
  • Digital Public Library of America ("brings together the riches of America’s libraries, archives, and museums, and makes them freely available to the world") API
    • Apps and API's drawing on DPLA:
  • Digital Repositories from around the world (listed by EHPS, European History Primary Sources)
  • ELMCIP Electronic Literature Knowledge Base ("cross-referenced, contextualized information about authors, creative works, critical writing, and practices")
  • EEBO-TCP Texts (public-domain digitized texts from the Early English Books Online / Text Creation Partnership; "EEBO-TCP is a partnership with ProQuest and with more than 150 libraries to generate highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database.... these trace the history of English thought from the first book printed in English in 1475 through to 1700.")
  • EPC (Electronic Poetry Center)
  • Europeana ("digitised collections of museums, libraries, archives and galleries across Europe") API
  • Flickr Commons Img
  • Folger Library Digital Image Collection Img ("tens of thousands of high resolution images from the Folger Shakespeare Library, including books, theater memorabilia, manuscripts, and art. Users can show multiple images side-by-side, zoom in and out, view cataloging information when available, export thumbnails, and construct persistent URLs linking back to items or searches")
  • French Revolution Digital Archive Img (collaboration of the Stanford University Libraries and the Bibliothèque nationale de France to put online "high-resolution digital images of approximately 12,000 individual visual items, primarily prints, but also illustrations, medals, coins, and other objects, which display aspects of the Revolution")
  • Gallica (documents and images from Gallica, the "digital library of the Bibliothèque nationale de France and its partners")
  • GDELT: The Global Database of Events, Language, and Tone (downloadable datasets from "initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first 'realtime social sciences earth observatory'"; "Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates")
  • Getty Embeddable Images (major collection of stock and archival photos that includes over 30 million images that can be embedded in an iframe on a user's web page; in the embeddable image collection, hover over a photo and select the "</>" icon to get the HTML code for embedding)
  • Google Advanced Image Search Img (can be used to search by usage rights)
  • Google Maps Gallery (maps and map data from Google and content creators publishing their data on Google Maps)
  • HathiTrust Digital Library ("international partnership of more than 50 research institutions and libraries ... working together to ensure the long-term preservation and accessibility of the cultural record"; "more than 8 million volumes, digitized from the partnering library collections"; "more than 2 million of these volumes are in the public domain and freely viewable on the Web. Texts of approximately 120,000 public domain volumes in HathiTrust are available immediately to interested researchers. Up to 2 million more may be available through an agreement with Google that must be signed by an institutional sponsor. More information about obtaining the texts, including the agreement with Google, is available at http://www.hathitrust.org/datasets")
  • See also HathiTrust Research Center under Datasets below.
  • HuNI ("unlocking and uniting Australia's cultural datasets")
  • Internet Archive
    • Digital Books Collections
    • Book Images Img (millions of images from books that the Internet Archive has uploaded to its Flickr account; images are accompanied by extensive metadata, including information on location in original book, text immediately before and after the image, any copyright restrictions that apply, etc. Images also tagged to enhance searching)
  • Isidore (French-language portal that "allows researchers to search across a wide range of data from the social sciences and humanities. ISIDORE harvests records, metadata and full text from databases, scientific and news websites that have chosen to use international standards for interoperability")
  • The Japanese American Evacuation and Resettlement: A Digital Archive (UC Berkeley) (Approximately 100,000 images and pages of text; searches by creator, genre of document, and confinement location produce records with downloadable PDF's of original documents [OCR'd])
  • The Mechanical Curator (public-domain, "randomly selected small illustrations and ornamentations, posted on the hour. Rediscovered artwork from the pages of 17th, 18th and 19th Century books")
  • Media History Digital Library ("non-profit initiative dedicated to digitizing collections of classic media periodicals that belong in the public domain for full public access"; "digital scans of hundreds of thousands of magazine pages from thousands of magazine issues from 1904 to 1963") (to download plain-text versions of material, select a magazine and volume, click on the "IA Page" link, and on the resulting Internet Archive page for the volume click on the "All Files: HTTPS:" option; then save the file that ends "_djvu.txt")
  • Metadata Explorer (searches Digital Public Library of America, Europeana, Digital New Zealand, Harvard, and other major library and collected repository metadata; then generates interactive network graphs for exploring the collections)
  • Metropolitan Museum of Art (NYC) Img (400,000 downloadable hi-res public domain images from the museum's collection, identified with an icon for "OASC" or "Open Access for Scholarly Content"; see FAQ for OASC images)
  • National Archives (U. S.)
  • Nebraska Newspapers ("300,000 full-text digitized pages of 19th and early 20th Century newspapers from selected communities in Nebraska that can be used for text mining...TIFF images, JPEG2000, and PDFs with hidden text. Optical character recognition has been performed on the scanned images, resulting in dirty OCR")
  • New York Times TimesMachine (requires NY Times subscription; provides searchable facsimile and full-text PDF access to historical archives of the Times before 1980)
  • OAKsearch (portal for searching across multiple open-access collections for scholarly articles that are "digital, online, free-of-charge, and free of most copyright and licensing restrictions")
  • Open Images Img ("open media platform that offers online access to audiovisual archive material to stimulate creative reuse. Footage from audiovisual collections can be downloaded and remixed into new works. Users of Open Images also have the opportunity to add their own material to the platform and thus expand the collection")
  • Open Library ("We have well over 20 million edition records online, provide access to 1.7 million scanned versions of books, and link to external sources like WorldCat and Amazon when we can. The secondary goal is to get you as close to the actual document you're looking for as we can, whether that is a scanned version courtesy of the Internet Archive, or a link to Powell's where you can purchase your own copy")
  • Oxford University Text Archive
  • Perseus Digital Library ("Perseus has a particular focus upon the Greco-Roman world and upon classical Greek and Latin.... Early modern English, the American Civil War, the History and Topography of London, the History of Mechanics, automatic identification and glossing of technical language in scientific documents, customized reading support for Arabic language, and other projects that we have undertaken allow us to maintain a broader focus and to demonstrate the commonalities between Classics and other disciplines in the humanities and beyond")
  • Powerhouse Museum (Sydney) API
  • Project Gutenberg (42,000 free ebooks) [limited automated access (see also tips on downloading from Project Gutenberg)]
  • Ranking Web of Repositories (extensive ranked listings, with links, to world online repositories that include peer-reviewed papers)
  • Shared Shelf Commons ("free, open-access library of images. Search and browse collections with tools to zoom, print, export, and share images")
  • SNAC (Social Networks & Archival Contexts) | Prototype
  • Trove ("Find and get over 355,846,887 Australian and online resources: books, images, historic newspapers, maps, music, archives and more") API
  • VADS Online Source for Visual Arts Img ("visual art collections comprising over 100,000 images that are freely available and copyright cleared for use in learning, teaching and research in the UK")
  • YAGO2s ("huge semantic knowledge base, derived from Wikipedia WordNet and GeoNames") [downloadable metadata]
  • Wellcome Collection of Historical Images (CC licensed)

Linguistic Corpora

(A "corpus" is a large collection of writings, sentences, and phrases. In linguistics, corpora cover particular nationalities, periods, and other kinds of language for use in the study of language.)

Map Collections

  • David Rumsey Map Collection (includes historical maps)
  • Library of Congress "American Memory: Map Collections" (focused on "Americana and Cartographic Treasures of the Library of Congress. These images were created from maps and atlases and, in general, are restricted to items that are not covered by copyright protection")
  • Mapping History (provides "interactive and animated representations of fundamental historical problems and/or illustrations of historical events, developments, and dynamics. The material is copyrighted, but is open and available to academic users...")
  • National Historical Geographic Information System (NHGIS) ("free of charge, aggregate census data and GIS-compatible boundary files for the United States between 1790 and 2011")
  • Old Maps Online ("easy-to-use gateway to historical maps in libraries around the world"; "All copyright and IPR rights relating to the map images and metadata records are retained by the host institution which provided them and any query regarding them must be addressed to that particular host institution")

Datasets (Public / Open Datasets)

(Includes some datasets accessible only through APIs, especially if accompanied by code samples or embeddable code for using the APIs.) [This section is still being collected; organization into sub-categories by discipline or topic may occur in the future.]
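For datasets reachable only through an API, the usual pattern is to build a query URL and then read the fields you need out of a JSON response. The following is a minimal sketch of that pattern in Python, not the interface of any particular service listed here: the endpoint, the parameter names, and the sample response are all hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not a real service).
BASE_URL = "https://api.example.org/v1/search"


def build_query_url(base_url, **params):
    """Return base_url with params appended as a URL-encoded query string.

    Parameters are sorted by name so the resulting URL is reproducible.
    """
    return f"{base_url}?{urlencode(sorted(params.items()))}"


def extract_titles(response_text):
    """Return the 'title' field of each record in a JSON response body."""
    payload = json.loads(response_text)
    return [record["title"] for record in payload.get("results", [])]


# Invented sample response standing in for what an API might return.
SAMPLE_RESPONSE = (
    '{"results": [{"title": "Historic map"}, {"title": "Newspaper issue"}]}'
)

url = build_query_url(BASE_URL, q="historic newspapers", format="json")
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

Each real API defines its own endpoint, parameters, authentication, and response layout, so consult the provider's documentation before adapting the sketch.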

Datasets in Other Disciplines (sample datasets for social-science, demographic, ethnicity/diversity, economic, health, media, and other research)

  • American National Election Studies datasets
  • Association of Religion Data Archives (ARDA) ("collection of surveys, polls, and other data submitted by researchers and made available online ... nearly 775 data files")
  • Australian Data Archive (ADA)
  • Berman Jewish DataBank (North America)
  • Bureau of Labor Statistics (U. S.)
  • CAIDA (Cooperative Association for Internet Data Analysis): Datasets, Monitors, and Reports ("Collection and sharing of data for scientific analysis of Internet traffic, topology, routing, performance, and security-related events is one of CAIDA's core objectives")
  • Cat Dataset ("10,000 cat images. For each image, we annotate the head of cat with nine points, two for eyes, one for mouth, and six for ears")
  • CDC Wonder (public health data and related data from the U. S. Centers for Disease Control & Prevention)
  • Census Bureau Data Tools and Apps (data from the U. S. Census Bureau)
  • Center for International Earth Science Information Network (CIESIN) Data Links By Subject
  • Center for Population Research in LGBT Health: Data Resources
  • China Data Center (from U. Michigan)
  • Data is Plural (extensive directory of public datasets of many kinds by Jeremy Singer-Vine, including many curious and small ones--e.g., the number of squirrels in New York's Central Park according to a squirrel count)
  • DataFerrett ("data analysis and extraction tool to customize [U. S.] federal, state, and local data to suit your requirements.... you can develop an unlimited array of customized spreadsheets that are as versatile and complex as your usage demands then turn those spreadsheets into graphs and maps without any additional software")
  • Data and Story Library (DASL) (pedagogically oriented site with large number of sample datasets accompanied by "stories" that apply "a particular statistical method to a set of data")
  • Data.gov ("home of the US government’s open data. You can find Federal, state and local data, tools, and resources to conduct research, build apps, design data visualizations, and more")
  • Data.gov.uk
  • Data on the Net ("Search or browse our listing of 363 Internet sites of numeric Social Science statistical data, data catalogs, data libraries, social science gateways, addresses and more")
  • Datasets for Data Mining and Data Science (from KDnuggets)
  • DiversityData.org (U. S.) ("Create customized reports describing over 100 measures of diversity, opportunity, and quality of life for 362 metropolitan areas")
  • Economic Policy Institute Datazone
  • Enron Email Dataset ("data from about 150 users, mostly senior management of Enron ... contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation")
    • EnronSent Corpus ("special preparation of a portion of the Enron Email Dataset designed specifically for use in Corpus Linguistics and language analysis ... created by cleaning up a portion of the original Enron Corpus. It contains 96,107 messages from the 'Sent Mail' directories of all the users in the corpus. ... an attempt has been made to remove as much non-human generated text as possible from the raw messages in the original data")
  • Global Health Observatory (GHO) (from the World Health Organization)
  • Global Terrorism Database (GTD) ("open-source database including information on terrorist events around the world from 1970 through 2012 [with annual updates planned for the future].... includes systematic data on [U. S.] domestic as well as international terrorist incidents ... includes more than 113,000 cases")
  • [Google Datasets Search Engine]
  • Higher Education Datasets (U. S.) (from Data.gov)
  • Homeland Security (U. S.) Data and Statistics (from U. S. Department of Homeland Security)
  • ICWSM-2009 Blogs Dataset ("44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. ... formatted in XML and is further arranged into tiers approximating to some degree search engine ranking.... To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed ... you will be sent a URL and password where you can download the collection")
  • Immigration Statistics (from U. S. Department of Homeland Security)
  • Infochimps Financial Datasets (including stock market historical datasets in csv format)
  • Mexican Migration Project (MMP)
  • National Archive of Criminal Justice Data (U. S.)
  • National Atlas Data Download (U. S.)
  • New York Times - Developers (Search by API) ("why just read the news when you can hack it?"; APIs for accessing headlines, abstracts, first paragraphs, links, etc. to NYT data; includes APIs for articles, best sellers, comments by users, most popular items, newswire, and other parts of the Times)
  • Open Context: Downloadable Data Tables (archaeological research)
  • Pew Research Center Datasets (datasets for the following Pew Research projects: People & the Press; Journalism; Hispanic Trends; Global Attitudes; Internet and American Life; Social & Demographic Trends; Religion & Public Life)
  • Public Datasets (list created by Vivek Patil)
  • Qualitative Data Repository (QDR) ("selects, ingests, curates, archives, manages, durably preserves, and provides access to digital data used in qualitative and multi-method social inquiry"; Syracuse U.)
  • Quora List of Large Public Datasets
  • Reddit 2.5 Million Posts ("dataset of the all-time top 1,000 posts, from the top 2,500 subreddits by subscribers, pulled from reddit between August 15–20, 2013")
  • Réseau Quetelet (French social-science datasets)
  • Resource Center for Minority Data (RCMD) (U. S.)
  • Robert Seaton, "100+ Interesting Data Sets for Statistics" ("Looking for interesting data sets? Here's a list of more than 100 of the best stuff, from dolphin relationships to political campaign donations to death row prisoners") (2014)
  • Sharing Ancient Wisdoms (SAW): Data (RDF data files and other data for project on "collections of ideas and opinions – ranging from pithy sayings to short passages from longer philosophical texts - which make up the ancient genre of Wisdom Literature")
  • SMS Spam Collection ("public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-encoded messages, tagged according being legitimate (ham) or spam")
  • Social Computing Data Repository (datasets for social media research from Arizona State U.)
  • Spambase Data Set (University of California, Irvine's dataset)
  • Stanford Large Network Dataset Collection (datasets of nodes and edges for social networks, communication networks, citation networks, collaboration networks, Amazon networks, road networks, Twitter, online communities, etc.)
  • Terrorism and Preparedness Data Resource Center (U. Michigan; "archives and distributes data collected by government agencies, non-governmental organizations (NGOs), and researchers about the nature of intra- (domestic) and international terrorism incidents, organizations, perpetrators, and victims; governmental and nongovernmental responses to terror, including primary, secondary, and tertiary interventions; and citizen's attitudes towards terrorism, terror incidents, and the response to terror")
  • Texas Department of Criminal Justice Death Row Information (dataset of last words of prisoners executed since 1984)
  • Time Series Data Library
  • Twitter Data set for Arabic Sentiment Analysis Data Set
  • UK Data Archive
  • UNdata (data from the Statistics Division of the United Nations Department of Economic and Social Affairs; includes data access by API)
  • University of California, Irvine, Machine Learning Repository Datasets
  • University College Dublin's Open Data Sets of Social Networks (overview)
  • U. S. National Survey on Drug Use and Health, 2012 (datasets)
  • U. S. Survey of Inmates in State and Federal Correctional Facilities, 2004 (datasets)
  • World Bank Poverty & Equity Data
  • World Data Center for Human Interactions in the Environment (from Columbia U.)
  • World Values Survey (WVS) ("global research project that explores people’s values and beliefs, their stability or change over time and their impact on social and political development of the societies in different countries of the world")
  • Yelp Academic Dataset ("Yelp is providing all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research. We've provided some examples on GitHub to get you started. To get them running, you will need to install MRJob, our python framework for Map-Reduce computing")
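
Several of the datasets above, such as the SMS Spam Collection, consist of labeled text records. As a first exploratory step, it can help to tally how many records carry each label. The sketch below assumes one record per line with the label and the message separated by a tab; that layout, the function names, and the sample lines are assumptions for illustration, so check the actual file format of whichever dataset you download.

```python
from collections import Counter

# Invented sample lines; real data would be read from the downloaded file.
SAMPLE_LINES = [
    "ham\tSee you at the library at five",
    "spam\tYou have won a free prize, reply now",
    "ham\tRunning late, start without me",
]


def parse_labeled_line(line):
    """Split one 'label<TAB>text' line into a (label, text) pair."""
    label, _, text = line.rstrip("\n").partition("\t")
    return label, text


def count_labels(lines):
    """Count how many non-blank lines carry each label."""
    return Counter(
        parse_labeled_line(line)[0] for line in lines if line.strip()
    )


label_counts = count_labels(SAMPLE_LINES)
print(label_counts)
```

To run this on a real file, pass an open file object to `count_labels` in place of `SAMPLE_LINES`.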

License

CC0

To the extent possible under law, Richard Dennis has waived all copyright and related or neighboring rights to this work.