more, better example corpora

Open bdewilde opened this issue 9 years ago • 8 comments

textacy currently has one small example corpus (the "Bernie and Hillary" corpus, containing 3000 speeches and basic metadata from the Congressional Record) and readers for two very large corpora (streams of Wikipedia pages and Reddit comments from standardized, publicly available database dumps). We want more options.

potential datasets / options

  • thousands of (mostly old) books are available at Project Gutenberg
  • U.S. Supreme Court decisions (see here for an example of how to get these documents)
  • larger, more composable collection of Congressional speeches from the Sunlight Foundation Capitol Words API that would enable subsetting by speaker to get, say, the equivalent of Bernie and Hillary
  • net neutrality comments on the FCC website, also via Sunlight Foundation here
  • thousands of descriptions of websites in a variety of categories at JC-Bingo (note: I can't find any information on TOS)
  • a streaming reader for the Enron email corpus (downloadable here)
  • a streaming reader for the Ontonotes 5 corpus
  • better filtering options for the Wikipedia and Reddit corpus readers; say, by Wikipedia category or subreddit

There are lots of other options! The only requirement is that the license / terms of service don't prohibit free, public distribution of the data.

implementation in textacy

  • stream one document at a time from disk
  • filter or group by some parameters or metadata
  • metadata in addition to text (preferred)
  • variety of content compared to other available corpora (preferred)
  • what else...?
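To make the first two requirements concrete, here is a rough sketch of what a streaming reader with metadata filtering could look like. It is not textacy's actual API: the class name `JsonLinesCorpusReader`, the `match` and `limit` parameters, and the assumed on-disk format (one JSON object per line with a `"text"` field plus arbitrary metadata fields) are all hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, Optional

Record = Dict[str, Any]


class JsonLinesCorpusReader:
    """Hypothetical reader (not part of textacy).

    Assumes a file with one JSON object per line, where each object has
    a "text" field and any number of metadata fields.
    """

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def records(
        self,
        match: Optional[Callable[[Record], bool]] = None,
        limit: Optional[int] = None,
    ) -> Iterator[Record]:
        """Yield one record at a time, so the whole file never sits in memory."""
        n_yielded = 0
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                if limit is not None and n_yielded >= limit:
                    break
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                # keep the record only if it passes the metadata filter
                if match is not None and not match(record):
                    continue
                yield record
                n_yielded += 1

    def texts(
        self,
        match: Optional[Callable[[Record], bool]] = None,
        limit: Optional[int] = None,
    ) -> Iterator[str]:
        """Yield only the text of each matching record."""
        for record in self.records(match=match, limit=limit):
            yield record["text"]
```

With something like this, subsetting by speaker would be `reader.texts(match=lambda r: r["speaker"] == "Sanders")`, assuming the records carry a `"speaker"` field. Taking an arbitrary predicate keeps the reader independent of any one corpus's metadata schema.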

bdewilde avatar Jul 20 '16 15:07 bdewilde

I maintain a machine-readable list of textual corpora that was originally created for use in the command-line program corpus-downloader. At the moment there's not too much there, and it skews heavily toward literary and linguistic corpora, but feel free to grab any of the corpora listed there, or to submit a pull request with any more you can think of.

JonathanReeve avatar Jan 27 '17 20:01 JonathanReeve

@bdewilde Not sure if OA-STM is an option. But I think this would be cool to implement or at least reference. Hat tip @stephenhky. Let me know if you want me to look into it further - and I will learn how to get into textacy.

gryBox avatar Feb 05 '17 05:02 gryBox

Hi @gryBox, thanks for the pointer to OA-STM. Looks like there's a lot of interesting data beyond just the plain text! I think it would be great to have a reader class in textacy.corpora. There are a few examples in there that you can use for reference, but I'm open to new ideas and APIs.

Thanks also to @JonathanReeve for the lit-oriented corpus list. textacy has readers for political (the congressional record), legal (supreme court decisions), reference article (wikipedia), and social media (reddit) text, but no literature. As soon as I have time, I'll see about pulling a good one in. Or feel free to submit a PR, I promise a prompt review! :)

bdewilde avatar Feb 07 '17 18:02 bdewilde

Thanks @gryBox for the recommendations. I am in need of a text corpus too!

stephenhky avatar Feb 07 '17 22:02 stephenhky

Quora released a dataset of question pairs (albeit in CSV format). It would be a nice example for showing the distance / similarity metrics and also NER. (And for getting traffic, given the Kaggle contest and public attention on the topic.) https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

ddofer avatar Mar 23 '17 15:03 ddofer

Hey @ddofer, thanks for the ping. I've been meaning to play with this dataset, but have been dragging my feet / waiting for spacy v2.0. (Have you seen this blog post by the spacy folks?) I'm slowly working myself up to adding NN models into textacy, and this dupe question dataset is definitely something to consider including as an example corpus. Will keep you posted here...

bdewilde avatar Mar 23 '17 15:03 bdewilde

I've definitely seen the spacy post (and its sequel). I've been trying to use textacy on the challenge, though it's tricky given the format and how feature engineering works here.

ddofer avatar Mar 23 '17 15:03 ddofer

I am unsure if this is the correct place, but it seems like you folks care about training corpora. Are there any good tutorials on creating one's own? I want to build a training corpus from Project Gutenberg, but I don't know where to begin, or how to convert it into a format that textacy will understand.

andyhappy1 avatar Oct 16 '17 18:10 andyhappy1