nltk_data
How can I contribute a corpus?
I'd like to contribute a corpus. What format does the corpus need to be in? Does it need to be POS tagged?
It's a corpus of about 65K books from the British Library. Currently they're only XML files, but I'm working on getting them into plaintext as well. You can see a few sample files at https://github.com/Git-Lit/git-lit/tree/master/data2. It's about 1TB, or 250GB compressed, so it won't fit in this GitHub repo. However, I'm making GitHub repositories for each text in the corpus, so all that would be needed is a way for nltk.download() to grab each text in this corpus, given a URL for each one. What would be the best way of doing that?
@JonathanReeve: nice idea. As you point out, it would require an extension to the downloader. Something like this has already been proposed in https://github.com/nltk/nltk/issues/59. We don't have resources to do this, but it would be a very welcome contribution. Best to discuss it on the nltk-dev mailing list.
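Until the downloader supports something like this, a rough workaround might be to fetch the texts yourself and point a corpus reader at the resulting directory. The sketch below is only an illustration using the Python standard library: `fetch_texts` is a hypothetical helper, not part of NLTK, and it assumes each URL points directly at a single text file whose name is the last path segment of the URL.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_texts(urls, dest_dir):
    """Download each URL into dest_dir and return the list of local paths.

    Hypothetical helper for illustration; not part of NLTK.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Name the local file after the last path segment of the URL.
        name = Path(urlparse(url).path).name
        target = dest / name
        if not target.exists():  # skip files fetched on an earlier run
            with urllib.request.urlopen(url) as response:
                target.write_bytes(response.read())
        saved.append(target)
    return saved
```

Once the files are on disk, the directory could be read with `nltk.corpus.reader.PlaintextCorpusReader`, passing the directory as the root and a pattern such as `r".*\.txt"` for the file IDs. This does not address the size of the full corpus, which is the part that would need real downloader support.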