Alternate mirroring of nltk_data on Zenodo

Open alvations opened this issue 8 years ago • 1 comments

One proposal for handling nltk/nltk#1787 is to find an alternative hosting site that can act as a content distribution network and handle high-frequency requests appropriately.

After some playing around, it's possible to mirror all nltk_data packages on Zenodo and to upload/update them automatically, e.g.

import requests
import xml.etree.ElementTree as ElementTree
import json

# Use a personal access token from Zenodo, see https://zenodo.org/account/settings/applications/tokens/new/
ACCESS_TOKEN = '...'

# Downloads and reads index.xml.
index_url = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'
nltk_index = ElementTree.fromstring(requests.get(index_url).content).find('packages')

pathto_nltkdata_locally = '/Users/alvas/nltk_data/'

for package in nltk_index.findall('package'):
    # Gets the package meta-data from index.xml
    nltk_json = package.attrib
    
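    # Lists the existing depositions as a sanity check that the token works.
    # The response is not used below, so this could be hoisted out of the loop.
    r = requests.get('https://zenodo.org/api/deposit/depositions',
                     params={'access_token': ACCESS_TOKEN})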

    # Creates an empty deposition and gets its Zenodo ID for this package.
    headers = {"Content-Type": "application/json"}
    r = requests.post('https://zenodo.org/api/deposit/depositions',
                      params={'access_token': ACCESS_TOKEN}, json={},
                      headers=headers)
    deposition_id = r.json()['id']

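    # Finds the path to the package locally. The slice drops the 67-character
    # GitHub prefix (up to and including 'packages/'), leaving e.g.
    # 'corpora/europarl_raw.zip'.
    package_location = pathto_nltkdata_locally + nltk_json['url'][67:]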
    
    # Uploads the file to Zenodo; the with-block closes the file handle afterwards.
    with open(package_location, 'rb') as fin:
        r = requests.post('https://zenodo.org/api/deposit/depositions/%s/files' % deposition_id,
                          params={'access_token': ACCESS_TOKEN}, files={'file': fin})
    
    # Add the metadata.
    data = {'metadata': {
                        'title': nltk_json['name'],
                        'description': json.dumps(nltk_json),
                        'upload_type': 'dataset',
                        }
            }
    r = requests.put('https://zenodo.org/api/deposit/depositions/%s' % deposition_id,
                     params={'access_token': ACCESS_TOKEN}, data=json.dumps(data),
                     headers=headers)
    
    # Publishes the deposition.
    r = requests.post('https://zenodo.org/api/deposit/depositions/%s/actions/publish' % deposition_id,
                      params={'access_token': ACCESS_TOKEN})

We can add these links to the index.xml and change the code in nltk/downloader.py to look for an alternative mirror when the GitHub raw content link fails, e.g.

<package 
author="Philipp Koehn, University of Edinburgh" checksum="7621d5675990b1decc012c823716ee76" 
id="europarl_raw" 
name="Sample European Parliament Proceedings Parallel Corpus" size="12594977" 
subdir="corpora" 
unzip="1" unzipped_size="41396100" 
url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/europarl_raw.zip" 
url2="https://zenodo.org/record/123456/files/europarl_raw.zip"
webpage="http://www.statmt.org/europarl" />
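
The fallback lookup could be sketched roughly as follows. This is only an illustration, not the actual downloader code: `candidate_urls`, `fetch_with_fallback`, and the injected `fetch` callable are hypothetical names, and `url2` is the attribute proposed above, not one that exists in index.xml today.

```python
import xml.etree.ElementTree as ElementTree


def candidate_urls(package):
    """Return the download URLs of a <package> element, primary first.

    Assumes mirrors are stored in attributes named url, url2, url3, ...
    (only url exists in the current index.xml; url2 is the proposal).
    """
    urls = []
    if 'url' in package.attrib:
        urls.append(package.attrib['url'])
    n = 2
    while 'url%d' % n in package.attrib:
        urls.append(package.attrib['url%d' % n])
        n += 1
    return urls


def fetch_with_fallback(urls, fetch):
    """Try each URL in order and return (url, content) of the first success.

    fetch is any callable that returns the content or raises an exception,
    for example a thin wrapper around requests.get.
    """
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except Exception as err:  # a real downloader would catch narrower errors
            errors.append((url, err))
    raise IOError('All mirrors failed: %r' % errors)
```

Passing `fetch` in as a parameter keeps the retry logic independent of the HTTP library, so it can be exercised without network access.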

Or perhaps have a separate index-github.xml and index-zenodo.xml.
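
If separate index files were preferred, a mirror index could be derived from the existing one along these lines. This is a minimal sketch under assumptions: `make_mirror_index` is a hypothetical helper, and the mapping from package id to Zenodo URL is assumed to have been collected during the upload step.

```python
import copy
import xml.etree.ElementTree as ElementTree


def make_mirror_index(index_root, mirror_urls):
    """Return a copy of the index whose package URLs point at the mirror.

    mirror_urls maps a package id to its mirror URL. Packages that have not
    been mirrored yet keep their original url. The input tree is not modified.
    """
    mirrored = copy.deepcopy(index_root)
    for package in mirrored.iter('package'):
        new_url = mirror_urls.get(package.attrib.get('id'))
        if new_url is not None:
            package.set('url', new_url)
    return mirrored
```

The result could then be written out with `ElementTree.ElementTree(mirrored).write('index-zenodo.xml')`, leaving the GitHub index untouched.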

The code snippet above needs to be cleaned up, and we also need to track which packages have already been uploaded to Zenodo so that we can update them when necessary.
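
One way to do that tracking, sketched under assumptions, is to keep a small manifest keyed by package id and compare its checksums against index.xml on each run. The manifest format and the names `plan_uploads` and `record_upload` are hypothetical; the `id` and `checksum` fields are the attributes already present in index.xml.

```python
def plan_uploads(index_packages, manifest):
    """Split packages into new ones and ones whose checksum has changed.

    index_packages: iterable of dicts with at least 'id' and 'checksum'
                    (e.g. package.attrib from index.xml).
    manifest:       dict mapping package id to
                    {'checksum': ..., 'deposition_id': ...},
                    recorded after earlier uploads.
    """
    to_create, to_update = [], []
    for attrib in index_packages:
        seen = manifest.get(attrib['id'])
        if seen is None:
            to_create.append(attrib['id'])
        elif seen['checksum'] != attrib['checksum']:
            to_update.append(attrib['id'])
    return to_create, to_update


def record_upload(manifest, attrib, deposition_id):
    """Remember what was uploaded so that later runs can skip or update it."""
    manifest[attrib['id']] = {'checksum': attrib['checksum'],
                              'deposition_id': deposition_id}
    return manifest
```

The manifest is a plain dict, so it could be stored as a JSON file next to the upload script and reloaded at the start of each run.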

But is this a viable idea? Suggestions? Alternative proposals?

alvations, Aug 04 '17 08:08

@alvations: sorry for the delay. This looks like a great suggestion. Why don't we incorporate it experimentally as a backup option, and then, if it seems stable, switch to it as the default?

stevenbird, Oct 22 '17 20:10