Alternate mirroring of nltk_data on Zenodo
One proposal to handle nltk/nltk#1787 is to find an alternative site that can act as a content distribution network and handle high-frequency requests appropriately.
After some experimenting, I found that it's possible to mirror all nltk_data packages on Zenodo and to upload/update them automatically, e.g.
import json
import xml.etree.ElementTree as ElementTree

import requests

# Use the access token from Zenodo, see https://zenodo.org/account/settings/applications/tokens/new/
ACCESS_TOKEN = '...'
ZENODO_API = 'https://zenodo.org/api/deposit/depositions'
params = {'access_token': ACCESS_TOKEN}
headers = {"Content-Type": "application/json"}

# Download and parse index.xml.
index_url = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'
nltk_index = ElementTree.fromstring(requests.get(index_url).content).find('packages')

pathto_nltkdata_locally = '/Users/alvas/nltk_data/'
# Package URLs in index.xml start with this 67-character prefix.
url_prefix = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/'

for package in nltk_index.findall('package'):
    # Get the package metadata from index.xml.
    nltk_json = package.attrib
    # Create an empty deposition to get a new Zenodo ID for this package.
    r = requests.post(ZENODO_API, params=params, json={}, headers=headers)
    r.raise_for_status()
    deposition_id = r.json()['id']
    # Find the path to the package locally, e.g. .../corpora/europarl_raw.zip
    package_location = pathto_nltkdata_locally + nltk_json['url'][len(url_prefix):]
    # Upload the file to Zenodo; the with-block closes the file afterwards.
    with open(package_location, 'rb') as fin:
        r = requests.post('%s/%s/files' % (ZENODO_API, deposition_id),
                          params=params, files={'file': fin})
    r.raise_for_status()
    # Add the metadata.
    data = {'metadata': {
        'title': nltk_json['name'],
        'description': json.dumps(nltk_json),
        'upload_type': 'dataset',
    }}
    r = requests.put('%s/%s' % (ZENODO_API, deposition_id),
                     params=params, data=json.dumps(data), headers=headers)
    r.raise_for_status()
    # Publish the deposition.
    r = requests.post('%s/%s/actions/publish' % (ZENODO_API, deposition_id),
                      params=params)
    r.raise_for_status()
We can add these links to index.xml and change the code in nltk/downloader.py to fall back to an alternative mirror when the GitHub raw content links fail, e.g.
<package
author="Philipp Koehn, University of Edinburgh" checksum="7621d5675990b1decc012c823716ee76"
id="europarl_raw"
name="Sample European Parliament Proceedings Parallel Corpus" size="12594977"
subdir="corpora"
unzip="1" unzipped_size="41396100"
url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/europarl_raw.zip"
url2="https://zenodo.org/record/123456/files/europarl_raw.zip"
webpage="http://www.statmt.org/europarl" />
Or perhaps have separate index-github.xml and index-zenodo.xml files.
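To make the url2 fallback idea concrete, here is a minimal sketch of how a downloader could try the GitHub link first and then the mirror. It is not the actual nltk.downloader code: the function names (candidate_urls, fetch, download_with_fallback) are hypothetical, and it assumes mirrors are stored as url2, url3, ... attributes on the package element, as in the example above.

```python
import urllib.request


def candidate_urls(package_attrib):
    """Return the primary URL followed by any mirrors (url2, url3, ...)."""
    urls = [package_attrib['url']]
    n = 2
    while 'url%d' % n in package_attrib:
        urls.append(package_attrib['url%d' % n])
        n += 1
    return urls


def fetch(url, timeout=30):
    """Default fetcher: return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(package_attrib, fetcher=fetch):
    """Try each candidate URL in order; return (url, data) for the first success."""
    errors = []
    for url in candidate_urls(package_attrib):
        try:
            return url, fetcher(url)
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            errors.append((url, exc))
    raise IOError('All mirrors failed: %r' % errors)
```

The fetcher is passed in as a parameter so the fallback logic can be exercised without network access. A real integration would probably also verify the downloaded bytes against the checksum attribute before accepting a mirror's copy.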
The code snippet above needs to be cleaned up, and we also need to track which packages have already been uploaded to Zenodo so that we can update them when necessary.
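One possible way to do that tracking, sketched under assumptions: keep a small JSON manifest that maps each package id to its Zenodo deposition ID and the checksum from index.xml, and only upload when the package is new or its checksum has changed. The manifest layout and the helper names below are hypothetical, not an existing NLTK or Zenodo feature.

```python
import json
import os


def load_manifest(path):
    """Load the manifest, or return an empty one if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as fin:
        return json.load(fin)


def save_manifest(manifest, path):
    """Write the manifest back to disk as readable, stably ordered JSON."""
    with open(path, 'w') as fout:
        json.dump(manifest, fout, indent=2, sort_keys=True)


def needs_upload(manifest, package_attrib):
    """True if the package was never uploaded or its checksum has changed."""
    entry = manifest.get(package_attrib['id'])
    return entry is None or entry['checksum'] != package_attrib['checksum']


def record_upload(manifest, package_attrib, deposition_id):
    """Remember which deposition holds which version of the package."""
    manifest[package_attrib['id']] = {
        'deposition_id': deposition_id,
        'checksum': package_attrib['checksum'],
    }
```

In the upload loop, this would amount to skipping a package when needs_upload(...) is false, and calling record_upload(...) followed by save_manifest(...) after each successful publish.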
But is this a viable idea? Suggestions? Alternative proposals?
@alvations: sorry for the delay. This looks like a great suggestion. Why don't we incorporate it experimentally as a backup option and then, if it seems stable, switch to it as the default?