
Download all of data.gov

flyingzumwalt opened this issue 8 years ago · 12 comments

For more info about this task, what we will do with the data, and how it relates to other archival efforts, see Issue #87

Story

Jack downloads all of the datasets from data.gov (~350 TB) to storage devices on Stanford's network.

What will be Downloaded

The data.gov website is a portal that allows you to find all the "open data" datasets published by US federal agencies. It currently lists over 190,000 datasets.

The goal is to download those datasets, back them up, and use IPFS to replicate the data across a network of participating/collaborating nodes.

@mejackreed has posted all of the metadata from data.gov, which contains pointers to the datasets and basic metadata about them. The metadata are in ckan.json files. You can view the metadata at https://github.com/OpenGeoMetadata/gov.data. That repository will be the main starting point for running all of the scripts that download the datasets.
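For illustration, a minimal sketch of how a download script might walk a local clone of that repo and collect resource URLs, assuming each ckan.json follows the standard CKAN package layout (a top-level "resources" list whose entries carry a "url" field):

```python
import json
from pathlib import Path

def iter_resource_urls(repo_root):
    """Walk a local clone of OpenGeoMetadata/gov.data and yield every
    resource URL referenced by a ckan.json file."""
    for ckan_path in Path(repo_root).rglob("ckan.json"):
        with open(ckan_path, encoding="utf-8") as f:
            package = json.load(f)
        for resource in package.get("resources", []):  # assumed CKAN layout
            url = resource.get("url")
            if url:
                yield ckan_path, url
```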

flyingzumwalt avatar Jan 17 '17 17:01 flyingzumwalt

Does this really need to be >300 TB? After looking at the data, there is a lot of redundancy: the same data appears in CSV, HTML, and JSON. Does only one organization have to hold the entire 300 TB? Most of the data can be broken up by topic ('health', 'environment', 'agriculture') and is composed of heterogeneous files (typically a few hundred MB per file). The metadata describing the data would be the most important part (Publisher, Identifier, modified date, etc.).

jonnycrunch avatar Jan 17 '17 19:01 jonnycrunch

We have the ckan metadata already. And yes, I agree some of the data is redundant, based on how ArcGIS OpenData allows for different types of exports. A smarter heuristic for this would be nice, but may take some more analysis time.

mejackreed avatar Jan 17 '17 20:01 mejackreed

@mejackreed do you think you will need help writing the download scripts or running them? We can probably find people to help you.

flyingzumwalt avatar Jan 18 '17 18:01 flyingzumwalt

Sure thing. Help definitely wanted! I already have a naive downloader here: https://github.com/mejackreed/GovScooper/blob/master/README.md#usage

mejackreed avatar Jan 18 '17 19:01 mejackreed

cc @jbenet @gsf @b5

flyingzumwalt avatar Jan 18 '17 19:01 flyingzumwalt

Happy to help!

I think it makes sense to first decide whether to download in passes, using metadata to cut down on data redundancy (as per @jonnycrunch's suggestion), or to just grab the whole thing wholesale. I'd personally vote for the "passes" approach, but only after first checking that the data is truly redundant.

b5 avatar Jan 18 '17 19:01 b5

Yep, I have an idea on how to evaluate whether or not the data is redundant. Resources that come from a server matching /arcgis.com/ and that have .geojson + .csv + .kml variants are usually just transformations of the same data. We need a way to recognize these types of datasets/resources and codify the heuristics.

An example: https://github.com/OpenGeoMetadata/gov.data/blob/8f440134f13e7559086e7a07b8081098198c9a18/ad/01/6d/50/3d/38/4b/50/bc/b9/e5/62/2f/d7/c0/1b/ad016d503d384b50bcb9e5622fd7c01b/ckan.json
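A sketch of how that heuristic might be codified; the preferred-format order here is an assumption, not a settled policy:

```python
# Assumed order of preference when several exports of the same ArcGIS
# OpenData layer are listed; keep one and skip the rest.
PREFERRED_EXPORTS = [".geojson", ".csv", ".kml", ".zip"]

def deduplicate_arcgis_resources(resources):
    """Collapse arcgis.com export variants of one dataset down to a single
    resource; leave non-ArcGIS resources untouched."""
    arcgis = [r for r in resources if "arcgis.com" in r.get("url", "")]
    if len(arcgis) < 2:
        return resources  # nothing to collapse
    others = [r for r in resources if r not in arcgis]
    for ext in PREFERRED_EXPORTS:
        keep = [r for r in arcgis if r.get("url", "").lower().endswith(ext)]
        if keep:
            return others + keep[:1]
    return resources  # no recognized export format; keep everything
```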

mejackreed avatar Jan 18 '17 19:01 mejackreed

There are 194422 distinct entries in the catalog. The metadata is about 2 GB.

https://catalog.data.gov/api/3/action/package_search?rows=1&start=0
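A minimal sketch of paging through the whole catalog with those rows/start parameters, assuming the standard CKAN package_search response (result.count is the total, result.results is the current page):

```python
import requests

PACKAGE_SEARCH = "https://catalog.data.gov/api/3/action/package_search"

def iter_packages(page_size=100):
    """Yield every package in the catalog by walking rows/start pages."""
    start = 0
    while True:
        resp = requests.get(
            PACKAGE_SEARCH,
            params={"rows": page_size, "start": start},
            timeout=60,
        )
        resp.raise_for_status()
        result = resp.json()["result"]
        if not result["results"]:
            break
        yield from result["results"]
        start += page_size
        if start >= result["count"]:
            break
```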

Here is an example of one entry:
https://catalog.data.gov/api/3/action/package_show?id=1e68f387-5f1c-46c0-a0d1-46044ffef5bf

Each entry has a resource list.

A first pass could be to hit all of the URLs in each resource and grab the 'Content-Length' headers to calculate the exact amount of space needed, while simultaneously gathering all of the necessary resource URLs.
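A minimal sketch of that first pass, using HEAD requests and tracking any URLs that omit Content-Length:

```python
import requests

def estimate_total_size(resource_urls):
    """Issue a HEAD request per resource and sum the Content-Length headers.

    Returns (total_bytes, urls_missing_length) so the unknown ones can be
    handled in a later pass."""
    total_bytes = 0
    missing_length = []
    for url in resource_urls:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=30)
            length = resp.headers.get("Content-Length")
            if length is None:
                missing_length.append(url)
            else:
                total_bytes += int(length)
        except requests.RequestException:
            missing_length.append(url)
    return total_bytes, missing_length
```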

There are also some metadata schema resources referenced in the 'extras' section that would be important to grab: https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld

jonnycrunch avatar Jan 18 '17 19:01 jonnycrunch

Mmm, there were 194422 entries a moment ago, and now there are only 194401. Now I understand the urgency!

jonnycrunch avatar Jan 18 '17 20:01 jonnycrunch

+1 for hitting all resources for Content-Length. I'd add grabbing the filetype while we're at it. Quick browsing showed some of the resources listed were .zip archives (ugh).

b5 avatar Jan 18 '17 20:01 b5

So in my initial tests of downloading these resources, many of them unfortunately do not return a Content-Length header. Hoping to kick off some larger runs this afternoon to get more details.
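One possible fallback, just a sketch and untested against these servers: open a streamed GET and read only the response headers, since some servers that omit Content-Length on HEAD do include it on GET; the body is never consumed, so the file is not actually downloaded:

```python
import requests

def content_length_via_get(url, timeout=30):
    """Fallback size check: streamed GET so only the headers are read.

    Returns the Content-Length as an int, or None if the server still
    doesn't report one."""
    with requests.get(url, stream=True, allow_redirects=True, timeout=timeout) as resp:
        resp.raise_for_status()
        length = resp.headers.get("Content-Length")
        return int(length) if length is not None else None
```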

mejackreed avatar Jan 18 '17 20:01 mejackreed

@jonnycrunch 194014 entries here: https://github.com/OpenGeoMetadata/gov.data

Best to grab the archive.zip; the layers.json file is easy to parse.
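A small sketch of using it, assuming layers.json maps each layer id to its directory inside the repo (which is where that layer's ckan.json lives):

```python
import json
from pathlib import Path

def iter_ckan_paths(repo_root):
    """Yield (layer_id, path-to-ckan.json) pairs listed in layers.json."""
    repo_root = Path(repo_root)
    with open(repo_root / "layers.json", encoding="utf-8") as f:
        layers = json.load(f)  # assumed: {layer_id: relative directory}
    for layer_id, rel_dir in layers.items():
        ckan_path = repo_root / rel_dir / "ckan.json"
        if ckan_path.exists():
            yield layer_id, ckan_path
```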

mejackreed avatar Jan 18 '17 20:01 mejackreed