
NASA Scraper List

Open max-mapper opened this issue 8 years ago • 14 comments

If you are looking for data to scrape, here are some NASA acronyms to get you started:

  • https://data.nasa.gov (we learned yesterday that everything on data.nasa.gov is also on data.gov)
  • GCMD
  • Echo
  • CMR
  • DAACS
  • OpenDAP
  • NSIDC
  • EOSDIS

NASA data locations: Goddard, Huntsville, Oak Ridge, JPL (LA), Ames

Our goal with these is to:

  • Find out where NASA public data is listed in an API or scraper-accessible place
  • Find out if anyone has already grabbed the metadata from these places
  • If you can't find the complete metadata in one downloadable archive/backup, write a scraper
  • Your scraper should list all the datasets and create 1 JSON entry per item, ideally with URL links to the directly downloadable raw data
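
As a rough illustration of the "1 JSON entry per item" idea, here is a minimal Python sketch that writes one JSON object per line. The field names (`title`, `url`, `resources`) and the example.org URLs are placeholders chosen for this example, not an agreed schema.

```python
import io
import json


def make_entry(title, landing_url, resource_urls):
    """Build one metadata record for a dataset (field names are illustrative)."""
    return {
        "title": title,
        "url": landing_url,
        # One item per directly downloadable file.
        "resources": [{"url": u} for u in resource_urls],
    }


def write_jsonl(entries, fp):
    """Write one JSON object per line to an open text file."""
    for entry in entries:
        fp.write(json.dumps(entry, sort_keys=True) + "\n")


if __name__ == "__main__":
    # Placeholder dataset; a real scraper would yield these from a listing.
    entries = [
        make_entry(
            "Example dataset",
            "https://example.org/dataset/1",
            ["https://example.org/dataset/1/file.csv"],
        )
    ]
    buf = io.StringIO()
    write_jsonl(entries, buf)
    print(buf.getvalue(), end="")
```

Keeping one record per line means the output can be appended to while the scraper runs and processed line by line afterwards.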

Comment below if you are working on one of these repositories

max-mapper avatar Mar 04 '17 18:03 max-mapper

https://earthdata.nasa.gov/nasa-data-policy

You may need an Earthdata login to access some of the data; registration is free.

Also, here is a list of all the FTP and HTTP servers from data.gov, which includes many NASA FTP servers: https://gist.github.com/maxogden/9885244926c1ab576287ff5047dd0e5f

max-mapper avatar Mar 04 '17 18:03 max-mapper

Working on Goddard Space Flight Center. Mr. Google sent me here:

https://daac.gsfc.nasa.gov/

And they encourage wget!!

https://disc.gsfc.nasa.gov/recipes/?q=recipes/How-to-Download-Data-Files-from-HTTP-Service-with-wget

I can code this up ... do we want to put it up on a server somewhere?

znmeb avatar Mar 04 '17 19:03 znmeb

https://genelab-data.ndc.nasa.gov/genelab/projects

A very nice database for genetic research done IN SPACE!

jmicrobe avatar Mar 04 '17 20:03 jmicrobe

Sam and I are doing NSIDC

sckott avatar Mar 04 '17 20:03 sckott

The Earth Sciences Level 1 and Atmosphere Archive and Distribution System (LAADS) DAAC has archived all of its data on both FTP and HTTP sites: ftp://ladsweb.modaps.eosdis.nasa.gov and https://ladsweb.nascom.nasa.gov/archive

A useful README describing the data and how to access it is here: https://ladsweb.nascom.nasa.gov/archive/README

samavar14 avatar Mar 04 '17 20:03 samavar14

Actually, it looks like all the DAACs' data is contained in the Common Metadata Repository: https://wiki.earthdata.nasa.gov/display/CMR/CMR+Client+Partner+User+Guide. Based on this, would we only need one scraper to pull all the data from this system?

samavar14 avatar Mar 04 '17 21:03 samavar14

I've got dibs on crawling https://opendap.larc.nasa.gov/opendap/ 🚀

shawnbot avatar Mar 06 '17 01:03 shawnbot

I took a look at the CMR page and started parsing the metadata provided at https://cmr.sit.earthdata.nasa.gov/search/collections.json.

I put together a script that traces the files linked there with curl and outputs their final location after redirects: https://gist.github.com/crhallberg/eebc86dd74ec36e9f2f522ac1559cb7b.

That's just the bare-bones version. I also have one that does a lot more (saves collections.json, separates files into data, webpage, and broken, has status output) if needed.
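
For anyone who would rather trace redirects in Python than with curl, here is a rough sketch of the same idea. It is not the code from the gist: the function names `head` and `trace_redirects` and the `max_hops` limit are made up for this example, and it assumes the servers answer HEAD requests.

```python
import urllib.error
import urllib.parse
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError


def head(url, timeout=10):
    """Send one HEAD request; return (status, Location header or None)."""
    opener = urllib.request.build_opener(_NoRedirect)
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def trace_redirects(url, fetch=head, max_hops=10):
    """Follow Location headers by hand; return (final_url, status, hops)."""
    hops = [url]
    for _ in range(max_hops):
        status, location = fetch(url)
        if status in (301, 302, 303, 307, 308) and location:
            # Location may be relative, so resolve it against the current URL.
            url = urllib.parse.urljoin(url, location)
            hops.append(url)
        else:
            return url, status, hops
    raise RuntimeError("too many redirects for " + hops[0])
```

The `fetch` argument is there so the redirect logic can be exercised with a fake lookup table instead of the network; calling `trace_redirects(some_url)` with the default uses real HEAD requests.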

crhallberg avatar Mar 06 '17 18:03 crhallberg

@crhallberg awesomeness, do you have an idea of how many datasets are available under that collections endpoint? is each collection a big group of datasets? do you have an example of the metadata that your script produces?

max-mapper avatar Mar 06 '17 18:03 max-mapper

I'm glad you asked because I'm still very new to this. There is a LOT more info here than I thought. My initial thought was that what I was parsing was an update feed. Turns out I was on page 1 of 19,590 items. ~~I still don't know how many. A part of the documentation I just found says "You can not page past the 1 millionth item." so there is (obviously) a heck of a lot.~~

Do you have any examples of good metadata that I can aim for as I iterate on this?
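
Since the endpoint is paged, walking all of it could look something like the sketch below. This assumes the search API accepts `page_size` and `page_num` query parameters and wraps results as `feed.entry` in the JSON response; that is my reading of the CMR docs and should be checked against them. The helper names are invented for this example.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://cmr.sit.earthdata.nasa.gov/search/collections.json"


def page_url(page_num, page_size=100, base=BASE):
    """Build the URL for one page (parameter names are an assumption)."""
    query = urllib.parse.urlencode({"page_size": page_size, "page_num": page_num})
    return base + "?" + query


def entries_from(body):
    """Pull the list of collection entries out of one JSON page."""
    return json.loads(body).get("feed", {}).get("entry", [])


def iter_collections(fetch, page_size=100, max_pages=None):
    """Yield entries page by page until an empty page or max_pages is reached."""
    page_num = 1
    while max_pages is None or page_num <= max_pages:
        entries = entries_from(fetch(page_url(page_num, page_size)))
        if not entries:
            return
        yield from entries
        page_num += 1


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Something like `for entry in iter_collections(http_get, max_pages=2): ...` would then pull the first two pages; leaving `max_pages` unset keeps going until a page comes back empty.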

crhallberg avatar Mar 06 '17 21:03 crhallberg

@crhallberg hah! that's a lot of data :) if you wanna check out the data.gov metadata, the gold standard in my opinion, check out this guide i wrote last month https://github.com/jsonlines/guide. the main idea is you have a JSON object for each dataset, and that object has an array of resource URLs, one for each data file.

max-mapper avatar Mar 07 '17 19:03 max-mapper

Is this related to the tweet https://twitter.com/denormalize/status/838550043397234691 ? I was wondering if you found a solution to the parallel ftp problem.

nichoth avatar Mar 08 '17 17:03 nichoth

Update: I've identified 48,126 links. Some are invalid and some are FTP folders; I'm weeding through them now by checking headers. After I've separated the wheat links from the chaff links, I'll reconcile them with the original metadata.
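
A simplified sketch of the kind of sorting this involves, for illustration only. The bucket names and the rules (trailing slash means FTP folder, HTML content type means webpage) are assumptions made for this example and will misclassify some links.

```python
import urllib.parse


def classify_link(url, status=None, content_type=None):
    """Sort a link into 'data', 'webpage', 'ftp-folder', or 'broken'.

    status and content_type are what a header check returned for HTTP(S) links;
    FTP links are judged from the URL alone.
    """
    parts = urllib.parse.urlsplit(url)
    if parts.scheme not in ("http", "https", "ftp"):
        return "broken"
    if parts.scheme == "ftp":
        # Heuristic: an empty path or a trailing slash looks like a directory.
        if not parts.path or parts.path.endswith("/"):
            return "ftp-folder"
        return "data"
    if status is None or status >= 400:
        return "broken"
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "webpage"
    return "data"
```

For example, `classify_link("https://example.org/a.nc", 200, "application/x-netcdf")` lands in the data bucket, while the same URL with a 404 status is treated as broken.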

I will place a link here when I have a centralized place to show and tell progress: https://github.com/crhallberg/nasa-cmr-scraper.

crhallberg avatar Mar 09 '17 21:03 crhallberg

I wasn't sure where else to push this, so I just made a new repository: https://github.com/crhallberg/nasa-cmr-scraper

crhallberg avatar Mar 15 '17 16:03 crhallberg