ftptree
Crawl an FTP site and visualize file size-weighted directory tree
FTPTREE
Generates a browsable tree map of an FTP site, weighted by the amount of data in each directory. The FTP site is crawled using Scrapy.
Installation
Requires python>=2.7.
Required python modules:
- scrapy for crawling FTP sites
- squarify for laying out tree maps
- bottle for a lightweight web server
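Assuming a standard Python environment with pip available, the Python dependencies can typically be installed with:

pip install scrapy squarify bottle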
Modules used for visualization/front end:
- d3.js
- Twitter Bootstrap
- jQuery
Clone the package from GitHub, e.g.,
git clone git://github.com/laserson/ftptree.git
The web app is run directly from the root project directory using servetree.py.
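For orientation, a minimal Bottle app serving the visualization could look roughly like the sketch below. The route paths and the data/ layout here are assumptions for illustration, not necessarily what servetree.py actually does:

from bottle import Bottle, static_file, run

app = Bottle()

@app.route('/')
def index():
    # index.html holds the d3.js tree map visualization
    return static_file('index.html', root='.')

@app.route('/static/<filepath:path>')
def static_assets(filepath):
    # Bootstrap, jQuery, and d3 assets live under static/
    return static_file(filepath, root='static')

@app.route('/data/<filename>')
def tree_json(filename):
    # crawled directory trees, e.g. cdc.json (directory name is an assumption)
    return static_file(filename, root='data')

if __name__ == '__main__':
    run(app, host='localhost', port=8080)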
Usage overview
sites.json contains a JSON list of FTP site metadata objects describing which
sites to include in the visualization.
crawltree.py crawls an FTP site and generates a JSON object representation of the directory tree, including the sizes of the files; a sketch of the shape of this output follows below.
crawlsites.py is a script to crawl each site listed in sites.json.
servetree.py is the Bottle.py app that serves the visualization.
The static/ directory contains the Bootstrap files.
index.html is the main d3.js visualization.
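The exact schema of the crawled tree is not reproduced here, but a nested structure along the following lines, which is the kind of hierarchy d3 tree maps typically consume, gives an idea of the output (field names are assumptions for illustration):

{
  "name": "/",
  "children": [
    {
      "name": "pub",
      "children": [
        {"name": "README.txt", "size": 1234},
        {"name": "data.tar.gz", "size": 567890}
      ]
    }
  ]
}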
To crawl:
scrapy crawl ftptree -a config_file=sites/ncbi.json -s JOBDIR=tmp_crawl/ncbi -o crawls/ncbi.txt -t jsonlines
scrapy crawl ftptree -a config_file=sites/ucsc.json -s JOBDIR=tmp_crawl/ucsc -o crawls/ucsc.txt -t jsonlines
scrapy crawl ftptree -a config_file=sites/cdc.json -s JOBDIR=tmp_crawl/cdc -o crawls/cdc.txt -t jsonlines
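The JOBDIR setting enables Scrapy's crawl persistence, so an interrupted crawl can be resumed by re-running the same command with the same JOBDIR, e.g.:

scrapy crawl ftptree -a config_file=sites/ncbi.json -s JOBDIR=tmp_crawl/ncbi -o crawls/ncbi.txt -t jsonlines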
How to crawl an FTP tree
The crawltree.py script parses the results of an FTP LIST command. The format of these results differs from server to server: the listing format can be "unix", "windows", or "mlsd". The MLSD command is preferred because it returns pre-parsed, machine-readable file information.
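To illustrate why the format matters, here is a rough sketch of how a single listing line might be parsed under each convention; the actual parsers in crawltree.py may differ:

def parse_unix(line):
    # e.g. "-rw-r--r--   1 ftp  ftp      1234 Jan 31  2013 README.txt"
    parts = line.split(None, 8)
    is_dir = parts[0].startswith('d')
    size = 0 if is_dir else int(parts[4])
    return parts[8], size, is_dir

def parse_windows(line):
    # e.g. "01-31-13  10:00AM       <DIR>          pub"
    #  or  "01-31-13  10:00AM                 1234 README.txt"
    parts = line.split(None, 3)
    is_dir = parts[2] == '<DIR>'
    size = 0 if is_dir else int(parts[2])
    return parts[3], size, is_dir

def parse_mlsd(line):
    # e.g. "type=file;size=1234;modify=20130131100000; README.txt"
    facts_str, name = line.split(' ', 1)
    facts = dict(f.split('=', 1) for f in facts_str.rstrip(';').split(';') if f)
    is_dir = facts.get('type') == 'dir'
    size = int(facts.get('size', 0))
    return name, size, is_dir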
To determine the method to use for a given FTP site, run e.g.
./crawltree.py --host ftp.cdc.gov --output data/cdc.json --test-method
which returns a sample listing and reports whether MLSD succeeded or failed. If it failed, an example listing is shown so the user can determine whether the format is Unix-like or Windows-like.
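The same check can be done by hand with the standard library's ftplib, roughly like this (a sketch, not the actual implementation of --test-method):

from ftplib import FTP, error_perm

ftp = FTP('ftp.cdc.gov')
ftp.login()  # anonymous login

lines = []
try:
    # MLSD returns machine-readable "fact=value;" lines
    ftp.retrlines('MLSD', lines.append)
    print('MLSD supported; sample:')
except error_perm:
    # Fall back to LIST, whose output format is server-dependent
    lines = []
    ftp.retrlines('LIST', lines.append)
    print('MLSD not supported; sample LIST output:')
print('\n'.join(lines[:5]))
ftp.quit()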
Once the appropriate listing method is determined, a typical crawl command looks like this:
./crawltree.py --host ftp.cdc.gov --output data/cdc.json --method windows
You can specify where to start the crawl by adding a --root path/to/root
option, e.g.,
./crawltree.py --host hgdownload.cse.ucsc.edu --root goldenPath --output data/ucscgb.json --method mlsd
OLD OLD OLD
Crawled FTP sites
- CDC: ./crawltree.py --host ftp.cdc.gov --output data/cdc.json --method windows
- NCBI: ./crawltree.py --host ftp.ncbi.nih.gov --output data/ncbi.json --method mlsd
- 1kg: ./crawltree.py --host ftp-trace.ncbi.nih.gov --root 1000genomes/ftp --output data/1kg.json --method mlsd
- Genome Browser: ./crawltree.py --host hgdownload.cse.ucsc.edu --root goldenPath --output data/ucscgb.json --method mlsd
Other FTP sites:
- ftp://ftp.ncbi.nlm.nih.gov/sra
- ftp://ftp.fcc.gov/
- ftp://ftp.rma.usda.gov/pub/
- ftp://ftp.epa.gov/
- ftp://ftp.fsa.usda.gov/
- ftp://ftp.ngdc.noaa.gov/
- ftp://tgftp.nws.noaa.gov/
- ftp://ftp.ncdc.noaa.gov/pub
- ftp://ftp.cdc.noaa.gov/
- ftp://ftp2.census.gov
- ftp://emi.nasdaq.com/
- ftp://ftp.nasdaqtrader.com/
- ftp://ftp.resource.org/
- ftp://ftp.uspto.gov/pub/
- ftp://ftp.eia.doe.gov/
- ftp://ftp.broadinstitute.org/pub
- ftp://ftpext.usgs.gov/pub/