cc.py
cc.py copied to clipboard
Extracting URLs of a specific target based on the results of "commoncrawl.org"
cc.py
Extracting URLs of a specific target based on the results of "commoncrawl.org".
Updated to v.0.3 | Whats new:
- 65% faster proceeding
- Specify a year via
-y/--year, e.g.:-y 2018 - Specify an output file via
-o/--out, e.g.:-o whatever.txt - Crawl all pages for a specific index
-i/--index, e.g.:cc.py army.mil -i CC-MAIN-2018-05 - List all available indexes
-l/--list, e.g.:cc.py army.mil -l
ToDo
- [x] Crawl for a specific index
- [ ] Implementation of multithreading
- [ ] Allowing a range of years as input
- [ ] Implementing
direct-grep - [ ] Temporary file-writing
Usage
cc.py [-h] [-y YEAR] [-o OUT] [-l] [-i INDEX] [-u] domain
positional arguments:
domain domain which will be crawled for
optional arguments:
-h, --help show this help message and exit
-y YEAR, --year YEAR limit the result to a specific year (default: all)
-o OUT, --out OUT specify an output file (default: domain.txt)
-l, --list Lists all available indexes
-i INDEX, --index INDEX
Crawl for a specific index (this will crawl all
pages!)
-u, --update Update index file
Example
python3 cc.py github.com -y 2018 -o github_18.txt
cat github_18.txt | grep user
Dependencies
- Python3