process-google-dataset
Process Google Dataset is a tool for downloading and processing images for neural networks from a Google Image Search, using a Chrome extension and a simple Python script.
process-google-dataset
PGD is a toolchain that works on any operating system capable of running Chrome and Python. It places no limit on the number of images it can retrieve and download, and it does not require any subprocess calls or specific configuration.
Google Extension
Requirements
- Google Chrome
Installation
Option 1
- Download the CRX file from the latest release.
- Type chrome://extensions/ in the Chrome browser top bar.
- Toggle the Developer Mode switch on from the top right corner.
- Drag and drop the CRX file to the middle of the window.
Option 2
- Clone the process-google-dataset repo.
- Type chrome://extensions/ in the Chrome browser top bar.
- Toggle the Developer Mode switch on from the top right corner.
- Click Load Unpacked and select the cloned repo root directory.
How to use
- Navigate to https://images.google.com.
- Search for the dataset keyword (e.g. car).
- To get more data, keep scrolling to the bottom of the search page so that more results load. The tool will retrieve all the data it can see.
- Find the extension logo at the top right corner and click "Parse and Download Metadata".
- A JSON file will be downloaded to the "Downloads" directory.
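Before handing the downloaded file to the Python downloader, it can be useful to confirm that it is valid JSON. The sketch below is only an illustration: the schema of the metadata file is not described here, so the hypothetical helper `summarize_metadata` reports only the top-level type and number of entries, and assumes nothing about field names.

```python
import json
from pathlib import Path


def summarize_metadata(json_path):
    """Load a metadata file and describe its top-level structure.

    Hypothetical helper, not part of the project. It makes no
    assumption about the keys or fields stored in the file.
    """
    with Path(json_path).open(encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError if the file is invalid

    # Lists and dicts have a meaningful length; other JSON values do not.
    if isinstance(data, (list, dict)):
        return type(data).__name__, len(data)
    return type(data).__name__, None
```

Calling `summarize_metadata` with the path of the file in the "Downloads" directory returns a pair such as the container type and its size, which is a quick way to check that scrolling further actually produced more entries.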
Python Downloader
Requirements
- Python 3
How to use
Example
python3 download.py --json-path /path/to/downloaded/json/file/from/extension/ --label cars --output-dir /path/to/output/directory
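For readers curious how a command line like the one above maps onto values inside a script, here is a minimal `argparse` sketch. It is not the project's actual `download.py`: the function name `build_parser`, the choice to mark three flags as required, the `None` default for `--timeout`, and the placeholder path `/path/to/metadata.json` are all assumptions made for illustration.

```python
import argparse
import shlex


def build_parser():
    """Build a parser for the four documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="download.py")
    parser.add_argument("--label", required=True,
                        help="Name of the subdirectory/label that describes the data.")
    parser.add_argument("--json-path", required=True,
                        help="Path to the JSON file downloaded from the extension.")
    parser.add_argument("--output-dir", required=True,
                        help="Directory in which the label directory is created.")
    # Optional flag; using None as the default is an assumption of this sketch.
    parser.add_argument("--timeout", type=float, default=None,
                        help="Seconds to wait before moving on to the next image.")
    return parser


# Parse an example argument string (placeholder paths) instead of sys.argv.
example = "--json-path /path/to/metadata.json --label cars --output-dir /path/to/output/directory"
args = build_parser().parse_args(shlex.split(example))
print(args.label, args.json_path, args.output_dir, args.timeout)
```

Note that `argparse` converts dashes in flag names to underscores, so `--json-path` becomes `args.json_path` and `--output-dir` becomes `args.output_dir`.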
All options
--label
Name of subdirectory/label that describes data.
--json-path
Path to JSON file downloaded from extension.
--output-dir
Parent directory in which a new subdirectory, named after the label, will be created to store the images.
--timeout
(OPTIONAL) Timeout in seconds after which the downloader will move on to the next image.
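The behaviour described by `--output-dir`, `--label`, and `--timeout` can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the project's implementation: `target_directory`, `fetch`, and `DEFAULT_TIMEOUT` are hypothetical names, the 10-second default is invented, and how image URLs are read from the metadata JSON is deliberately left out because its schema is not documented here.

```python
import http.client
import urllib.request
from pathlib import Path

DEFAULT_TIMEOUT = 10.0  # seconds; an assumed value, not the tool's documented default


def target_directory(output_dir, label):
    """Return output_dir/label, creating the directory if it does not exist."""
    target = Path(output_dir) / label
    target.mkdir(parents=True, exist_ok=True)
    return target


def fetch(url, destination, timeout=DEFAULT_TIMEOUT):
    """Save one URL to destination.

    Returns False on a timeout, network error, or malformed URL so that
    the caller can simply move on to the next image.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            Path(destination).write_bytes(response.read())
        return True
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers strings that are not valid URLs.
        return False
```

A caller would create the label directory once with `target_directory(output_dir, label)` and then call `fetch` for each image URL, skipping any that return `False`. Returning a boolean rather than raising keeps a single slow or broken image from stopping a long download run.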