googleart_scraper

Scrape images from googleart

Google Art Project scraper

This is a Scrapy crawler for the Google Art Project.
Use it at your own risk: Google may ban you for scraping.
Scrape politely.
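As a rough illustration of what polite scraping can mean in practice, the fragment below lists conservative values for some standard Scrapy settings. The setting names are Scrapy's own; the values are illustrative assumptions and are not taken from this project's configuration.

```python
# settings.py (sketch): conservative Scrapy settings for polite crawling.
# The values are illustrative assumptions, not this project's configuration.

ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
DOWNLOAD_DELAY = 2.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # only one request in flight per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 2.0       # initial AutoThrottle delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0        # upper bound on the delay under high latency
```

A lower request rate makes the crawl slower but reduces the load on the server and the chance of being blocked.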

van Gogh, Self-Portrait
van Gogh, Portrait of Joseph Roulin

Description

The crawler does the following:

  1. Parses the page listing all artists.
  2. Parses the page of every individual artist and extracts information about them (for example, Rembrandt).
  3. Extracts the list of artworks for every artist and downloads the images at a maximum resolution of 512x512.
  4. Parses the page of every individual artwork and extracts information about it (for example, van Gogh, Self-Portrait).
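One way the 512x512 limit in step 3 could be applied is sketched below. It assumes the image host honours the `=s<N>` size suffix that is commonly used on Google-hosted image URLs. Both that assumption and the helper name `sized_image_url` are illustrative; the crawler itself may request the size differently.

```python
def sized_image_url(image_url: str, max_size: int = 512) -> str:
    """Return image_url with a trailing `=s<max_size>` size hint.

    Assumption: the host scales the image so that its longest side is at
    most max_size pixels when this suffix is present.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    # Drop any existing size suffix. This assumes the URL path itself
    # contains no "=" character.
    base = image_url.split("=", 1)[0]
    return f"{base}=s{max_size}"


if __name__ == "__main__":
    # The URL below is a made-up placeholder, not a real asset.
    print(sized_image_url("https://lh5.ggpht.com/EXAMPLE_ID"))
```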

All the collected data is stored in MongoDB.
Images are stored on disk.
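The storage step might look roughly like the sketch below. It assumes a pymongo-style collection object (anything with an `update_one` method), uses `image_id` as the unique key, and saves images as `<image_id>.jpg`. The function names, the key choice, and the file extension are assumptions made for illustration, not the project's actual pipeline.

```python
from pathlib import Path
from typing import Any, Dict


def store_artwork(collection: Any, item: Dict[str, Any]) -> None:
    """Insert or update one artwork document, keyed by its image_id.

    `collection` is expected to behave like pymongo's Collection.
    Using an upsert means re-running the crawl does not create duplicates.
    """
    collection.update_one(
        {"image_id": item["image_id"]},  # match an existing document by id
        {"$set": item},                  # overwrite or add the scraped fields
        upsert=True,                     # insert if no document matched
    )


def image_path(images_dir: str, image_id: str) -> Path:
    """Return the on-disk location for an image (the .jpg extension is assumed)."""
    return Path(images_dir) / f"{image_id}.jpg"
```

With pymongo installed, a collection could be obtained with something like `MongoClient()["googleart"]["artworks"]`, where the database and collection names are hypothetical.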

One database entry for an artwork looks like this (some artworks can have more fields):

{
   "image_id" : "oz_mural-millerntor-gallery_kAFxJz3YnrJ4rw",
   "artist_name_extra" : "OZ",
   "title" : "Mural MILLERNTOR GALLERY",
   "artwork_slug" : "mural-millerntor-gallery_kAFxJz3YnrJ4rw",
   "image_url" : "https://lh5.ggpht.com/OT7x4vhUw28Nb7mtvO8h017Pgn-S0B-x2Q9Dej9p8I0SdD-U46UrODnMsRZQJw",
   "page_url" : "https://www.google.com/culturalinstitute/beta/asset/mural-millerntor-gallery/kAFxJz3YnrJ4rw",
   "artist_slug" : "oz",
   "artist_id" : "t2194x6zy1b",
   "dimensions" : "spray paint on wall, ca. 6.0 x 3.0m",
   "location_created" : "St. Pauli, Hamburg, Germany",
   "title_original" : "Mural MILLERNTOR GALLERY",
   "date" : "2012-09"
}
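In the entry above, `artwork_slug` appears to be the last two path segments of `page_url` joined by an underscore, and `image_id` appears to be `artist_slug` followed by `artwork_slug`. The sketch below reproduces that pattern. It is inferred from this single example only, and the function names are hypothetical, so the crawler may derive these fields differently.

```python
from urllib.parse import urlparse


def artwork_slug_from_url(page_url: str) -> str:
    """Join the last two path segments of an asset URL with an underscore."""
    parts = urlparse(page_url).path.rstrip("/").split("/")
    name, asset_id = parts[-2], parts[-1]
    return f"{name}_{asset_id}"


def image_id_for(artist_slug: str, artwork_slug: str) -> str:
    """Prefix the artwork slug with the artist slug."""
    return f"{artist_slug}_{artwork_slug}"


if __name__ == "__main__":
    url = ("https://www.google.com/culturalinstitute/beta/asset/"
           "mural-millerntor-gallery/kAFxJz3YnrJ4rw")
    slug = artwork_slug_from_url(url)
    print(slug)                      # mural-millerntor-gallery_kAFxJz3YnrJ4rw
    print(image_id_for("oz", slug))  # oz_mural-millerntor-gallery_kAFxJz3YnrJ4rw
```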

Dependencies

Setup

TODO:

Usage

:~$ scrapy crawl googleart