djangoscraper icon indicating copy to clipboard operation
djangoscraper copied to clipboard

Django Scrapy App

h2. Django Scraper is obsolete

Django Scraper was written for scrapy 0.7. Since then, scrapy 0.8 came out that had many improvement, many of these improvements make Django Scraper architecturally obsolete. I abandoned this project a while ago.

if you're looking for this kind of functionality, I would recommend that you look into "celery":http://ask.github.com/celery/getting-started/introduction.html in combination with latest version of scrapy. This would give you a scallable implementation of task based scraping.

h2. Django Scraper app - djangoscraper

h3. Introduction

Django Scraper app is an integration of Django Web Framework and Scrapy Web Crawling Framework. It was created to simplify scraping of large websites that contain a variety of data that needs to be extracted in different ways.

As I began working with scrapy I found it difficult to manage the complexity of the website that I was trying to scrape, because scrapy architecture requires you to have 1 spider per domain. This contraint made it difficult for me to structure the code in a clear and modular way because the code for all of the scraping tasks had to be in the same spider.

I prefer to think of spiders as having tasks. This makes it easier for me to work on specific spider functionality without involving all of the other spider tasks.

To work on this way, I introduced a concept of a spider Task. A spider task is something that a spider has to do and it produces either items or other spider tasks.

Tasks are stored in Django database and can be manipulated using Django admin interface. Django admin allows you to add new tasks, view status of tasks, filter tasks.

Tasks have similar properties to scrapy Requests, except they take multiple urls using the start_urls property.

h3. Installation

Django Scraper App functions like a standard Django Application. If you follow a non django code organization then you would install djangoscraper as you would any other django application.

h4. New Django Install

# Create project structure
    <pre><code>
    django-admin startproject example
    scrapy-ctl.py startproject scraper
    mv scraper/* example
    rm -R scraper
    </code></pre>

# Add 'djangoscraper' to INSTALLED_APPS in django's settings.py
# Add 'djangoscraper.commands' to COMMANDS_MODULE in scrapy's settings.py
# To access scrapy from django, add the following code somewhere in django's settings.py
    <pre><code>
    os.environ.setdefault('SCRAPYSETTINGS_MODULE', 'scraper.settings')
    </code></pre>

# To access django from scrapy, add the following code somewhere in scrapy's settings.py
    <pre><code>
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')
    </code></pre>

h4. Add djangoscraper to existing scrapy project

# Create django project
    <pre><code>
    django-admin startproject {django_project_name}
    </code></pre>    

# Move scraper into django project
    <pre><code>
    mv {scraper_project_dir}/* {django_project_name}
    </code></pre>

# Add 'djangoscraper' to INSTALLED_APPS in django's settings.py
# Add 'djangoscraper.commands' to COMMANDS_MODULE in scrapy's settings.py
# To access scrapy from django, add the following code somewhere in django's settings.py
    <pre><code>
    os.environ.setdefault('SCRAPYSETTINGS_MODULE', '{scraper_project_dir}.settings')
    </code></pre>

# To access django from scrapy, add the following code somewhere in scrapy's settings.py
    <pre><code>
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')
    </code></pre>

h3. Configuration

h4. Creating a spider

h3. Usage