job_scraper

A job scraper using the Scrapy framework

Simple Job Scraper

~~Searches Stackoverflow & Dice for jobs and saves the results to DynamoDB.~~

Job site aggregator. Scrapes results from multiple job sites and returns them to a web page.

Uses AWS Lambda, Python, Scrapy & Travis CI.

Will eventually use Django for the web app.

To Do (Adapt to new architecture)

  • Scrapy saves job items as a list of dictionaries (one per job).
  • Convert the list of dicts to a JSON object.
  • Return the JSON of processed jobs from the AWS Lambda function, and build the job elements on the page from the returned JSON.
  • Invoke the Lambda function directly from the static page using the AWS JavaScript SDK. (Remove DynamoDB.)
  • Add a search box and button to the front end to invoke the Lambda function. (Start with job titles.)
  • Pass search arguments into Scrapy, supplied from the static page through the JavaScript Lambda invocation.
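The first three items and the last one could fit together roughly as in the sketch below. It is only an illustration under stated assumptions: `fake_scrape` is a hypothetical stand-in for the real Scrapy run (which might be wrapped with `scrapydo.run_spider`, as the resources suggest), the `query` payload key is an assumed name, and the job fields are made-up sample data.

```python
import json


def fake_scrape(query):
    """Hypothetical stand-in for the Scrapy run; returns one dict per job."""
    jobs = [
        {"title": "Python Developer", "company": "Example Co",
         "url": "https://example.com/jobs/1"},
        {"title": "Data Engineer", "company": "Sample Inc",
         "url": "https://example.com/jobs/2"},
    ]
    # Keep only jobs whose title contains the search term (case-insensitive).
    return [job for job in jobs if query.lower() in job["title"].lower()]


def lambda_handler(event, context, scrape=fake_scrape):
    """Handler in the usual Lambda shape: it receives (event, context)."""
    # "query" is an assumed payload key sent by the page's search box.
    query = (event or {}).get("query", "")
    jobs = scrape(query)
    # The Lambda Python runtime serializes a JSON-compatible return value,
    # so the list of dicts can be returned inside a plain dict.
    return {"query": query, "count": len(jobs), "jobs": jobs}


if __name__ == "__main__":
    result = lambda_handler({"query": "python"}, None)
    # json.dumps shows the JSON text the page would receive.
    print(json.dumps(result, indent=2))
```

The `scrape` parameter exists only so the scraper can be swapped for the real one (or a test double) without changing the handler.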

Avoid using API Gateway and DynamoDB. Invoke the Lambda function directly from the page and return the results. There is no need to store results long term if the scrape is fast enough!
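The page itself would make this call with the AWS JavaScript SDK. For trying the direct-invocation idea from a script, a Python equivalent might look like the sketch below. The function name `job_scraper` and the `query` payload key are assumptions, and `FakeLambdaClient` is a hypothetical offline stand-in so the example runs without AWS credentials; in real use the client would come from `boto3.client("lambda")`.

```python
import io
import json


def invoke_scraper(client, query, function_name="job_scraper"):
    """Synchronously invoke the Lambda function and decode its JSON result."""
    response = client.invoke(
        FunctionName=function_name,          # assumed function name
        InvocationType="RequestResponse",    # wait for the result
        Payload=json.dumps({"query": query}).encode("utf-8"),
    )
    # The function's output arrives as a readable body of JSON bytes.
    return json.loads(response["Payload"].read())


class FakeLambdaClient:
    """Hypothetical stand-in that mimics the shape of an invoke response."""

    def invoke(self, FunctionName, InvocationType, Payload):
        query = json.loads(Payload)["query"]
        body = json.dumps({"query": query, "jobs": []}).encode("utf-8")
        return {"StatusCode": 200, "Payload": io.BytesIO(body)}


if __name__ == "__main__":
    # Real use (requires boto3 and credentials):
    #   invoke_scraper(boto3.client("lambda"), "python developer")
    print(invoke_scraper(FakeLambdaClient(), "python developer"))
```

No API Gateway or DynamoDB is involved here: the caller talks to the function directly and gets the results back in the same request.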

Resources:

Scrapydo documentation (See the scrapydo.run_spider example)

Pass user-defined arguments into Scrapy spiders

Save scrape results into a list of dicts

Convert a list of dicts to JSON in Python

Invoke a Lambda function from JavaScript

Building a Python AWS Lambda deployment package


Old Resources

Build An API To Expose An AWS Lambda Function

Run Scrapy From A Script

Scrapy Script

Scrapy Throws ReactorNotRestartable on AWS Lambda

Create an AWS Lambda Deployment Package for Python

AWS Lambda Function Handler for Python