
:tada: DS001 - Scraping to Analysis (Extra Store)

This project is a basic pipeline of extracting, transforming, loading, analyzing, and presenting data. All of it was done with suitable tools for web scraping, data analysis/presentation, and databases.

Objectives:

  • Create a crawler able to scrape offers and reviews from the Extra web store, more specifically, offers and reviews about coolers, televisions, and printers;
  • Save the data to a database in an automated way;
  • Analyze product and review data;
  • Create a basic presentation using Extra offers information.

:computer: Step 1. Code code... and code

Getting website information programmatically requires a crawler (the crawler in the DS001 project was built with Python and Scrapy). By inspecting the Extra web store's source code and the requests the browser makes, we can spot some API URLs being triggered. Scraping those API URLs directly makes the work much easier.
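A minimal sketch of what such an API-based spider could look like; the endpoint URL, pagination field, and JSON layout below are assumptions for illustration, not the real Extra API:

```python
# Hypothetical API-based Scrapy spider. The real URL, parameters and
# JSON field names must be taken from the requests seen in the browser.
import json
import scrapy


class CoolersSpider(scrapy.Spider):
    name = "coolers"
    # Assumed API endpoint discovered via the browser's network tab.
    start_urls = ["https://api.extra.example/v1/products?category=coolers&page=1"]

    def parse(self, response):
        payload = json.loads(response.text)
        for product in payload.get("products", []):  # assumed JSON layout
            yield {
                "id": product.get("id"),
                "name": product.get("name"),
                "price": product.get("price"),
            }
        # Follow pagination if the API reports a next page (assumed field).
        next_page = payload.get("nextPage")
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```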

:motorway: Step 2. Choose a way to scrape and save the data

Since review data can be extracted while scraping offer data, a good approach is to split the work into three spiders (coolers, televisions, and printers) without creating additional review-only spiders. Review objects are larger than offer objects, but the impact of scraping both together in each spider isn't too severe. The crawler itself saves the data to a MongoDB database through the files "pipelines.py" and "items.py".
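A sketch of that setup, following the standard Scrapy + pymongo pipeline pattern from the Scrapy documentation; the item fields, settings keys, and collection names are assumptions:

```python
# items.py -- sketch of the two item types (fields are assumptions).
import scrapy


class OfferItem(scrapy.Item):
    product_id = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()


class ReviewItem(scrapy.Item):
    product_id = scrapy.Field()
    rating = scrapy.Field()
    text = scrapy.Field()


# pipelines.py -- classic Scrapy + pymongo pipeline; enable it in
# settings.py via ITEM_PIPELINES. In a real project the items above
# would live in items.py and be imported here.
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "extra_store"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Route offers and reviews to separate collections (assumed names).
        collection = "reviews" if isinstance(item, ReviewItem) else "products"
        self.db[collection].insert_one(dict(item))
        return item
```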

:spider: Step 3. Run the spiders

Running the spiders with the command `scrapy crawl <SPIDER_NAME>`:

*(screenshot: step_3.1)*

So...

*(screenshot: step_3.2)*
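As an alternative to three separate `scrapy crawl` invocations, Scrapy's `CrawlerProcess` can launch all three spiders from a single script; a sketch assuming the spider names match the categories above:

```python
# Run all three spiders from one script (they run concurrently in the
# same process). Spider names are assumptions.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for spider_name in ("coolers", "televisions", "printers"):
    process.crawl(spider_name)
process.start()  # blocks until all crawls finish
```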

:floppy_disk: Step 4. Wait...

Data being saved in the MongoDB database:

*(screenshot: step_4)*

  • Product data format in the database:

*(screenshot: step_4.1)*

  • Review data format in the database:

*(screenshot: step_4.2)*

I stopped the crawlers early because of the deadline to deliver the case :flushed:. So, the result... about 31k documents saved in the MongoDB database.

*(screenshot: step_4.3)*

:dark_sunglasses: Step 5. Getting a first understanding of the data

MongoDB has its own tools for basic in-database data analysis:

*(screenshot: step_5)*
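For example, a quick aggregation via pymongo (rather than the GUI tools shown in the screenshot) can summarize the collections; the database, collection, and field names here are assumptions:

```python
# First look at the data: count documents and average price per
# category with a MongoDB aggregation pipeline. Names are assumptions.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
db = client["extra_store"]

pipeline = [
    {"$group": {
        "_id": "$category",
        "count": {"$sum": 1},
        "avg_price": {"$avg": "$price"},
    }},
    {"$sort": {"count": -1}},
]
for row in db["products"].aggregate(pipeline):
    print(row)
```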

:chart_with_upwards_trend: Step 6. Making a deeper descriptive analysis

In a Jupyter Notebook, some incredible things can be done. Python is a really flexible and versatile programming language, and with libraries/packages like Matplotlib, Pandas, NumPy, and Seaborn, a complete descriptive analysis is within reach.

*(screenshot: step_6)*
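A minimal sketch of that notebook workflow, assuming the same database layout as above and a numeric `price` field:

```python
# Pull the scraped documents into a DataFrame and run a quick
# descriptive analysis. Database and column names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
import pymongo
import seaborn as sns

client = pymongo.MongoClient("mongodb://localhost:27017")
df = pd.DataFrame(client["extra_store"]["products"].find({}, {"_id": 0}))

print(df.describe())               # summary statistics for numeric columns
sns.histplot(data=df, x="price")   # price distribution (assumed column)
plt.show()
```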

:art: Step 7. Exporting data and making a simple presentation

  • Exporting product data from MongoDB as CSV:

*(screenshot: step_7.1)*

  • Exporting review data from MongoDB as CSV:

*(screenshot: step_7.2)*
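As an alternative to MongoDB's export tooling shown above, the same CSVs can be produced with pymongo and pandas; a sketch assuming the collection names used earlier:

```python
# Export both collections to CSV (an alternative to mongoexport).
# Database and collection names are assumptions.
import pandas as pd
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
db = client["extra_store"]

for collection, filename in [("products", "products.csv"),
                             ("reviews", "reviews.csv")]:
    df = pd.DataFrame(db[collection].find({}, {"_id": 0}))
    df.to_csv(filename, index=False)
```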

The whole presentation was made in Power BI Desktop, an awesome tool for data visualization and presentation.

  • Interactive charts presentation on a computer:

*(screenshot: step_7.3)*

  • Interactive charts presentation on a smartphone:

*(screenshot: step_7.3)*

:rocket: The end.