
Consider adding practical tips and considerations on scraping large datasets to part 2

weilu opened this issue 3 years ago · 0 comments

https://github.com/worldbank/dime-python-training/blob/31ed34d7053eedbf5b73bc4af22c2f819ca607a9/I%20-%20Introduction/slides/intro-to-python-CE-part2.tex#L110

Practically, there are some considerations when it comes to scraping that might be worth highlighting:

  • Error handling – if unhandled, errors can stop or crash a scraper script and lose data already scraped and held in memory. Tackling this usually involves extensive testing of the scripts, adding conditional checks (e.g. that an element exists before getting content from its child element), and/or catching discovered exceptions and saving partial data where necessary.
  • Progress tracking – simple measures such as keeping the download order consistent (chronological, or some other sort order), saving the downloaded HTML files, and marking sections of the data with "done" flags not only help monitor progress but also allow stopping and restarting without repeating work.
  • Consider using a scraping library like scrapy, as it provides a framework that already takes care of the common considerations and best practices in web scraping.
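The first two points above can be sketched in a few lines of stdlib Python. Everything here is hypothetical (the file names, the row contents, and the fact that I just record page length instead of parsing); the point is the shape: append-mode output so completed rows survive a crash, a "done" file so a restart skips finished work, and a try/except so one bad page doesn't kill the run.

```python
import csv
import json
import os
import urllib.request


def scrape(urls, out_csv="records.csv", done_file="done.json"):
    """Scrape a list of URLs, resuming past progress and surviving failures."""
    # Load the set of URLs already processed, so a restart resumes work.
    done = set()
    if os.path.exists(done_file):
        with open(done_file) as f:
            done = set(json.load(f))

    # Append mode: rows written in earlier (or partial) runs are kept.
    with open(out_csv, "a", newline="") as f:
        writer = csv.writer(f)
        for url in sorted(urls):  # consistent order makes progress easy to follow
            if url in done:
                continue
            try:
                html = urllib.request.urlopen(url, timeout=30).read()
                # ... parse html here, checking elements exist before
                # descending into their children ...
                writer.writerow([url, len(html)])
            except Exception as exc:
                # Log and move on instead of crashing the whole run.
                print(f"failed on {url}: {exc}")
                continue
            done.add(url)
            with open(done_file, "w") as f2:
                json.dump(sorted(done), f2)
```

Rewriting the done file after every page is deliberately conservative; batching those writes is an obvious optimization once the page count grows.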

On large datasets:

In my past scraping projects I have often found df.to_csv replaced with csvwriter.writerow, as the latter is more memory-friendly when the dataset is large and the scraping process long-running. Of course there is more than one way to handle situations like this; for example, one can save the records extracted from each HTML page into its own csv and have a separate script consolidate them.
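For illustration, the row-at-a-time pattern looks something like this (the column names and record source are made up). Unlike building a full DataFrame and calling df.to_csv at the end, each row hits disk as soon as it is scraped, so memory stays flat and completed rows survive a mid-run crash:

```python
import csv


def write_incrementally(records, path):
    """Write scraped records one row at a time instead of buffering them all."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title"])  # header written once up front
        for rec in records:  # records can be a generator fed by the scraper
            writer.writerow([rec["url"], rec["title"]])
```

Opening in append mode (and writing the header only when the file is new) is the natural extension if the scraper needs to stop and restart.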

weilu · Jan 28 '22 16:01