
Scraper for Argentina

Open shaperilio opened this issue 4 years ago • 1 comment

https://www.argentina.gob.ar/coronavirus/informe-diario

This is from the federal government. They are publishing two PDFs per day. "Vespertino" = evening, "Matutino" = morning. They're probably meeting minutes.

Pros:

  • They maintain previous days' files online (but we should start caching just in case).
  • Later PDFs have cases tabulated by province.

Cons:

  • PDFs
  • Inconsistent filenames (we must parse the HTML links on the page to find the PDFs; see the sketch after this list)
  • Additional information / data in paragraph form
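
For the link-parsing con above, here is a minimal Python sketch of the extraction step, independent of whatever my branch actually does. It assumes the report page exposes its PDFs as ordinary `<a href="...pdf">` anchors; the selector and URL handling are illustrative guesses, not the project's code:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://www.argentina.gob.ar/coronavirus/informe-diario"

def get_pdf_links(page_url: str = PAGE_URL) -> list[str]:
    """Return absolute URLs for every PDF linked from the report page."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(page_url, a["href"])          # resolve relative hrefs
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".pdf")  # keep only PDF links
    ]
```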

shaperilio · Apr 03 '20, 20:04

I've made an initial attempt at this in my argentina branch. I need help, because what I'm doing breaks caching.

From my slack message:

In short: there's a webpage with links to PDFs; as of late there are two PDFs per day. So the strategy is to 1) parse the main page to get the links, and 2) work out which PDFs correspond to the desired scrape date.
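
A hedged sketch of step 2. Since the filenames are inconsistent, how the date actually appears in each link is a guess here (dd-mm-yy or dd-mm-yyyy in the URL); a real version would likely need more patterns or would match on the link text instead:

```python
from datetime import date

def pdfs_for_date(links: list[str], day: date) -> list[str]:
    """Keep only the links whose URL mentions the target date."""
    # Guessed date formats; the real links may encode the date differently.
    patterns = [
        day.strftime("%d-%m-%y"),   # e.g. 01-04-20
        day.strftime("%d-%m-%Y"),   # e.g. 01-04-2020
    ]
    return [url for url in links if any(p in url for p in patterns)]
```

If the guessed patterns hold, `pdfs_for_date(get_pdf_links(), date(2020, 4, 1))` would return the morning and evening reports for April 1st.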

Their page keeps the old PDFs online, but our cache doesn't. So I did something that's probably inappropriate: I force downloading the files even if they're not in the cache. That way, if I try to scrape April 1st, it will fetch the main page, find the two PDFs for April 1st, and cache them. But the way it works now, a run today will also replace the files already in the April 1st cache with whatever is retrieved today.
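
One possible way around the overwrite, sketched against a plain filesystem cache keyed by scrape date. The layout, paths, and helper name are hypothetical, not the project's real cache API; the point is only to skip the network fetch whenever a dated entry already exists, so a re-run today can never clobber what was cached for April 1st:

```python
from datetime import date
from pathlib import Path

import requests

CACHE_ROOT = Path("cache/argentina")  # hypothetical cache location

def fetch_cached(url: str, day: date) -> bytes:
    """Return the PDF bytes for `day`, hitting the network only on a cache miss."""
    cache_file = CACHE_ROOT / day.isoformat() / url.rsplit("/", 1)[-1]
    if cache_file.exists():
        return cache_file.read_bytes()  # never overwrite a dated entry
    data = requests.get(url, timeout=60).content
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(data)
    return data
```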

Eventually, parsing these will be another story, but for now I'd like someone to take a look and suggest a better way to at least retrieve all the PDFs and store them in our cache. I assumed this strategy was better than caching every PDF for every day, but maybe that's the better approach?

shaperilio · Apr 05 '20, 17:04