
Como testar raspadores mais rápido? | How can we test spiders faster?

anapaulagomes opened this issue 3 years ago • 2 comments

I'd like to create a script or at least a list of good practices to improve the process of reviewing spiders. I found these tips from @giuliocc in this comment:

  • -o output.jsonlines: exports the scraped items to data_collection/output.jsonlines (useful for reviewing whether the items are being scraped as intended)
  • -s LOG_FILE=logs.txt: redirects the logs to data_collection/logs.txt (useful for reviewing whether anything strange occurred during the crawl)
  • -s FILES_STORE="": disables file downloading (useful for running the entire crawl without filling your hard drive hehe)
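Combined, the flags above give a review-friendly run. A minimal sketch (the spider name `my_city_spider` is a placeholder, and the command assumes it is run from the data_collection directory):

```shell
# Run one spider with review-friendly settings:
# items -> output.jsonlines, logs -> logs.txt, file downloads disabled
scrapy crawl my_city_spider \
  -o output.jsonlines \
  -s LOG_FILE=logs.txt \
  -s FILES_STORE=""
```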

I was thinking that maybe having a script to extract some metrics could be useful. Here are the metrics I have in mind:

  • Last edition date
  • Average number of gazettes per month/year
  • Exceptions and validation errors
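A minimal sketch of such a script, computing the first two metrics from the JSON lines output. It assumes each item has a `date` field in ISO format (YYYY-MM-DD); the field name and sample data are assumptions for illustration:

```python
import json
from collections import Counter
from datetime import date

def gazette_metrics(lines):
    """Compute review metrics from scraped gazette items (JSON lines).

    Assumes each item carries a 'date' field in ISO format
    (YYYY-MM-DD); the field name is an assumption.
    """
    dates = [date.fromisoformat(json.loads(line)["date"]) for line in lines]
    if not dates:
        return None
    # Count gazettes per (year, month) to get a monthly average
    per_month = Counter((d.year, d.month) for d in dates)
    return {
        "last_edition_date": max(dates),
        "average_gazettes_per_month": len(dates) / len(per_month),
    }

# Inline sample standing in for data_collection/output.jsonlines:
sample = [
    '{"date": "2021-07-01"}',
    '{"date": "2021-07-15"}',
    '{"date": "2021-08-02"}',
]
metrics = gazette_metrics(sample)
print(metrics["last_edition_date"])           # 2021-08-02
print(metrics["average_gazettes_per_month"])  # 1.5
```

In a real run the lines would come from reading output.jsonlines, and exceptions/validation errors would be pulled from the log file instead.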

What else comes to mind?

anapaulagomes avatar Aug 13 '21 07:08 anapaulagomes

Some useful metrics:

  1. Returned HTTP status codes when fetching pages, grouped by 2xx, 4xx and 5xx
  2. Number of retries (if any) needed to download a file
  3. Database errors: constraint violations, failures to save the data
  4. Latency between page fetches, i.e. how long it takes to process the data and request a new page; helps predict how long it will take to collect everything
  5. Total span (start and end time) of collecting all the requested data
  6. Dependency latency and errors: failures to save to S3, the Telegram bot, etc.
  7. Error rate: the percentage of executions that result in errors or fatal failures
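Metrics 1 and 7 could be derived from the stats dict that Scrapy's default stats collector produces at the end of a crawl. A sketch, assuming stats keys of the form `downloader/response_status_count/<code>` (the exact keys depend on the Scrapy version, and the numbers below are made up):

```python
from collections import Counter

def status_code_groups(stats):
    """Group response-status counts into 2xx/4xx/5xx buckets.

    'stats' is assumed to look like Scrapy's stats collector output,
    with keys such as 'downloader/response_status_count/200'.
    """
    groups = Counter()
    prefix = "downloader/response_status_count/"
    for key, count in stats.items():
        if key.startswith(prefix):
            code = key[len(prefix):]
            groups[code[0] + "xx"] += count  # '200' -> '2xx'
    return dict(groups)

def error_rate(stats):
    """Fraction of responses that were 4xx/5xx errors."""
    groups = status_code_groups(stats)
    total = sum(groups.values())
    errors = groups.get("4xx", 0) + groups.get("5xx", 0)
    return errors / total if total else 0.0

# Example stats dump (values are illustrative only):
stats = {
    "downloader/response_status_count/200": 95,
    "downloader/response_status_count/404": 3,
    "downloader/response_status_count/500": 2,
    "retry/count": 2,
}
print(status_code_groups(stats))  # {'2xx': 95, '4xx': 3, '5xx': 2}
print(error_rate(stats))          # 0.05
```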

jaswdr avatar Aug 25 '21 20:08 jaswdr

Nice! Good ones @jaswdr. I'm still trying to find my way into collecting metrics like exceptions and their details. If you have any ideas, just let me know. :) Also, my PR covers the basics, and we can add all of them later.

anapaulagomes avatar Aug 26 '21 08:08 anapaulagomes