Como testar raspadores mais rápido? | How can we test spiders faster?
I'd like to create a script or at least a list of good practices to improve the process of reviewing spiders. I found these tips from @giuliocc in this comment:
- `-o output.jsonlines`: exports the output items to `data_collection/output.jsonlines` (useful to review if the items are being scraped as intended)
- `-s LOG_FILE=logs.txt`: redirects the logs to `data_collection/logs.txt` (useful to review if anything strange occurred during the crawl)
- `-s FILES_STORE=""`: disables file downloading (useful to run the entire crawl and not fill your hard drive hehe)
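
For reference, putting the three flags together (assuming the command is run from the `data_collection` directory, as the paths above suggest; `<spider_name>` is just a placeholder):

```
cd data_collection
scrapy crawl <spider_name> -o output.jsonlines -s LOG_FILE=logs.txt -s FILES_STORE=""
```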
I was thinking that maybe having a script to extract some metrics could be useful. Here are the metrics I have in mind (a rough sketch follows the list):
- Last edition date
- Average number of gazettes per month/year
- Exceptions and validation errors
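
For the first two, a minimal sketch of such a script, assuming the `output.jsonlines` exported with `-o` above and that each item carries an ISO-formatted `date` field (adjust the field name to the actual Gazette item schema):

```python
import json
from collections import Counter
from datetime import date

# Hypothetical script: summarize the items exported with `-o output.jsonlines`.
# Assumes each item has a "date" field in ISO format (YYYY-MM-DD).
def summarize(path="data_collection/output.jsonlines"):
    dates = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if "date" in item:
                dates.append(date.fromisoformat(item["date"]))

    if not dates:
        print("No dated items found.")
        return

    per_month = Counter((d.year, d.month) for d in dates)
    print(f"Total gazettes scraped: {len(dates)}")
    print(f"Last edition date: {max(dates)}")
    print(f"Average gazettes per month: {len(dates) / len(per_month):.1f}")


if __name__ == "__main__":
    summarize()
```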
What other metrics would you add?
Some useful metrics:
- Returned HTTP status codes when fetching pages, grouped into 2xx, 4xx and 5xx
- Number of retries (if any) needed to download a file
- Database errors, constraint violations, failures to save the data
- Latency between page fetches: how long it takes to process the data and request the next page, which helps predict how long the whole collection will take
- Time span (start and end time) to collect all the requested data
- Dependency latency and errors: failures to save to S3, Telegram bot errors, etc.
- Error rate: the percentage of executions that end in errors or fatal failures
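
For what it's worth, several of these seem to be tracked already by Scrapy's built-in stats collector (e.g. `downloader/response_status_count/<code>`, `retry/count`, `log_count/ERROR`, `start_time`/`finish_time`). A rough sketch of an extension that summarizes them when the spider closes (the class name is made up, and exact stat keys may vary between Scrapy versions):

```python
from scrapy import signals


class MetricsSummary:
    """Sketch: log a few of the metrics above from Scrapy's built-in stats."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        stats = self.stats.get_stats()

        # Group HTTP status codes into 2xx/4xx/5xx classes
        by_class = {}
        for key, value in stats.items():
            if key.startswith("downloader/response_status_count/"):
                status_class = key.rsplit("/", 1)[-1][0] + "xx"
                by_class[status_class] = by_class.get(status_class, 0) + value

        spider.logger.info("HTTP status codes by class: %s", by_class)
        spider.logger.info("Retries: %s", stats.get("retry/count", 0))
        spider.logger.info("Errors logged: %s", stats.get("log_count/ERROR", 0))
        spider.logger.info(
            "Crawl span: %s -> %s", stats.get("start_time"), stats.get("finish_time")
        )
```

To try it out, it would need to be registered in the `EXTENSIONS` setting of the Scrapy project.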
Nice! Good ones, @jaswdr. I'm still trying to figure out how to capture metrics like exceptions and their details. If you have any ideas, just let me know. :) Also, my PR covers the basics and we can add the rest of these later.
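
If it helps, I believe Scrapy already counts callback exceptions under `spider_exceptions/<ExceptionName>` in its stats, and the `spider_error` signal gives access to the full failure. A hypothetical sketch (class name and the `custom/exceptions/` stat prefix are made up here):

```python
from scrapy import signals


class ExceptionStats:
    """Sketch: count callback exceptions by type and log their details."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        ext.stats = crawler.stats
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        exc = failure.value
        # "custom/exceptions/" is an arbitrary prefix chosen for this sketch
        self.stats.inc_value(f"custom/exceptions/{type(exc).__name__}")
        spider.logger.error("Exception while parsing %s: %r", response.url, exc)
```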