Como testar raspadores mais rápido? | How can we test spiders faster?
I'd like to create a script or at least a list of good practices to improve the process of reviewing spiders. I found these tips from @giuliocc in this comment:
- `-o output.jsonlines`: exports the output items to `data_collection/output.jsonlines` (useful to review if the items are being scraped as intended)
- `-s LOG_FILE=logs.txt`: redirects the logs to `data_collection/logs.txt` (useful to review if anything strange occurred during the crawl)
- `-s FILES_STORE=""`: disables file downloading (useful to run the entire crawl and not fill your hard drive hehe)
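
For reference, putting the three flags together (assuming the command is run from the `data_collection` directory, as the paths above suggest; `<spider_name>` is just a placeholder):

```
cd data_collection
scrapy crawl <spider_name> -o output.jsonlines -s LOG_FILE=logs.txt -s FILES_STORE=""
```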
I was thinking that maybe having a script to extract some metrics could be useful. Here are the metrics I have in mind (a rough sketch follows the list):
- Last edition date
- Average number of gazettes per month/year
- Exceptions and validation errors
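
For the first two, a minimal sketch of such a script, assuming the `output.jsonlines` exported with `-o` above and that each item carries an ISO-formatted `date` field (adjust the field name to the actual Gazette item schema):

```python
import json
from collections import Counter
from datetime import date

# Hypothetical script: summarize the items exported with `-o output.jsonlines`.
# Assumes each item has a "date" field in ISO format (YYYY-MM-DD).
def summarize(path="data_collection/output.jsonlines"):
    dates = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if "date" in item:
                dates.append(date.fromisoformat(item["date"]))

    if not dates:
        print("No dated items found.")
        return

    per_month = Counter((d.year, d.month) for d in dates)
    print(f"Total gazettes scraped: {len(dates)}")
    print(f"Last edition date: {max(dates)}")
    print(f"Average gazettes per month: {len(dates) / len(per_month):.1f}")


if __name__ == "__main__":
    summarize()
```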
What other metrics would you add?
Some useful metrics:
- Returned HTTP status codes when fetching pages, grouped into 2xx, 4xx and 5xx
- Number of retries (if any) needed to download a file
- Database errors, constraint violations, failures to save the data
- Latency between page fetches: how long it takes to process the data and request the next page, which helps predict how long the whole collection will take
- Time span (start and end time) to collect all the requested data
- Dependency latency and errors: failures to save to S3, Telegram bot errors, etc.
- Error rate: the percentage of executions that end in errors or fatal failures
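
For what it's worth, several of these seem to be tracked already by Scrapy's built-in stats collector (e.g. `downloader/response_status_count/<code>`, `retry/count`, `log_count/ERROR`, `start_time`/`finish_time`). A rough sketch of an extension that summarizes them when the spider closes (the class name is made up, and exact stat keys may vary between Scrapy versions):

```python
from scrapy import signals


class MetricsSummary:
    """Sketch: log a few of the metrics above from Scrapy's built-in stats."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        stats = self.stats.get_stats()

        # Group HTTP status codes into 2xx/4xx/5xx classes
        by_class = {}
        for key, value in stats.items():
            if key.startswith("downloader/response_status_count/"):
                status_class = key.rsplit("/", 1)[-1][0] + "xx"
                by_class[status_class] = by_class.get(status_class, 0) + value

        spider.logger.info("HTTP status codes by class: %s", by_class)
        spider.logger.info("Retries: %s", stats.get("retry/count", 0))
        spider.logger.info("Errors logged: %s", stats.get("log_count/ERROR", 0))
        spider.logger.info(
            "Crawl span: %s -> %s", stats.get("start_time"), stats.get("finish_time")
        )
```

To try it out, it would need to be registered in the `EXTENSIONS` setting of the Scrapy project.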
Nice! Good ones, @jaswdr. I'm still trying to figure out how to capture metrics like exceptions and their details. If you have any ideas, just let me know. :) Also, my PR covers the basics and we can add the rest of these later.
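
If it helps, I believe Scrapy already counts callback exceptions under `spider_exceptions/<ExceptionName>` in its stats, and the `spider_error` signal gives access to the full failure. A hypothetical sketch (class name and the `custom/exceptions/` stat prefix are made up here):

```python
from scrapy import signals


class ExceptionStats:
    """Sketch: count callback exceptions by type and log their details."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        ext.stats = crawler.stats
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        exc = failure.value
        # "custom/exceptions/" is an arbitrary prefix chosen for this sketch
        self.stats.inc_value(f"custom/exceptions/{type(exc).__name__}")
        spider.logger.error("Exception while parsing %s: %r", response.url, exc)
```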