adstxtcrawler

A reference implementation in Python of a simple crawler for Ads.txt

Results 13 adstxtcrawler issues

$ cat target_domains.txt
#https://chicagotribune.com
#http://latimes.com/sports
#washingtonpost.com
#https://www.vnads.net/index.html
www.vnads.net/ads.txt

If a server declares the character set for the text file with "text/plain; charset=UTF-8" (as it should), adstxtcrawler gets an HTTP 406 (Not Acceptable) response instead of downloading the ads.txt...
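The excerpt does not say where the 406 originates. Assuming a strict media-type comparison is involved somewhere, one way to tolerate the charset parameter is to compare only the part before the first semicolon. This is a minimal sketch, not the crawler's actual code; `is_plain_text` is a hypothetical helper.

```python
def is_plain_text(content_type):
    """Return True if a Content-Type header value names text/plain.

    Parameters such as charset=UTF-8 are ignored. Hypothetical helper,
    not part of adstxtcrawler.
    """
    if not content_type:
        return False
    # Keep only the media type, dropping "; charset=..." and similar.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


print(is_plain_text("text/plain; charset=UTF-8"))  # True
print(is_plain_text("text/html; charset=UTF-8"))   # False
```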

row -> ahost for single threaded runs

Open the tmp CSV file in universal newline mode to prevent some errors. I don't think it needs to be opened in binary mode.
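For comparison, a minimal Python 3 sketch: the `csv` module documentation recommends opening files in text mode with `newline=""`, which leaves line endings untranslated so the reader can split rows on `\n`, `\r\n`, or `\r` itself. The code this change touches is not shown here; the `read_rows` helper and the sample data are assumptions for illustration.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read all CSV rows from path in text mode (hypothetical helper)."""
    # newline="" passes line endings through untouched, so csv.reader
    # decides where each row ends.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Write a sample file with mixed line endings, then read it back.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"example.com,pub-1,DIRECT\r\nexample.org,pub-2,RESELLER\r")
    sample_path = tmp.name

rows = read_rows(sample_path)
os.remove(sample_path)
print(rows)
```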

We noticed this bot was crawling/fetching app-ads.txt on one of our dev sites. ``` ::ffff:127.0.0.1 - - [17/Jul/2020:19:29:54 +0000] "GET /app-ads.txt HTTP/1.0" 404 - "-" "AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler" ``` - [ ]...

Please consider moving this to `/.well-known/` https://tools.ietf.org/html/rfc5785

Test case results in 4-5 redirects due to a paywall. This misconfiguration should be handled gracefully, with a clear maximum on redirects, rather than failing to parse the paywall text. - Consider...
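A redirect cap along these lines could look like the sketch below. It is not the crawler's code: `fetch_with_redirect_cap` is a hypothetical helper, and `fetch` is an assumed caller-supplied function returning `(status_code, location, body)`, which keeps the example runnable without network access.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_with_redirect_cap(url, fetch, max_redirects=3):
    """Follow at most max_redirects redirects, then give up.

    Returns the final body, or None when the cap is exceeded.
    """
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status in REDIRECT_CODES and location:
            url = location
            continue
        return body
    return None  # too many redirects: skip the host instead of parsing


# Simulated paywall that bounces the request five times before answering.
start = "http://example.com/ads.txt"
hops = {start: (302, "http://example.com/wall?n=1", "")}
for n in range(1, 5):
    hops["http://example.com/wall?n=%d" % n] = (
        302, "http://example.com/wall?n=%d" % (n + 1), "")
hops["http://example.com/wall?n=5"] = (200, None, "<html>paywall</html>")

print(fetch_with_redirect_cap(start, hops.get))  # None
```

Returning `None` lets the caller log the host and move on rather than handing paywall HTML to the parser.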

Test case results in an HTML file being sent to the parser, generating a Unicode parsing error. - Consider verifying that the first non-comment line contains exactly 3 or 4 fields around adstxt_crawler.py...
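The suggested check could be sketched as follows. This is an illustration under stated assumptions, not code from adstxt_crawler.py: `looks_like_ads_txt` is a hypothetical name, and fields are assumed to be comma-separated with `#` starting a comment.

```python
def looks_like_ads_txt(text):
    """Check that the first non-comment, non-blank line has 3 or 4 fields."""
    for line in text.splitlines():
        record = line.split("#", 1)[0].strip()  # drop comments
        if not record:
            continue  # blank or comment-only line
        fields = [field.strip() for field in record.split(",")]
        # The first three fields must be present and non-empty.
        return len(fields) in (3, 4) and all(fields[:3])
    return False  # empty file or comments only


sample = "# ads.txt\ngoogle.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0\n"
print(looks_like_ads_txt(sample))                      # True
print(looks_like_ads_txt("<!DOCTYPE html>\n<html>"))   # False
```

A file whose first record is a variable declaration such as `contact=` would fail this check, so a real crawler may want to skip those lines before counting fields.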

google.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0