adstxtcrawler issues

RENAME.MD

$ cat target_domains.txt
#https://chicagotribune.com
#http://latimes.com/sports
#washingtonpost.com
#https://www.vnads.net/index.html
www.vnads.net/ads.txt

sonvo9900

Accept: text/plain; charset=UTF-8

If a server declares the character set for the text file with "text/plain; charset=UTF-8" (as it should), adstxtcrawler gets an HTTP 406 (Not Acceptable) response instead of downloading the ads.txt...

wrmike1
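The 406 report above can be illustrated with a minimal Python 3 sketch. It assumes the failure comes from an overly strict `Accept` value or an exact string comparison on the media type; the issue excerpt does not confirm the root cause. The names `build_request`, `is_plain_text`, and `ACCEPT_VALUE` are hypothetical and not part of adstxtcrawler.

```python
import urllib.request

# Assumption: prefer text/plain but accept anything as a low-priority
# fallback, so a server doing strict negotiation has no reason to send 406.
ACCEPT_VALUE = "text/plain, */*;q=0.1"


def build_request(url):
    """Build (but do not send) a request that prefers plain text."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_VALUE})


def is_plain_text(content_type):
    """Return True for text/plain, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


if __name__ == "__main__":
    req = build_request("https://example.com/ads.txt")
    print(req.get_header("Accept"))
    print(is_plain_text("text/plain; charset=UTF-8"))
```

Comparing only the media type, rather than the whole header value, means a declared `charset=UTF-8` no longer affects whether the response is treated as plain text.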

fix variable name in single threaded runs

row -> ahost for single threaded runs

galtay

use universal line mode when opening tmp csv file

Open the tmp CSV file in universal newline mode to prevent some errors. I don't think it needs to be opened in binary mode.

galtay
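As a rough illustration of the universal-newline point above, here is a sketch for Python 3, where the `csv` documentation recommends opening files with `newline=""` so that the reader handles `\r`, `\n`, and `\r\n` itself. The crawler's own code may target a different Python version, and `read_rows` and `demo` are hypothetical helpers.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV file in text mode, leaving newline handling to csv."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def demo():
    """Write a file with mixed line endings and read it back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        # Written in binary so the mixed endings reach disk untouched.
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"a.com,1\r\nb.com,2\rc.com,3\n")
        return read_rows(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    for row in demo():
        print(row)
```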

Honour robots.txt

We noticed this bot was crawling/fetching app-ads.txt on one of our dev sites.

```
::ffff:127.0.0.1 - - [17/Jul/2020:19:29:54 +0000] "GET /app-ads.txt HTTP/1.0" 404 - "-" "AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
```

- [ ]...

cesine
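One way a crawler could honour robots.txt, as requested above, is Python's standard `urllib.robotparser`. The sketch below parses rules from a string so it runs offline; a real crawler would first fetch `/robots.txt` from the target host. The helper name `is_allowed` is hypothetical, and the user-agent token is taken from the log line in the issue.

```python
from urllib.robotparser import RobotFileParser

# User-agent token as it appears in the access log quoted in the issue.
USER_AGENT = "AdsTxtCrawler/1.0"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    dev_site_rules = "User-agent: *\nDisallow: /"
    print(is_allowed(dev_site_rules, "https://dev.example.com/app-ads.txt"))
```

A site that disallows everything, as a dev site plausibly would, causes `is_allowed` to return `False`, so the crawler can skip the request entirely.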

./well-known

Please consider moving this to `/.well-known/` https://tools.ietf.org/html/rfc5785

tosh

Following too many Redirects Is Problematic

Test case results in 4-5 redirects due to a paywall. This misconfiguration should be handled gracefully, with a clear maximum on redirects, rather than failing to parse paywall text. - Consider...

BrendanIAB
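The redirect cap suggested above could look roughly like the following Python sketch. It is not the crawler's implementation: `fetch_with_redirect_cap`, `TooManyRedirects`, and `make_fake_fetch` are hypothetical names, the default limit of 3 is an arbitrary assumption, and the `fetch` callable stands in for whatever HTTP client is used (it must return a `(status, location, body)` tuple without following redirects itself).

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


class TooManyRedirects(Exception):
    """Raised when a URL redirects more often than the configured cap."""


def fetch_with_redirect_cap(url, fetch, max_redirects=3):
    """Follow redirects manually, giving up after max_redirects hops."""
    hops = 0
    while True:
        status, location, body = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, status, body
        if hops >= max_redirects:
            raise TooManyRedirects(f"gave up after {hops} redirects at {url}")
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        hops += 1


def make_fake_fetch(chain_length):
    """Build an offline fetch() that redirects chain_length times."""
    def fetch(url):
        step = int(url.rsplit("/", 1)[1])
        if step < chain_length:
            return 302, f"/{step + 1}", ""
        return 200, None, "paywall or ads.txt body"
    return fetch


if __name__ == "__main__":
    print(fetch_with_redirect_cap("http://site.example/0", make_fake_fetch(2)))
```

Raising a dedicated exception lets the caller record "too many redirects" as the reason for skipping a domain, instead of attempting to parse whatever page the chain ends on.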


Crawler doesn't fail if the first data line contains non-schema data

Myprojectgc

Create Imam
