
Ways of handling problematic WARC records

Open anjackson opened this issue 2 years ago • 1 comments

We've found some weird WARCs, where the response payload starts with server access-log lines instead of an HTTP status line, looking like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/
WARC-Date: 2017-09-19T03:35:35Z
WARC-IP-Address: 176.58.112.27
WARC-Payload-Digest: sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU
WARC-Record-ID: <urn:uuid:d15353f7-1bb7-4441-92bf-1f2268639d52>
Content-Type: application/http; msgtype=response
Content-Length: 7026

19/Sep/2017:03:35:35 +0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET /wp-content/uploads/2015/05/i6d2e3jOCVVc-e1432221090328.jpg HTTP/1.1||
19/Sep/2017: 03:35:35 +0000|v1|24.18.58.84|thestar.ie|200|73232|162.13.191.183:80|0.061|0.374|GET /wp-content/uploads/2015/12/video-woman-abusing-mcdonalds-cookies-brandy-wooten-353018.jpg HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|5.62.39.244|markom2020.no|403|0|35.197.196.129:80|0.339|0.339|GET /?author=1 HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|54.82.184.78|thestar.ie|200|0|162.13.191.183:80|0.389|0.389|HEAD /about-us/out-in-the-open-ace-back-at-work-hours-after-pittsburgh-defeat/ HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|69.162.124.230|www.adventure-holidays.ie|301|178|35.197.246.117:80|0.022|0.022|GET / HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|180.76.15.136|www.mobyaffiliates.com|200|20225|35.197.249.238:80|0.945|0.945|GET /mobile-advertising-networks/?key-markets=japan+indonesia&targeting=custom+operator HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|51.255.71.100|thestar.ie|200|32351|162.13.191.183:80|0.416|0.416|GET /about-us/sharon-corr-we-dont-judge-age/ HTTP/1.0||
19/Sep/2017: 03:35:37 +0000|v1|5.9.60.241|gullfoss.is|200|32745|35.197.192.76:80|2.819|2.819|GET /shop/?_wpnonce=9c17844d42&add_to_wishlist=3015 HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|188.163.72.15|www.alnouran.com|200|18017|35.189.109.142:80|0.006|0.006|GET /en/corporate-governance/corporate-social-responsibilities/ HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|178.154.200.9|canieatthere.eu|301|178|104.155.26.132:80|0.018|0.018|GET /robots.txt HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|141.8.142.44|canieatthere.co.uk|301|178|104.155.26.132:80|0.016|0.016|GET / HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|54.80.111.161|ravatherm.com|200|201784|104.199.60.90:80|0.071|0.071|GET /files/2016/03/DoP_RAVATHERM_300WB180_SK.pdf HTTP/1.0||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|2117|35.189.99.79:80|0.060|0.060|GET /wp-content/uploads/2017/05/twitter.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|4746|35.189.99.79:80|0.060|0.060|GET /wp-content/uploads/2017/05/google-plus.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|1746|35.189.99.79:80|0.061|0.061|GET /wp-content/uploads/2017/05/facebook-icon.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|45.55.55.18|www.lanzarotesurf.com|200|28740|35.197.214.99:80|1.859|1.859|GET /es/reservas/surf-camp-nivel-intermedio/ HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|207.46.13.65|www.stickybottle.com|200|11209|134.213.209.62:80|1.007|1.007|GET /latest-news/hotly-contest-shay-elliott-memorial-in-prospect-as-top-men-fine-tune-ras-form/ HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|66.249.85.10|nutroexpertos.com|200|33426|35.189.69.242:80|0.022|0.022|GET /wp-content/uploads/2015/05/Ejercicio-perro-484x330.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|157.55.39.239|www.janminihane.co.uk|200|4319|35.197.245.96:80|0.017|0.017|GET /wp-includes/js/jquery/jquery-migrate.min.js HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|35.189.215.158|www.axbom.se|301|178|35.197.249.238:80|0.005|0.005|GET /feed/axbom-se HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|66.249.85.10|nutroexpertos.com|200|96474|35.189.69.242:80|0.016|0.395|GET /wp-content/uploads/2015/06/Post-c%C3%B3mo-ense%C3%B1ar-a-cachorro-a-hacer-sus-necesidades-sobre-los-peri%C3%B3dicos.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|66.249.85.8|nutroexpertos.com|200|65231|35.189.69.242:80|0.040|0.428|GET /wp-content/uploads/2014/12/garrapatas-en-perros-2-484x330.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|35.197.192.76|gullfoss.is|200|6275|35.197.192.76:80|0.005|0.007|GET /wp-content/uploads/2016/07/Logo-Gullfoss_website-XI-1.png HTTP/1.0||
19/Sep/2017: 03:35:40 +0000|v1|54.82.184.78|thestar.ie|200|0|162.13.191.183:80|0.424|0.424|HEAD /about-us/1ds-niall-ill-find-the-next-mcilroy/ HTTP/1.1||
19/Sep/2017: 03:35:40 +0000|v1|218.90.137.18|laorcare.com|200|0|35.189.99.79:80|1.079|1.079|HEAD /wp-json/oembed/1.0/embed?url=http%3A%2F%2Flaorcare.com%2F HTTP/1.1||
19/Sep/2017: 03:35:40 +0000|v1|218.90.137.18|laorcare.com|200|2507|35.189.99.79:80|0.006|0.006|GET /wp-json/oembed/1.0/embed?url=http%3A%2F%2Flaorcare.com%2F HTTP/1.1||
Server: nginx
Date: Tue, 19 Sep 2017 03:35:41 GMT
Content-Type: application/rss+xml; charset=UTF-8
Connection: close
X-Cacheable: CacheAlways: feed
Cache-Control: max-age=600, must-revalidate
X-Cache: MISS
X-Cache-Group: bot
X-Pingback: http://www.estiethirionphotography.co.za/xmlrpc.php
Link: <http://www.estiethirionphotography.co.za/wp-json/>; rel="https://api.w.org/"
Link: <http://wp.me/p2ZY6I-Mb>; rel=shortlink
X-Type: feed
ETag: "1f5dd55566f2f1de600da749924ac5fb-gzip"
X-Pass-Why: 
Last-Modified: Fri, 27 Jan 2017 11:12:31 GMT

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	>
<channel>
	<title>Comments on: Fransua &#038; Anne-Louise wedding</title>
	<atom:link href="http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/</link>
	<description>Photography</description>
	<lastBuildDate>Fri, 27 Jan 2017 11:12:31 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.8.1</generator>
	<item>
		<title>By: nastassja harvey</title>
		<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/#comment-9616</link>
		<dc:creator><![CDATA[nastassja harvey]]></dc:creator>
		<pubDate>Thu, 27 Oct 2011 17:41:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.estiethirionphotography.co.za/?p=2987#comment-9616</guid>
		<description><![CDATA[sooo mooi estie! :)]]></description>
		<content:encoded><![CDATA[<p>sooo mooi estie! 🙂</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kathryn van Eck</title>
		<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/#comment-9609</link>
		<dc:creator><![CDATA[Kathryn van Eck]]></dc:creator>
		<pubDate>Wed, 26 Oct 2011 09:48:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.estiethirionphotography.co.za/?p=2987#comment-9609</guid>
		<description><![CDATA[Beautiful work! I love the softness of your images and how you captured the couples joy.]]></description>
		<content:encoded><![CDATA[<p>Beautiful work! I love the softness of your images and how you captured the couples joy.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
19/Sep/2017:03:35:41 +0000|v1|194.66.232.93|www.estiethirionphotography.co.za|200|1985|162.13.104.162:80|5.773|5.773|GET /2011/10/fransua-anne-louise-wedding/feed/ HTTP/1.0||

which comes out as a malformed CDXJ record, with a fragment of the first log line in the "status" field:

za,co,estiethirionphotography)/2011/10/fransua-anne-louise-wedding/feed 20170919033535 {"url": "http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/", "mime": "application/rss+xml", "status": "+0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET", "digest": "sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU", "length": "2785", "offset": "861793820", "filename": "test.warc.gz"}

But I think it'd be better to skip/drop these records?
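One way to skip such records could be to check the first line of the payload before indexing. The following is a minimal sketch using only the standard library; `parse_status` and `should_index` are hypothetical helper names, not part of cdxj-indexer or warcio, and how they would be wired into the indexer is left open.

```python
import re

# Matches a well-formed HTTP status line such as "HTTP/1.1 200 OK".
# Hypothetical helper for illustration; not an existing cdxj-indexer API.
_STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)? (\d{3})(?: .*)?$")


def parse_status(first_line):
    """Return the 3-digit status code as a string, or None if the
    line does not look like an HTTP status line."""
    match = _STATUS_LINE.match(first_line.strip())
    return match.group(1) if match else None


def should_index(first_line):
    """Decide whether a response record is sane enough to index."""
    return parse_status(first_line) is not None
```

With the record above, the first payload line is an access-log entry, so `should_index` would return False and the record could be dropped (or logged) instead of producing a CDXJ line with a bogus status.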

anjackson, Dec 07 '22 13:12

We also have redirects that point back to the web archive, which PyWB is unable to deal with (webrecorder/pywb#591). It would be great to be able to filter out records because they redirect to a particular host (www.webarchive.org.uk in this case).
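A filter along these lines might look like the sketch below. It assumes the indexer can hand over the HTTP status and the Location header of each response record; `redirects_to_host` is a hypothetical function name and no such option is known to exist in cdxj-indexer today.

```python
from urllib.parse import urlsplit


def redirects_to_host(status, location, blocked_hosts):
    """Return True if this is a 3xx response whose Location header
    points at one of the blocked hosts (hypothetical helper)."""
    if not status.startswith("3") or not location:
        return False
    # urlsplit lower-cases the hostname; relative redirects have none.
    host = urlsplit(location).hostname
    return host is not None and host in blocked_hosts


# Example: drop records that redirect back to the web archive itself.
BLOCKED = {"www.webarchive.org.uk"}
```

Records for which `redirects_to_host(status, location, BLOCKED)` is true would then be skipped at indexing time, so PyWB never sees them.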

anjackson, Jan 30 '23 14:01