Christopher Foo comments

Results 14 comments of Christopher Foo

Send `Cookie: over18=1` to all reddit URLs

Implementation note: `pipeline.py` (15ae3ca6a6831f2b1ae366a58d5620474f5b3d2c) already adds this cookie for the top-level URL.
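A minimal sketch of the idea, using only the Python standard library: attach `Cookie: over18=1` to every request whose host is reddit.com or one of its subdomains. The helper names `is_reddit_url` and `build_request` are hypothetical and are not taken from `pipeline.py`; the actual project may apply the cookie in a different way.

```python
import urllib.request
from urllib.parse import urlsplit


def is_reddit_url(url):
    """Return True if the URL's host is reddit.com or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "reddit.com" or host.endswith(".reddit.com")


def build_request(url):
    """Build a request, adding the over18 cookie for reddit URLs only.

    Hypothetical helper for illustration; not part of pipeline.py.
    """
    request = urllib.request.Request(url)
    if is_reddit_url(url):
        request.add_header("Cookie", "over18=1")
    return request
```

Matching on the parsed hostname rather than on a substring of the URL keeps unrelated hosts such as `notreddit.com` from receiving the cookie.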

Extract URLs from SWFs

An umbrella issue for the actual crawling implementation is at chfoo/wpull#74. This issue should cover adding a future Wpull option to the arguments.

The viewer is sometimes missing files that are available on IA

Looking at it now, I guess it should be ordering by `oai_updatedate`. Keeping track of the last successful retrieval date is a good idea. Edit: I tried running it on...

DAEParser throws exception on Std::parseFloat on Neko target

So it seems there aren't any noticeable problems with removing the extra `parseFloat`, and the LoadDAE example works. However, I discovered numerical issues with COLLADA files exported from Blender using...

DAEParser.supportsData assumes ByteArrayData.position is 0

Thank you, I tested it on Neko, C++, and HTML5 and I can confirm it works now.

Feature: extract WARCs specified with index/length

Also, Warcat currently uses Python's built-in HTTP library, which does not handle the edge cases that web browsers do.

No mention of 'resource' in list at verify_refers_to

Good catch, it's not supposed to be missing that one. (As an FYI, this project was written based on the draft WARC 1.0 spec. I haven't updated the project since...

Add easy way to iterate over warc records

Sure, I think that sounds great!

Support older Python 2.7

Thanks, I'll be happy to accept pull requests. Please take your time to finish porting the code. Correctness is much more important than version compatibility.

Support warnings when WARC field name casing doesn't match hanzo's warc-tools.

See also:
- https://bitbucket.org/hanzo/warc-tools/issue/11
- https://bitbucket.org/rajbot/warc-tools/issue/1
- https://github.com/internetarchive/CDX-Writer/issues/3