Greg Lindahl

182 comments by Greg Lindahl

Ah, yes, that's a good one. There are a lot of JSON request payloads out there with Content-Type text/plain. And a revisit would potentially have the same situation. continuation records...
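To illustrate the mislabeled-payload point, here is a minimal sketch of checking the body itself instead of trusting the declared Content-Type. The function names `looks_like_json` and `effective_media_type` are hypothetical and not taken from any project discussed here; the sketch only treats JSON objects and arrays as JSON, which is an assumption made to avoid misclassifying plain text such as `123`.

```python
import json


def looks_like_json(body: bytes) -> bool:
    """Return True if the payload parses as a JSON object or array."""
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    stripped = text.lstrip()
    # Assumption: only objects and arrays count, so bare scalars stay text.
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def effective_media_type(content_type: str, body: bytes) -> str:
    """Prefer what the body actually is over what the header claims."""
    declared = content_type.split(";", 1)[0].strip().lower()
    if declared != "application/json" and looks_like_json(body):
        return "application/json"
    return declared
```

With this sketch, a request declared as `text/plain; charset=utf-8` whose body is `{"a": 1}` would be classified as `application/json`, while an ordinary text body keeps its declared type.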

I'm in favor of a single field, comma-separated. Note that the clock has pretty much ticked out on this discussion... the minute that a large web player starts discriminating against...

Another example from the wild: wget generates WARCs without the space.

Another fully-working open-source example of complicated capture and playback is https://github.com/webrecorder/webrecorder

Just input encoding. I figure anyone who wants a different output for encoding should do their own work! But it's no surprise that input HTML pages don't always have UTF-8...
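As a rough illustration of handling input encoding only, the sketch below decodes raw HTML bytes by trying a declared encoding first, then UTF-8, then Windows-1252, and finally falling back to lossy UTF-8. The function name `decode_html` and the fallback order are assumptions for this example, not the behavior of the tool under discussion.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode HTML bytes to str, tolerating a wrong or missing charset."""
    # Assumed fallback order: declared charset, then UTF-8, then cp1252.
    for enc in (declared, "utf-8", "cp1252"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not recognize.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

The result is always a Python `str`; choosing an output encoding is left to the caller, in line with the comment above.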

Is that valid HTML? Shouldn't it be: ``` 98 T.C. 14198 T.C. 141 ```

If it's non-standard HTML I would try to pre-process it into standard HTML... if it's just this one case it's not too hard.

@FabioKn can you show us how you triggered this error? Was it reading a WARC file, etc.? Oh, I see that you work with @white-gecko and that is the trigger....

The main purpose of the news index is to checksum it for integrity reasons. I haven't given any thought to making the parquet available, where people might want a single...