link_thumbnailer check content-type via headers first, instead of loading entire document into ram

Open maia opened this issue 9 years ago • 2 comments

LinkThumbnailer::Processor should query the HTTP response headers first and decide, based on content-length and content-type, whether it should run http.request(url) at all.

This makes it possible to raise LinkThumbnailer::FormatNotSupported without first downloading the entire document into memory, and also to raise a new LinkThumbnailer::FileSizeExceeded when a document is too large.
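A minimal sketch of such a header-only lookup, using a HEAD request with Ruby's standard Net::HTTP. The helper name `fetch_headers` and its `timeout` parameter are made up for illustration and are not part of link_thumbnailer; redirects and servers that reject HEAD are not handled here.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not link_thumbnailer API): send a HEAD request and
# return the response headers as a Hash with lowercase header names.
# A HEAD response carries no body, so nothing large is pulled into memory.
def fetch_headers(url, timeout: 5)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout,
                  read_timeout: timeout) do |http|
    response = http.head(uri.request_uri)
    headers = {}
    # each_header yields lowercase names and comma-joined values.
    response.each_header { |name, value| headers[name] = value }
    headers
  end
end
```

The returned hash can then be inspected for "content-type" and "content-length" before deciding whether to issue the full GET request.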

(Among 20K URLs extracted from tweets, I have not found a single text/html document larger than 600KB; the larger documents are all image/*, application/pdf, video/* or audio/*, so a limit of 1MB should be safe.)

Example: one of the URLs that LinkThumbnailer (run by a Heroku worker) attempted to parse was a 100MB file:

http://cdn-storage.br.de/MUJIuUOVBwQIbtChb6OHu7ODifWH_-by/_-iS/9-bp5-8G/8e5300aa-b158-4ded-8ceb-30a1105f806f_3.mp3

This is a problem when your available RAM is limited and you run multiple threads in parallel.

The HTTP response headers of the requested document should make it possible to skip this file before downloading it:

{ "server"         => "Apache",
  "etag"           => "\"4bae10f548f0b7671b76b009c50a3fb5:1452175291\"",
  "last-modified"  => "Thu, 07 Jan 2016 14:01:31 GMT",
  "accept-ranges"  => "bytes",
  "content-length" => "100665096",
  "content-type"   => "audio/mpeg",
  "date"           => "Sat, 09 Jan 2016 21:24:30 GMT",
  "connection"     => "keep-alive" }
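As a rough illustration of the decision, here is a sketch that takes a headers hash like the one above and raises before any download happens. The module name `LinkThumbnailerSketch`, the method `check_headers!`, the accepted type list, and reading the 1MB limit as 1,000,000 bytes are all assumptions; only the two error names come from this issue, and they are redefined locally so the snippet runs on its own.

```ruby
# Stand-alone sketch; the error classes are local stand-ins for
# LinkThumbnailer::FormatNotSupported and the proposed FileSizeExceeded.
module LinkThumbnailerSketch
  class FormatNotSupported < StandardError; end
  class FileSizeExceeded < StandardError; end

  MAX_BYTES = 1_000_000 # assumed reading of the suggested 1MB limit
  ACCEPTED_TYPES = %w[text/html application/xhtml+xml].freeze # assumption

  # headers: Hash with lowercase header names and string values.
  # Returns true when the document looks safe to download.
  def self.check_headers!(headers, max_bytes: MAX_BYTES)
    # Drop parameters such as "; charset=utf-8" before comparing.
    type = headers["content-type"].to_s.split(";").first.to_s.strip.downcase
    # A missing content-type is treated as unsupported in this sketch.
    unless ACCEPTED_TYPES.include?(type)
      raise FormatNotSupported, "unsupported content-type: #{type.inspect}"
    end

    # content-length may be absent (e.g. chunked responses); only enforce
    # the limit when the server reports a size.
    length = headers["content-length"]
    if length && length.to_i > max_bytes
      raise FileSizeExceeded, "#{length} bytes exceeds limit of #{max_bytes}"
    end

    true
  end
end
```

With the headers shown above, the content-type check fails first, so the 100MB mp3 would be rejected with FormatNotSupported without its body ever being requested. Because content-length is optional, a cap on the bytes actually read would still be needed as a fallback.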

Jan 10 '16 11:01 maia

Sounds like a great idea. Feel free to make a PR for it. I'll also have a look myself at how we can integrate this.

Jan 18 '16 09:01 gottfrois

This would need the following issue to be fixed first: https://github.com/typhoeus/typhoeus/issues/511#issuecomment-192720254

Mar 07 '16 14:03 gottfrois
