link_thumbnailer
Check content-type via headers first, instead of loading the entire document into RAM
LinkThumbnailer::Processor should query the HTTP response headers and decide, based on content-length and content-type, whether it should run http.request(url).
This makes it possible to raise LinkThumbnailer::FormatNotSupported without downloading the entire document into memory first, and also to raise a new LinkThumbnailer::FileSizeExceeded when a document is too large.
(In 20K URLs extracted from tweets I have not found a single text/html document larger than 600KB; the documents above that size are all image/*, application/pdf, video/* or audio/*, so a limit of 1M should be safe.)
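A minimal sketch of the proposed check, using only Ruby's standard library. It is not LinkThumbnailer's actual code: the module name HeaderCheckSketch, the methods check_headers! and fetch_headers, and the two error classes are stand-ins invented for illustration, and the 1M limit is the value suggested above. It assumes a headers hash with lower-cased keys and string values.

```ruby
require "net/http"
require "uri"

# Hypothetical module; names below are illustrative, not the gem's API.
module HeaderCheckSketch
  # Stand-ins for the error classes discussed in this issue.
  class FormatNotSupported < StandardError; end
  class FileSizeExceeded < StandardError; end

  MAX_BYTES = 1_000_000 # the "1M" limit suggested above (assumption)

  # Raises if the headers describe a document that should not be
  # downloaded; returns true otherwise. A missing content-type or
  # content-length is let through, since neither header is guaranteed.
  def self.check_headers!(headers, max_bytes: MAX_BYTES)
    content_type = headers["content-type"].to_s.downcase
    unless content_type.empty? || content_type.start_with?("text/html")
      raise FormatNotSupported, "unsupported content-type: #{content_type}"
    end

    length = headers["content-length"]
    if length && length.to_i > max_bytes
      raise FileSizeExceeded, "content-length #{length} exceeds #{max_bytes}"
    end

    true
  end

  # Issues a HEAD request so that only the headers are transferred,
  # and flattens Net::HTTP's array values into plain strings.
  def self.fetch_headers(url)
    uri = URI.parse(url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      response = http.head(uri.request_uri)
      response.to_hash.transform_values { |values| values.join(", ") }
    end
  end
end
```

Calling HeaderCheckSketch.check_headers!(HeaderCheckSketch.fetch_headers(url)) before the full request would raise for an audio/* or oversized response and return true for a small text/html page. Servers that omit content-length (for example with chunked transfer encoding) or that answer HEAD requests incorrectly would still need a size guard while streaming the body.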
Example: one of the URLs LinkThumbnailer (run by a Heroku worker) attempted to parse was a 100MB file:
http://cdn-storage.br.de/MUJIuUOVBwQIbtChb6OHu7ODifWH_-by/_-iS/9-bp5-8G/8e5300aa-b158-4ded-8ceb-30a1105f806f_3.mp3
This is a problem when your available RAM is limited and you run multiple threads in parallel.
The HTTP response headers of the requested document should make it possible to skip this file before downloading it:
{ "server" => "Apache",
"etag" => "\"4bae10f548f0b7671b76b009c50a3fb5:1452175291\"",
"last-modified" => "Thu, 07 Jan 2016 14:01:31 GMT",
"accept-ranges" => "bytes",
"content-length" => "100665096",
"content-type" => "audio/mpeg",
"date" => "Sat, 09 Jan 2016 21:24:30 GMT",
"connection" => "keep-alive" }
Sounds like a great idea. Feel free to make a PR for it. I'll have a look myself how we can integrate this.
This would need the following issue to be fixed first: https://github.com/typhoeus/typhoeus/issues/511#issuecomment-192720254