
Fall back on catch-all direct media rip for links to .jpg, .gif, .gifv, .webm, .mp4, etc.

metaprime opened this issue 8 years ago • 4 comments

There's probably no need to have specific logic for each domain.

Ripme should trivially be able to download any media from a direct link to that media file. If ripping in this way, we might need to simply output to an images or videos folder because we have no other information, but when these media are linked from e.g. reddit, no matter the host, we should be able to support it.

This would aid with direct links to imgur posts, eroshare posts, etc. (see #399)

File extensions to support (probably will add more):

jpg
jpeg
png
gif
gifv
webm
mp4

MIME types to support (need to fill in this list):

image/jpeg
image/png
...

Unsure whether the best way to detect the type of media is by MIME type or by file extension in the URL, or some combination of the two.
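For the extension-based route, a minimal check could be a case-insensitive regex over the URL; this is only a sketch, and the class and method names are illustrative rather than anything in ripme today:

```java
import java.util.regex.Pattern;

public class MediaUrlCheck {
    // Case-insensitive match on the extensions listed above,
    // tolerating a trailing query string.
    private static final Pattern MEDIA_EXT = Pattern.compile(
            ".*\\.(jpe?g|png|gif|gifv|webm|mp4)(\\?.*)?$", Pattern.CASE_INSENSITIVE);

    public static boolean looksLikeDirectMedia(String url) {
        return MEDIA_EXT.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDirectMedia("https://i.imgur.com/foo123.jpg")); // true
        System.out.println(looksLikeDirectMedia("https://example.com/page.html"));  // false
    }
}
```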

metaprime avatar Dec 31 '16 00:12 metaprime

I don't know exactly how it's done (I only know it from Firefox), but even if you don't specify an extension for an image on disk or in the code, the type is auto-detected; valid PNGs, at least, are detected perfectly. Either the magic-number detection system widely used on Linux (the file command is one common frontend to it) is used, or a series of distinct tests are run and the one that raises the fewest complaints wins.

Judging by British/American/Finnish Google results, Java has long had various built-in methods for auto-detecting the MIME type of a URL, but they all require requesting the file and reading its header before deciding. Wikipedia has categories that group the different image formats together, and most format articles state the format's MIME type, which should make building these lists easier.
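One of those long-standing Java mechanisms is java.net.URLConnection's content-type guessing, which can work either from the file name or from the leading bytes of the data. A small sketch (the wrapper class is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URLConnection;

public class MimeGuess {
    // Guess from the file name alone, using the JDK's built-in content-type table.
    public static String guessFromName(String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }

    // Guess by sniffing the magic bytes at the start of the data.
    public static String guessFromBytes(byte[] head) {
        try {
            return URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(head));
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(guessFromName("photo.png"));               // image/png
        byte[] pngMagic = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(guessFromBytes(pngMagic));                 // image/png
    }
}
```

Note that guessFromBytes confirms the point above: the data itself has to be fetched (at least its first few bytes) before a decision can be made.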

It's possible, though not very likely, that a site would resort to giving files randomized extensions to evade the simplest bots reading the HTML (some leave the extension out altogether, as we saw in previous tickets). Facebook, for example, uses randomized names for its files and for references to other files inside plain-text resources, but not randomized file extensions or randomized plain-text code.

IMO generic code analysis would be the best first line of defence, with auto-detection run on suspect files as a second line whenever there's reason to believe the analysis was inconclusive.

rautamiekka avatar Dec 31 '16 16:12 rautamiekka

I think simply trusting the file extension in the URL would be enough to decide to try downloading it. If the MIME type of the response matches, the job is done. Sometimes websites will use a URL like page.jpg to actually deliver a page with that single image embedded in it. In that case, we could fall back on the single-image-per-page detection mechanism discussed in #361 (but that is a further enhancement that can come later). If we request a URL that looks like a direct image link and get back a non-image response (e.g. HTML, text, JSON), we don't save it, because it's not an image.

Due to general rate-limiting concerns it makes sense NOT to request a URL unless we're sure we would use it. For sites where the direct image links do not use recognized extensions, I think it should be the ripper's responsibility to handle those links and do the right thing, as they already do now.

Basically my thought here is that if a URL cannot be handled by any ripper as it is, we could fall back on generic image downloader behavior.

Pseudocode for an algorithm that uses the URL's file extension first and then confirms with the MIME type:

if (url.matches(imageExtensionRegex)) {
  response = request(url);
  if (isImageMimeType(response.mimeType)) {
    // save with correct extension -- handles sites like imgur
    // that give e.g. png for some jpg links
    saveFile(response.data,
      ripRoot + convertExt(url.getFileNamePart(),
        mapMimeToExt(response.mimeType)));
  }
}
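Fleshed out in Java, that pseudocode might look roughly like the following. The MIME-to-extension map and the helper names are assumptions for illustration, not ripme's actual API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.regex.Pattern;

public class DirectMediaRipper {
    private static final Pattern MEDIA_EXT = Pattern.compile(
            ".*\\.(jpe?g|png|gif|gifv|webm|mp4)$", Pattern.CASE_INSENSITIVE);

    // Assumed mapping from the server-reported MIME type to the on-disk extension.
    private static final Map<String, String> MIME_TO_EXT = Map.of(
            "image/jpeg", "jpg", "image/png", "png", "image/gif", "gif",
            "video/webm", "webm", "video/mp4", "mp4");

    // Rename e.g. "foo123.png" to "foo123.jpg" when the server says image/jpeg;
    // returns null when the reported type is not a supported media type.
    public static String renameForMime(String fileName, String mimeType) {
        String ext = MIME_TO_EXT.get(mimeType);
        return ext == null ? null : fileName.replaceAll("\\.[^.]+$", "." + ext);
    }

    public static void rip(String url, Path ripRoot) throws IOException {
        if (!MEDIA_EXT.matcher(url).matches()) return;     // not a direct media link
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        String name = renameForMime(url.substring(url.lastIndexOf('/') + 1),
                                    conn.getContentType());
        if (name == null) return;                          // server sent HTML/JSON/etc.
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, ripRoot.resolve(name));         // save with corrected extension
        }
    }
}
```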

There may be some library function that has a heuristic to detect the true type of a blob of data. AFAIK, file extensions on the file system are used by the operating system to choose a default program to try to open them. Many applications on both Linux and Windows will then actually look at the data in the file to determine whether it is a type that the program understands. For images, see https://oroboro.com/image-format-magic-bytes/ -- for executable files, when *nix tries to execute them, it can use the #! line to determine which program should be used to execute them. It's all software tricks and varying levels of convenience when it comes down to it. Some such functionality might be exposed by the OS to Java, but more likely Java has a library implementation or there is a third party library to handle it.
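Doing the magic-byte check by hand is also only a few lines; the signatures below are the well-known ones for PNG, JPEG, and GIF, and a real implementation would cover more formats. (Since Java 7, java.nio.file.Files.probeContentType is another built-in option for files already on disk.)

```java
public class MagicBytes {
    // Minimal magic-number sniffing for a few image formats (sketch; not exhaustive).
    public static String sniff(byte[] b) {
        if (b.length >= 4 && (b[0] & 0xFF) == 0x89
                && b[1] == 'P' && b[2] == 'N' && b[3] == 'G') {
            return "image/png";                   // \x89PNG
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xD8) {
            return "image/jpeg";                  // JPEG SOI marker
        }
        if (b.length >= 4 && b[0] == 'G' && b[1] == 'I' && b[2] == 'F' && b[3] == '8') {
            return "image/gif";                   // GIF87a / GIF89a
        }
        return null;                              // unknown -- not a supported image
    }
}
```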

metaprime avatar Dec 31 '16 19:12 metaprime

If ripping in this way, we might need to simply output to an images or videos folder because we have no other information

How about using the domain of the service and the filename? For example: https://i.imgur.com/foo123.jpg -> imgur_foo123.jpg? This way we can preserve (at least to some extent) the actual service and provide backwards compatibility if a service becomes officially supported in a later release.
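A sketch of that naming rule; the way the host is trimmed down to a service name (taking the second-to-last label, so "i.imgur.com" becomes "imgur") is an assumption for illustration:

```java
import java.net.URI;

public class RipName {
    // Build e.g. "imgur_foo123.jpg" from "https://i.imgur.com/foo123.jpg".
    public static String nameFor(String url) {
        URI u = URI.create(url);
        String[] labels = u.getHost().split("\\.");
        // Assumption: the second-to-last host label is the service name.
        String service = labels.length >= 2 ? labels[labels.length - 2] : labels[0];
        String path = u.getPath();
        return service + "_" + path.substring(path.lastIndexOf('/') + 1);
    }
}
```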

ghost avatar Jan 01 '17 00:01 ghost

Agreed! I still think grouping single-image downloads under a folder like ./rips/images, or even splitting them up by domain instead of putting the domain directly in the filename, might work too: ./rips/imgur/foo123.jpg or ./rips/images/imgur/foo123.jpg

metaprime avatar Jan 01 '17 20:01 metaprime