
500px: Downloaded images have watermark

Open gomrcong opened this issue 8 years ago • 11 comments

I downloaded some files from 500px, but all of them have a watermark on them. Ex: 2017-03-25_135743

gomrcong avatar Mar 25 '17 07:03 gomrcong

What URL are you ripping?

cyian-1756 avatar Mar 25 '17 07:03 cyian-1756

link 1, link 2 ...

gomrcong avatar Mar 25 '17 07:03 gomrcong

What version of ripme are you using? Both of those links throw unexpected URL format errors for me.

cyian-1756 avatar Mar 25 '17 07:03 cyian-1756

I'm using ver 1.4.6

gomrcong avatar Mar 25 '17 07:03 gomrcong

That's odd, I'm on the same version. Are you sure those are the links you're entering into ripme? From a quick look at the code, the ripper shouldn't see them as valid.

cyian-1756 avatar Mar 25 '17 07:03 cyian-1756

Please watch this video: https://streamable.com/3zo5g. The link I want to download photos from: https://500px.com/david-foto

gomrcong avatar Mar 25 '17 07:03 gomrcong

So the URL you're ripping is http://500px.com/david-foto. After messing with a few test pages, it looks like 500px is watermarking image links that are likely to be grabbed by bots or hotlinked, but there seems to be a workaround if you use the Twitter link instead. I'll get on fixing it in ripme.

cyian-1756 avatar Mar 25 '17 07:03 cyian-1756

Thank you!

gomrcong avatar Mar 25 '17 07:03 gomrcong

Let me chime in for a second.

  1. Some 500px profile pages have watermarks (the site's own); that is a per-account setting on 500px. There is no way around it: I think I've already tried every userscript, extension, and program, and even tried manually, but you can't access other versions of the image, let alone the real original. But https://500px.com/david-foto is not one of those accounts, so in this case the watermark problem should be fixable.

  2. I've seen other userscripts etc. try to use the Twitter meta element, or the good old property='og:image', but that is a bad idea for 500px, because any photo marked as "Adult Content" (sometimes even very slight nudity, which is ridiculous) gets you this placeholder instead: https://500px.com/graphics/nude/img_3.png

  3. I checked the site source; it still looks the same to me, so all the URLs that are somehow 'hidden' in there are still present.

IIRC, RipMe uses the 500px API to get the images (in a profile, gallery, etc.), and then fetches the HTML document and tries to extract the desired information.

Let me know about the approach you're trying here; maybe I can help.

Hrxn avatar Mar 25 '17 08:03 Hrxn

@Hrxn I've just run into the "adult image" issue, and I'm planning on just falling back to the API for those images (watermarks and all) unless I can find a way around it

Edit: ~~after a bit of looking I can't find a way to download adult images without using the api~~

Edit_2: nvm, it looks like the URL to the full-sized image can be found in the JavaScript of the page, but parsing it with jsoup will be a pain
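One way to avoid heavy jsoup parsing would be to treat the script contents as a plain string and match the key with a regex. This is only a sketch under the assumption (from the HandyImage snippet below) that the page embeds JSON like `"https_url":"https:\/\/…"` inside a script element; the function name is mine:

```javascript
// Sketch: extract the last "https_url" value from raw page HTML with a regex
// instead of parsing the script as a document. Assumes JSON-escaped URLs
// like {"https_url":"https:\/\/drscdn.500px.org\/photo\/..."} in a <script>.
function extractHttpsUrl(html) {
  const matches = html.match(/"https_url":"((?:[^"\\]|\\.)*)"/g);
  if (!matches) return null;
  // Take the last occurrence; later entries tend to be the larger sizes.
  const last = matches[matches.length - 1];
  const raw = last.slice('"https_url":"'.length, -1);
  // Undo JSON-style backslash escapes (e.g. \/ inside the URL).
  return raw.replace(/\\(.)/g, '$1');
}
```

The same idea translates to Java with `Pattern`/`Matcher` on the raw page source, sidestepping jsoup entirely for this step.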

cyian-1756 avatar Mar 25 '17 08:03 cyian-1756

Yeah, inside the script element that starts with window.PxPreloadedData. That didn't change at all.

But they apparently now have a new script element inside their HTML that contains a function with this: PHOTO_GRID_IMAGE_SIZES: ["1", "2", "32", "31", "33", "34", "35", "36", "2048", "4", "14"],

Which probably explains why FivehundredpxRipper.java#L288 fails now.

Here is what HandyImage does:

case "500px.com":
    find_text_in_scripts('"https_url":"', '"', false);
    break;

which calls this function:

// a: start marker, b: end marker,
// o: if true, search from the start (indexOf) instead of the end (lastIndexOf),
// h: optional marker at which to truncate the script text before searching.
function find_text_in_scripts(a, b, o, h)
{
    var s = document.getElementsByTagName("script");
    for (var c = 0; c < s.length; c++)
    {
        // Optionally cut the script contents off at marker h.
        if (h && s[c].innerHTML.indexOf(h) != -1) { s[c].innerHTML = s[c].innerHTML.substring(0, s[c].innerHTML.indexOf(h)); }
        // By default, take the LAST occurrence of the start marker.
        var start_pos = o ? s[c].innerHTML.indexOf(a) : s[c].innerHTML.lastIndexOf(a);
        if (start_pos == -1) { continue; }
        start_pos += a.length;
        i = s[c];
        // Take the text between the markers, strip backslashes, and URL-decode it.
        i.src = decodeURIComponent(s[c].innerHTML.substring(start_pos, s[c].innerHTML.indexOf(b, start_pos)).split("\\").join("")); // split/join fix for stupidfox GreaseMonkey
        return true;
    }
    return false;
}

It uses lastIndexOf to get the right start_pos when searching for "https_url", which should hopefully land on the URL for size 2048 ;)
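A toy example (made-up data, not the real 500px payload) of why lastIndexOf matters here: with several https_url entries in source order of increasing size, the last one is the biggest.

```javascript
// Toy script content (hypothetical data): three size entries, biggest last.
const script = '{"size":1,"https_url":"small.jpg"},{"size":35,"https_url":"mid.jpg"},{"size":2048,"https_url":"big.jpg"}';
const marker = '"https_url":"';

// indexOf would land on the first (smallest) entry:
let pos = script.indexOf(marker) + marker.length;
const firstUrl = script.substring(pos, script.indexOf('"', pos)); // "small.jpg"

// lastIndexOf lands on the final entry, i.e. the 2048 one:
pos = script.lastIndexOf(marker) + marker.length;
const lastUrl = script.substring(pos, script.indexOf('"', pos)); // "big.jpg"
```

This only works if the sizes are always emitted in ascending order, which is an assumption worth verifying against a few real pages.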

Hrxn avatar Mar 25 '17 09:03 Hrxn