500px: Downloaded images have watermark
I downloaded some files from 500px, but all of them have a watermark on them.

What URL are you ripping?
What version of ripme are you using? Because both of those links throw "unexpected URL format" errors for me.
I'm using version 1.4.6.
That's odd, I'm on the same version. Are you sure those are the links you're entering into ripme? Taking a quick look at the code, the ripper shouldn't see them as valid.
Please watch this video: https://streamable.com/3zo5g
The link I want to download photos from: https://500px.com/david-foto
So the URL you're ripping is http://500px.com/david-foto. After messing with a few test pages, it looks like 500px tries to watermark image links that are likely to be grabbed by bots or hotlinked, but there seems to be a workaround if you use the Twitter image link instead. I'll get on fixing it in ripme.
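For what it's worth, here is a minimal jsoup sketch of that workaround. The meta[name=twitter:image] selector is an assumption on my part (500px may serve the tag under property= instead), and the profile URL is just the one from this thread:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TwitterMetaExample {
    public static void main(String[] args) throws Exception {
        // Fetch the profile page the same way a browser would.
        Document doc = Jsoup.connect("https://500px.com/david-foto").get();
        // Twitter cards usually expose the image via a meta tag; both
        // attribute spellings are checked, since the exact one is an assumption.
        Element meta = doc.selectFirst("meta[name=twitter:image], meta[property=twitter:image]");
        if (meta != null) {
            System.out.println(meta.attr("content"));
        }
    }
}
```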
Thank you!
Let me chime in for a second...
-
Some 500px profile pages have watermarks (from the site itself); that is an account-specific 500px setting. There is no way around that: I think I've already tried every userscript, extension, and program, and even tried it manually, but you can't access other versions of the image, let alone the real original. But https://500px.com/david-foto is not one of these profiles, so in this case the watermark issue should be fixable.
-
I've seen other userscripts etc. try to use the Twitter meta element, or the good old property='og:image', but this is a bad idea for 500px, because any photo marked as "Adult Content" (sometimes even very slight nudity, ridiculous) gets you this placeholder instead: https://500px.com/graphics/nude/img_3.png (a trivial guard for this is sketched below)
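Sketched in Java; the placeholder URL is the one above, and the sample URL in main is made up:

```java
public class PlaceholderGuard {
    // Placeholder that 500px serves in place of "Adult Content" photos.
    private static final String NUDE_PLACEHOLDER =
            "https://500px.com/graphics/nude/img_3.png";

    /** Returns true only if the meta-tag URL points at an actual photo. */
    static boolean isUsableMetaUrl(String url) {
        return url != null && !url.equals(NUDE_PLACEHOLDER);
    }

    public static void main(String[] args) {
        System.out.println(isUsableMetaUrl(NUDE_PLACEHOLDER));                // false
        System.out.println(isUsableMetaUrl("https://example.com/photo.jpg")); // true
    }
}
```
-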
I checked the site source, and it still looks the same to me, so all the URLs that are 'hidden' in there somehow are still present.
IIRC, RipMe uses the 500px API to get the images (in a profile, gallery, etc.), and then fetches the HTML document and tries to extract the desired information.
Let me know about the approach you're trying here; maybe I can help eventually.
@Hrxn I've just run into the "adult image" issue, and I'm planning on just falling back to the API for those images (and them having watermarks) unless I can find a way around it.
Edit: ~~After a bit of looking, I can't find a way to download adult images without using the API.~~
Edit 2: Never mind, it looks like the URL to the full-sized image can be found in the page's JavaScript, but parsing it with jsoup will be a pain.
Yeah... the full-sized URLs are inside the script element that starts with window.PxPreloadedData. That didn't change at all.
But they apparently now have a new script element inside their HTML that contains a function with this:
`PHOTO_GRID_IMAGE_SIZES: ["1", "2", "32", "31", "33", "34", "35", "36", "2048", "4", "14"],`
Which probably explains why FivehundredpxRipper.java#L288 fails now.
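Getting at that script element with jsoup isn't the hard part; the string handling afterwards is. A rough sketch, where only the window.PxPreloadedData prefix comes from the page source described above and the rest (class name, output) is illustrative:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PreloadScriptFinder {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://500px.com/david-foto").get();
        for (Element script : doc.select("script")) {
            // data() returns the raw, unparsed contents of a script element.
            String js = script.data();
            if (js.trim().startsWith("window.PxPreloadedData")) {
                // Hand the raw JS off to plain string searching (see below);
                // jsoup itself can't evaluate or parse JavaScript.
                System.out.println("found preload script, " + js.length() + " chars");
            }
        }
    }
}
```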
Here is what HandyImage does:
case "500px.com":
find_text_in_scripts('"https_url":"', '"', false);
break;
calls this
function find_text_in_scripts(a, b, o, h)
{
var s = document.getElementsByTagName("script");
for(var c=0;c<s.length;c++)
{
if(h && s[c].innerHTML.indexOf(h) != -1){s[c].innerHTML = s[c].innerHTML.substring(0, s[c].innerHTML.indexOf(h));}
var start_pos = o ? s[c].innerHTML.indexOf(a) : s[c].innerHTML.lastIndexOf(a);
if(start_pos == -1){continue;}
start_pos += a.length;
i = s[c];
i.src = decodeURIComponent(s[c].innerHTML.substring(start_pos,s[c].innerHTML.indexOf(b,start_pos)).split("\\").join("")); // split\join fix for stupidfox GreaseMonkey
return true;
}
return false;
}
It uses lastIndexOf to get the right start_pos when searching for "https_url", which should hopefully land at the URL for the 2048 size ;)
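If it helps, here's a rough Java port of that trick. The "https_url" needle and the backslash stripping mirror the JS above; everything else (class name, sample data) is made up for illustration. I've left out the decodeURIComponent step; java.net.URLDecoder would cover it if the URLs turn out to be percent-encoded:

```java
public class HttpsUrlExtractor {
    /** Returns the last "https_url" value in the script text, or null. */
    static String lastHttpsUrl(String js) {
        String needle = "\"https_url\":\"";
        int start = js.lastIndexOf(needle);   // last occurrence, like the JS version
        if (start == -1) return null;
        start += needle.length();
        int end = js.indexOf('"', start);     // closing quote of the URL
        if (end == -1) return null;
        // JSON embedded in JS escapes slashes as \/ , so strip the backslashes,
        // same as the split("\\").join("") above.
        return js.substring(start, end).replace("\\", "");
    }

    public static void main(String[] args) {
        // Fabricated sample imitating two size entries; the last one "wins".
        String sample = "{\"https_url\":\"https:\\/\\/example.com\\/small.jpg\"},"
                      + "{\"https_url\":\"https:\\/\\/example.com\\/2048.jpg\"}";
        System.out.println(lastHttpsUrl(sample)); // https://example.com/2048.jpg
    }
}
```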