500px now rips non-watermarked images

Open cyian-1756 opened this issue 7 years ago • 8 comments

The 500px ripper now rips images without a watermark on them, closing issue #491. There are still some issues with the ripper (it takes a long while to start ripping and doesn't save the image titles), but those can be fixed later.

Test link: http://500px.com/david-foto

cyian-1756 avatar Mar 25 '17 12:03 cyian-1756

My log btw: https://pastebin.com/ZpCgqdFC

metaprime avatar Apr 25 '17 10:04 metaprime

@metaprime

Looks good overall. Still managed to get an adult content placeholder image

Also it looks like after one rip of the example link I exceeded the rate limit, so I can't test again.

It looks like there have been some changes to the site since I wrote the ripper. I'll get on fixing these.

cyian-1756 avatar Apr 25 '17 10:04 cyian-1756

Maybe it's best to avoid using images = doc.select("meta[property=og:image]"); completely, so we don't rely on <meta og:image... at all.

Then this check can be discarded: if (imageURL.contains("https://500px.com/graphics/nude/img_3"))

Because this placeholder URL could be different, or could change any time.

Instead, always extract the target URL(s) from here:

for (Element script : doc.select("head > script")) {
    if (script.html().contains("window.PxPreloadedData")) {
        // ........
    }
}

Because that script element with window.PxPreloadedData should always be present.
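To make the suggestion above a bit more concrete, here is a minimal sketch of pulling candidate URLs out of the body of that script element. It deliberately avoids assuming anything about the structure of `window.PxPreloadedData`: it only assumes the payload is JSON-like text in which URLs appear as double-quoted strings, possibly with escaped slashes (`\/`). The class name `PreloadedDataUrls` and the method `extractUrls` are hypothetical and not part of ripme. In the ripper, the input string would be whatever `script.html()` returns inside the loop shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ripme.
final class PreloadedDataUrls {
    // Marker named in the thread; the payload format itself is an assumption.
    static final String MARKER = "window.PxPreloadedData";

    // Matches a double-quoted http(s) URL, tolerating JSON-escaped slashes (\/).
    private static final Pattern QUOTED_URL =
            Pattern.compile("\"(https?:(?:\\\\?/){2}[^\"\\s]+)\"");

    // Returns every quoted URL found in the script body, with "\/" unescaped.
    // Returns an empty list if the marker is missing.
    static List<String> extractUrls(String scriptBody) {
        List<String> urls = new ArrayList<>();
        if (scriptBody == null || !scriptBody.contains(MARKER)) {
            return urls;
        }
        Matcher m = QUOTED_URL.matcher(scriptBody);
        while (m.find()) {
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }
}
```

This returns every URL in the payload, not just image URLs, so the ripper would still need to pick out the entries it actually wants. Which keys or URL patterns identify the full-size image is something that has to be checked against the live page.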

Hrxn avatar Apr 25 '17 16:04 Hrxn

@cyian-1756 any update on this one?

metaprime avatar Aug 11 '17 09:08 metaprime

They implemented some insane rate limiting (I was still getting IP banned after waiting 10 secs between requests), so I haven't really been able to do much testing (as I get pretty much insta-banned).

cyian-1756 avatar Aug 11 '17 11:08 cyian-1756

Maybe we need to make the wait interval long and slightly randomized to get around bot-detection?

metaprime avatar Aug 12 '17 10:08 metaprime

^ 10 seconds and getting insta-banned is already a lot, so the base waiting time would have to be something like 15 or 20 seconds at minimum with 5-10 seconds range of randomization at minimum ... And those might not even be enough.
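A rough sketch of what such a randomized wait could look like, using the numbers floated above (15 seconds base, up to 10 seconds of jitter). The class `RandomizedDelay` and its methods are hypothetical names for illustration, not existing ripme code, and there is no evidence yet that any particular values get past the site's limiter.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of ripme.
final class RandomizedDelay {
    // Returns a delay in [baseMillis, baseMillis + jitterMillis], both ends inclusive.
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        if (baseMillis < 0 || jitterMillis < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        // nextLong(bound) is exclusive of bound, so add 1 to include the upper end.
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
    }

    // Blocks the current thread for one randomized interval.
    static void sleepBeforeNextRequest(long baseMillis, long jitterMillis)
            throws InterruptedException {
        Thread.sleep(nextDelayMillis(baseMillis, jitterMillis));
    }

    public static void main(String[] args) {
        // Print a few sample delays without actually sleeping.
        for (int i = 0; i < 3; i++) {
            System.out.println(nextDelayMillis(15_000, 10_000) + " ms");
        }
    }
}
```

Calling `RandomizedDelay.sleepBeforeNextRequest(15_000, 10_000)` before each page or image request would space requests 15 to 25 seconds apart. Whether that is enough to avoid the ban can only be found by testing.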

Tbh I'm very surprised how strict limiting they suddenly implemented.

rautamiekka avatar Aug 12 '17 11:08 rautamiekka

Maybe we need to make the wait interval long and slightly randomized to get around bot-detection?

That might work, I'll look into it.

Tbh I'm very surprised how strict limiting they suddenly implemented.

I wouldn't be shocked if they did it to combat ripme, considering it went into effect pretty much right after I fixed this ripper and added watermark-free ripping.

cyian-1756 avatar Aug 12 '17 14:08 cyian-1756