packtpub-crawler

I suppose this is the end of packtpub-crawler?

Open lucymhdavies opened this issue 7 years ago • 27 comments

https://www.packtpub.com/packt/offers/free-learning

[Screenshot: 2017-05-26 at 10:21:35]

lucymhdavies avatar May 26 '17 09:05 lucymhdavies

They have done it before as part of some a/b tests, hopefully they revert it back after the stats drop (I don't think people manually check the site every day).

Maybe we can contact them, since this script turns a daily chore into a pleasant experience and all their free books are already downloadable from other sources anyways.

But otherwise, we can't do much about it

juzim avatar May 26 '17 09:05 juzim

oh no! just started implementing this script with the packtpub Alexa skill yesterday! How frustrating!

deliussed avatar May 26 '17 12:05 deliussed

I have added the book title and the claim URL in the error messages, this way we can at least check if the book is interesting enough to claim it manually. https://github.com/niqdev/packtpub-crawler/pull/71

Still, this is a really stupid move, I immediately lost all interest in visiting packtpub :/

juzim avatar Jun 01 '17 10:06 juzim

That's a useful feature at least. Shame we can't automatically claim them anymore :(

lucymhdavies avatar Jun 01 '17 10:06 lucymhdavies

going to close this, as https://github.com/niqdev/packtpub-crawler/pull/71 has now been merged

lucymhdavies avatar Jun 02 '17 11:06 lucymhdavies

I have created a new branch with a proposal; I don't know if it is worth spending time on.

I have fixed the claim, looking at the docs the recaptcha-token field should always be available in the page, but needs to be validated by the client and can be used only once. If you solve the captcha manually and plug the token here you are able to download the book. If you run the script with an invalid captcha it will download the latest book claimed with the wrong title.

Would be interesting, just for fun, to try to de-couple the claim from the rest, solving only the captcha via mail :blush:

By the way, this document (although I think it is already obsolete) is an alternative, but I don't think it should be the way to go :disappointed:

niqdev avatar Jun 02 '17 11:06 niqdev

Since we have duplicated issues #75 #76 related to this one I will re-open it.

The problem is related to the captcha, and the error looks like this:

[-] <type 'exceptions.IndexError'> list index out of range | spider.py@97
Traceback (most recent call last):
  File "script/spider.py", line 97, in main
    packtpub.runDaily()
  File "/home/ubuntu/Projects/github/packtpub-crawler/script/packtpub.py", line 161, in runDaily
    self.__parseDailyBookInfo(soup)
  File "/home/ubuntu/Projects/github/packtpub-crawler/script/packtpub.py", line 93, in __parseDailyBookInfo
    self.info['url_claim'] = self.__url_base + div_target.select('a.twelve-days-claim')[0]['href']
IndexError: list index out of range
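
For reference, the failing line takes index `[0]` of the `select` result, which is an empty list when the claim link is absent from the page. A minimal sketch of a guard follows. It is not the project's actual fix: the helper name `extract_claim_url`, the `FakeDiv` stand-in, and the sample `href` are hypothetical, and only the selector `a.twelve-days-claim` comes from the traceback above.

```python
def extract_claim_url(div_target, url_base, selector="a.twelve-days-claim"):
    """Return the absolute claim URL, or None when the claim link is missing.

    `div_target` is anything with a BeautifulSoup-style `select(css)` method
    that returns a list of tag-like objects supporting `tag["href"]`.
    """
    matches = div_target.select(selector)
    if not matches:
        # No claim link in the page: report it instead of raising IndexError.
        return None
    return url_base + matches[0]["href"]


class FakeDiv(object):
    """Hypothetical stand-in for a parsed tag, used only for this demo."""

    def __init__(self, links):
        self._links = links

    def select(self, selector):
        return list(self._links)


if __name__ == "__main__":
    base = "https://www.packtpub.com"
    # The href below is an invented example path, not a real claim URL.
    print(extract_claim_url(FakeDiv([{"href": "/example-claim"}]), base))
    print(extract_claim_url(FakeDiv([]), base))
```

With a guard like this the caller can log the book title and skip the claim step when `None` comes back, rather than aborting the whole run.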

There is a feature branch with a proposal, but it could be a black hole!

niqdev avatar Jun 20 '17 13:06 niqdev

@niqdev there really is a problem with the captcha, and it still doesn't work. As one more option, maybe implement it in two steps, with the page opened manually?

develsites avatar Jun 20 '17 18:06 develsites

@develsites yep that was the idea/proposal in the feature branch, 2 step process solving the captcha manually via email for example, but unfortunately yes at the moment the script is broken and we can't do much

niqdev avatar Jun 22 '17 14:06 niqdev

Honestly, if you have to solve the captcha manually anyway, then you may as well just go to https://www.packtpub.com/packt/offers/free-learning and claim it manually.

Packtpub-crawler is still useful for notifying what the latest book is though :)

lucymhdavies avatar Jun 22 '17 14:06 lucymhdavies

i had no captcha today... is it an error or did they remove it? Claiming still worked

Nightreaver avatar Jul 17 '17 18:07 Nightreaver

umh, something changed for sure, the reCAPTCHA moved to the bottom-right of the page. Were you able to download the book with the script?

niqdev avatar Jul 17 '17 19:07 niqdev

Hi

I didn't use the script but I was able to get the book manually without validating reCAPTCHA.

Thanks.

tpoindessous avatar Jul 17 '17 19:07 tpoindessous

The CAPTCHA has not yet returned, but the script fails to claim the book with IndexError('list index out of range',).

brechtm avatar Jul 18 '17 09:07 brechtm

Yeah, you don't have to "do" anything for the captcha to work... maybe it detects the browser or something? For me, using Chrome, it just works: no box, nothing. But blocking Google prints the error "no captcha" or whatever.

Is it a new kind of captcha from Google?

"invisible recaptcha" - https://developers.google.com/recaptcha/docs/invisible

Nightreaver avatar Jul 24 '17 16:07 Nightreaver

Yes, they seem to analyze things like mouse movement patterns. It's called "invisible recaptcha" and it's really interesting when you are into machine learning.

juzim avatar Jul 25 '17 18:07 juzim

Hello, we have managed to solve the captcha and get my script-grabber working. You can use the same solution or check mine at: https://github.com/igbt6/Packt-Publishing-Free-Learning Regards!

luk6xff avatar Aug 22 '17 22:08 luk6xff

@igbt6 That's awesome, thanks a lot for sharing with us!

niqdev avatar Aug 23 '17 08:08 niqdev

@niqdev I managed to get my Packt grabber working by using Selenium in headless mode AND setting useragent to Chrome (default for headless Chrome is, if I recall correctly, WebdriverChrome).
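
The approach described above could be sketched roughly as follows. This is not katka-n's actual code: the function names and the user-agent string are example values chosen for illustration, and `make_driver` assumes the third-party `selenium` package plus a matching chromedriver are installed.

```python
# Example desktop Chrome user agent; a placeholder value, replace with your own.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
)


def build_chrome_arguments(user_agent):
    """Return the Chrome command-line switches for a headless session
    that reports the given user agent instead of the headless default."""
    return ["--headless", "--user-agent=" + user_agent]


def make_driver(user_agent=EXAMPLE_USER_AGENT):
    """Start headless Chrome with an overridden user agent."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent):
        options.add_argument(argument)
    # Older Selenium releases take `chrome_options=` instead of `options=`.
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    print(build_chrome_arguments(EXAMPLE_USER_AGENT))
```

Whether this is enough to pass the invisible reCAPTCHA depends on how Packt scores the session, so treat it as a starting point rather than a guaranteed fix.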

katka-n avatar Oct 06 '17 15:10 katka-n

@katka-n great! is it easy to integrate with the current project?

niqdev avatar Oct 06 '17 18:10 niqdev

@Hacktoberfest Anyone interested in integrating Anti Captcha or other solutions? Thanks

niqdev avatar Oct 06 '17 18:10 niqdev

@niqdev I am not that experienced but I will try to do so, if I succeed I will create a pull request ;)

Update: I got the basic downloading to the user's account working, but the script stops at downloading a file to the drive.

katka-n avatar Oct 06 '17 18:10 katka-n

here is a python solution for the recaptcha https://github.com/ecthros/uncaptcha

tjnel avatar Oct 26 '17 05:10 tjnel

Thanks @tjadanel, any interest in integrating it?

niqdev avatar Oct 27 '17 17:10 niqdev

I see that they have removed the recaptcha badge from the site. Could this mean that recaptcha is removed? I tried running the script and got "list index out of range", which means either that recaptcha is still in place or that the structure of the site has changed. Will investigate though. If you don't hear from me, either I haven't gotten anywhere or recaptcha is still in place.

justingiffard avatar Jan 21 '18 09:01 justingiffard

@justingiffard Packt still uses reCaptcha; they just switched to the so-called invisible reCaptcha. Use my script instead: https://github.com/igbt6/Packt-Publishing-Free-Learning which will do the work for you ; )

luk6xff avatar Jan 21 '18 10:01 luk6xff

@igbt6 thanks but you make use of a service which is not free (albeit cheap)

justingiffard avatar Jan 21 '18 11:01 justingiffard