
write scrapers for some other websites

Open moghya opened this issue 7 years ago • 12 comments

Following websites can be scraped

  1. http://bookboon.com

moghya avatar Oct 23 '17 18:10 moghya

Such as?

cLupus avatar Oct 23 '17 18:10 cLupus

@cLupus Thanks for showing interest in this project. I hope you visited http://moghya.me/allitebooks and got what we're trying to do here.

You can go through http://bookboon.com and try to write a scraper for it.

I'll add many such websites soon. Let me know if you're going to do it, and I'll assign this to you :)

moghya avatar Oct 23 '17 18:10 moghya

I got to take a look at the site, as well as at your repo. Am I correct in understanding that this issue is concerned with creating a scraper that produces a file similar to data.py?

cLupus avatar Oct 23 '17 18:10 cLupus

Yes, you're correct. It's just that we dump the dictionary to JSON and then process that JSON.
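
Roughly something like this, as a minimal sketch (the field names, selectors, and output filename are just illustrative assumptions, not the exact schema data.py uses):

```python
import json

import requests
from bs4 import BeautifulSoup


def scrape_book_page(url):
    """Scrape a single book page into a plain dictionary (illustrative fields only)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "description": soup.select_one("meta[name=description]")["content"],
        "url": url,
    }


def dump_books(book_urls, path="bookboon.json"):
    """Dump the scraped dictionaries to a JSON file for later processing."""
    books = [scrape_book_page(u) for u in book_urls]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(books, fh, ensure_ascii=False, indent=2)
```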

moghya avatar Oct 23 '17 18:10 moghya

That does sound interesting. I assume the description should be in English. However, the site does offer some additional languages, although not all the descriptions have been translated into the different languages. Is there any plan for localization (or at the very least to grab what's there in different languages)?

cLupus avatar Oct 23 '17 18:10 cLupus

Honestly, I didn't think of it. But as you have rightly pointed out, we have to think about it. What do you propose?

moghya avatar Oct 23 '17 18:10 moghya

On closer inspection, it seems that only the site itself has been translated, not the titles or the descriptions, so it would not add much value (in the first run, anyway).

cLupus avatar Oct 23 '17 19:10 cLupus

Let's make it work for English, and we'll come up with a solution in the near future.

moghya avatar Oct 23 '17 19:10 moghya

Another issue is that http://bookboon.com 'locks' its books behind a dropdown and does not offer direct links to them. There are some ways to alleviate this:

  1. Download the zip files and host them (somewhere) behind a direct link.
  2. Do some trickery with the cookies that are sent along with the request.
  3. Something else?

cLupus avatar Oct 23 '17 19:10 cLupus

Downloading the zip is one option, but intercepting the request that downloads the book may solve our problem. Think of it this way: the scraper won't follow bookboon's flow, it'll work a step ahead. We can work out what exactly happens after the details are filled in, and instead of filling in the details we can directly send the request that downloads the PDF.
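
A minimal sketch of what that could look like; the endpoint and form fields below are placeholders (assumptions), and the real request would have to be read out of the browser's network tab:

```python
import requests

# Placeholder URL; replace with the actual book/download endpoint observed in the browser.
DOWNLOAD_URL = "https://bookboon.com/en/some-book-ebook"


def download_pdf(download_url, out_path="book.pdf"):
    """Replay the request the site sends after the form is filled in,
    skipping the dropdown/form step entirely."""
    session = requests.Session()

    # Visit the book page first so the session picks up any cookies it needs.
    session.get(download_url, timeout=30)

    # Placeholder payload: replace with whatever fields the real download request carries.
    payload = {"format": "pdf"}
    response = session.post(download_url, data=payload, timeout=60)
    response.raise_for_status()

    with open(out_path, "wb") as fh:
        fh.write(response.content)
```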

moghya avatar Oct 24 '17 02:10 moghya

Hi there, ladies and gentlemen. What's the status on this issue? @moghya Mind if I hop in? Also, shouldn't the first page be a bit more descriptive? I.e., the vast majority of websites state somewhere on the homepage what the site is and what it does, rather than down in the code.

Let me know what you think!

EmilLuta avatar Oct 24 '17 19:10 EmilLuta

@EmilLuta maybe you can contribute by working on #3.

moghya avatar Oct 24 '17 21:10 moghya