
Extending the web scraper.

Open ba11b0y opened this issue 8 years ago • 13 comments

There hasn't been much work on the web scraping part, and I am interested in working on this. Since this is going to be a generic scraper, here is what I have thought of so far:

  1. A generic web scraper which scrapes all images, links and the text.
  2. Maybe use scrapy for this.

Still a beginner here, so any tips or corrections are welcome.
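The two points above could be sketched roughly like this, using only the standard library for illustration (a real implementation might use scrapy or BeautifulSoup instead; the function name `scrape_html` and its return shape are assumptions, not the project's actual API):

```python
# Sketch of a generic scraper: collect all images, links, and text
# from an HTML document. Stdlib-only stand-in for the scrapy idea.
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images, self.links, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def scrape_html(html):
    """Return every image, link, and text fragment found in `html`."""
    parser = _PageParser()
    parser.feed(html)
    return {
        "images": parser.images,
        "links": parser.links,
        "text": " ".join(parser.text),
    }
```

For example, `scrape_html('<a href="/x">hi</a><img src="a.png">')` would return `{'images': ['a.png'], 'links': ['/x'], 'text': 'hi'}`.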

ba11b0y avatar Oct 04 '17 04:10 ba11b0y

@invinciblycool I like the idea. I would suggest putting together a detailed list of the components you find missing in the current scraper code; then we will assign you the work.

shubhodeep9 avatar Oct 04 '17 06:10 shubhodeep9

@invinciblycool XML format could be added.

ashwini0529 avatar Oct 04 '17 13:10 ashwini0529

@ashwini0529 I have added the XML response to web.py. Let me know if any corrections are needed. @shubhodeep9 I will update the detailed list as soon as my exams are over :smile:

ba11b0y avatar Oct 05 '17 07:10 ba11b0y

@ashwini0529 @shubhodeep9 Couldn't resist the excitement :smile: Here are some features I have in mind that could be added:

  • [ ] If no JSON response is returned by the URL, only the source of the page is returned. We could have a better scraper which either:
  1. Returns a dictionary or a JSON response:
```json
{
  "assets":
  {
    "images":
    [
      "link of image1 on the page",
      "link of image2 on the page"
    ],
    "videos":
    [
      "link to embedded video1",
      "link to embedded video2"
    ]
  },
  "content":
  {
    "text": "all raw text from the page",
    "html": "all html from the page"
  }
}
```
  2. Or creates dedicated directories for the above keys of the dictionary and actually saves the content to the respective directory (inspired by httrack).
  • [x] Another feature could be adding a specific scrape option. For example: web.scrape(url, scrape_content = "images") returns all the links to images on the page, or saves the images locally.
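A rough sketch of the proposed response shape and the `scrape_content` filter (the `scrape` signature follows the bullet above, but the stdlib parsing here is a stand-in for illustration, not the project's real code):

```python
# Build the proposed {"assets": ..., "content": ...} dict from HTML,
# optionally narrowed to one asset type via scrape_content.
from html.parser import HTMLParser


class _AssetParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images, self.videos, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag in ("iframe", "video") and attrs.get("src"):
            # Treat embedded iframes/videos as the "videos" asset class.
            self.videos.append(attrs["src"])

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def scrape(html, scrape_content=None):
    """Return the full asset/content dict, or one asset list if
    scrape_content (e.g. "images") is given."""
    parser = _AssetParser()
    parser.feed(html)
    result = {
        "assets": {"images": parser.images, "videos": parser.videos},
        "content": {"text": " ".join(parser.text), "html": html},
    }
    if scrape_content:
        return result["assets"].get(scrape_content, [])
    return result
```

With this shape, `scrape(html, scrape_content="images")` narrows the result to just the image links, matching the checked-off bullet above.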

ba11b0y avatar Oct 05 '17 11:10 ba11b0y

Hey @invinciblycool Sounds like a great idea to start with. Go ahead; we can add more features. 🎉

ashwini0529 avatar Oct 05 '17 13:10 ashwini0529

@invinciblycool Add a TO-DO with your PR, and we will keep this issue alive until we are satisfied, so that whenever someone gets a new idea on web scraping, they can add it to that TO-DO.

shubhodeep9 avatar Oct 05 '17 13:10 shubhodeep9

Also, please add a [WIP] tag in your PR message. 😄

ashwini0529 avatar Oct 05 '17 13:10 ashwini0529

@ashwini0529 Before I start working, could you clarify whether the function should return a response or should create folders and save the content locally? Thanks. @shubhodeep9 Just confirming: does the TO-DO go with the PR or with the issue?

ba11b0y avatar Oct 05 '17 15:10 ba11b0y

Hey @invinciblycool you can take a look at the QR code function; I think you can make something like that. Probable usage, similar to what it was for QRCode: img = hackr.image.qrcode("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_qrcode.png")

ashwini0529 avatar Oct 05 '17 16:10 ashwini0529

I guess we agree then on saving all the content locally. I will start working on it ASAP.
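The "save locally" behaviour agreed above could look something like this: one directory per asset type, plus a content directory, in the httrack spirit. The `save_scrape` name and directory layout are assumptions for illustration, not the library's actual API; a real scraper would also download each asset URL rather than just recording it.

```python
# Sketch: write a scrape result dict (shape proposed earlier in this
# thread) to disk, one subdirectory per asset type plus "content".
import os


def save_scrape(result, dest_path):
    """Persist a scrape result under dest_path."""
    for kind, items in result["assets"].items():
        asset_dir = os.path.join(dest_path, kind)
        os.makedirs(asset_dir, exist_ok=True)
        # A real implementation would fetch each URL; here we just
        # record the links so the directory layout is visible.
        with open(os.path.join(asset_dir, "links.txt"), "w") as f:
            f.write("\n".join(items))
    content_dir = os.path.join(dest_path, "content")
    os.makedirs(content_dir, exist_ok=True)
    with open(os.path.join(content_dir, "page.html"), "w") as f:
        f.write(result["content"]["html"])
    with open(os.path.join(content_dir, "page.txt"), "w") as f:
        f.write(result["content"]["text"])
```

This would yield a tree like `dest_path/images/`, `dest_path/videos/`, and `dest_path/content/page.html`.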

ba11b0y avatar Oct 06 '17 04:10 ba11b0y

Hey @invinciblycool Updates?

ashwini0529 avatar Oct 19 '17 10:10 ashwini0529

Sorry for the delay, I will try opening a PR by this week. Happy Diwali BTW. :sparkles:

ba11b0y avatar Oct 20 '17 09:10 ba11b0y

Perfect @invinciblycool Happy hacking and Happy Diwali! 😄 🎇

ashwini0529 avatar Oct 20 '17 09:10 ashwini0529