hackr
hackr copied to clipboard
Extending the web scraper.
There hasn't been much work on the web scraping part. I am interested to work on this. Since this is going to be a generic one, what I have thought as of now includes:
- A generic web scraper which scrapes all images, links and the text.
- Use scrapy for this maybe.
Still a beginner, any tips or corrections?
@invinciblycool I like the thought, I would suggest, a detailed list of missing components you find in the current code of scraper, then we will assign you the work.
@invinciblycool XML format could be added.
@ashwini0529 I have added the XML response to web.py. Let me know if any corrections are needed
@shubhodeep9 I will update the detailed list as soon as my exams get over :smile:
@ashwini0529 @shubhodeep9 Couldn't resist the excitement :smile: These are some features in my mind which can be added :
- [ ] If no JSON response is returned by the URL, only the source of the page is returned. We could have a more better scraper which returns either:
- A dictionary or a JSON reponse:
{
"assets":
{
"images":
[
"link of image1 on the page",
"link of image2 on the page"
],
"videos":
[
"link to embedded video1",
"link to embedded video2"
]
},
"content":
{
"text": "all raw text from the page",
"html": "all html from the page"
}
}
- Or creates dedicated directories for the above keys of the dictionaries and actually saves the content to the respective directory.(Inspired from httrack)
- [x] Another feature could be adding a specific scrape option.
For Example:
web.scrape(url, scrape_content = "images")returns all the links to images in or saves the images locally.
Hey @invinciblycool Sounds good. Sounds like a great idea to start with. Go ahead. We can add more features. 🎉
@invinciblycool Add a TO-DO with your PR, and we will keep this issue alive until we feel satisfied. So that whenever someone gets a new idea on web-scraping, they can add to that TO-DO
Also, please add a [WIP] tag in your PR message. 😄
@ashwini0529 To start working if you could make it clear that should the function be returning a response or should create folders and save the content locally. Thanks. @shubhodeep9 Just confirming a TO-DO with the PR or the issue.
Hey @invinciblycool you can take a look at the QR Code function. I think you can make something like that.
Probable usage like what it was for QRCode:
img = hackr.image.qrcode("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_qrcode.png")
I guess then we agree on saving all the content locally. Will start working on it ASAP.
Hey @invinciblycool Updates?
Sorry for the delay, I will try opening a PR by this week. Happy Diwali BTW. :sparkles:
Perfect @invinciblycool Happy hacking and Happy Diwali! 😄 🎇