wswp
Chapter 1 robots.txt not available, or sitemap.xml
Hi. I'm a beginner with rusty Python skills, but I think there may be an issue here. These links don't display the text shown in the book; they just display the home page:
- http://example.webscraping.com/robots.txt
- http://example.webscraping.com/sitemap.xml
Apologies if this is on the errata page. Haven't checked it yet. ...Just checked ...couldn't find the book to post it there.
A sitemap is a file provided by the site owner that lists all of the site's links in one place. Site owners may or may not provide one. The section uses specific examples to show that some websites are simple enough to keep their data organized by ID. You should use only the links given in the book's examples.
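To make the sitemap idea concrete, here is a minimal sketch of pulling the links out of a sitemap with Python's standard library. The XML below is a hypothetical sample written for illustration (the two URLs are assumptions, not the contents of any live file), and `extract_links` is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>
"""


def extract_links(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


links = extract_links(SAMPLE_SITEMAP)
for link in links:
    print(link)
```

If a site does not publish a sitemap at all, this approach has nothing to parse, which is why the chapter also covers other ways of finding pages.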
Wow! Thank you for the quick response.
For clarification, those are the links provided in the book. I'm using the version that is available through Safari Online Bookshelf.
Thanks,
Jodi
(In reply to Nilesh Pandey's comment, sent Thursday, December 21, 2017, 4:04 PM.)
Use this site for the sitemap example:
https://webscraping.com/sitemap.xml
Thank you! Very helpful.
Do you have one for the robots.txt?
(In reply to Nilesh Pandey's comment, sent Thursday, December 21, 2017, 5:02 PM.)
I see that https://webscraping.com has a robots.txt, but it doesn't have the 'Bad crawler' and other items in it.
Hi all,
Unfortunately, not all links are still available and I am working with the original author to get the site back to perfect shape. In the meantime, this robots.txt is not as described in the book. Sorry about that and I hope you still enjoy working through the other examples!
-katharine
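While the live file differs from the book, the robots.txt handling can still be tried offline. The sketch below feeds an inline robots.txt to the standard library's `urllib.robotparser`. The rules are an assumption, written to resemble the kind of file discussed in this thread (a block for a 'BadCrawler' user agent plus a general block); they are not a verified copy of the book's file or of any live site. `build_parser` and the `GoodCrawler` agent name are hypothetical.

```python
from urllib import robotparser

# Assumed robots.txt content, for illustration only.
ASSUMED_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text instead of a URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_parser(ASSUMED_ROBOTS_TXT)
home = "http://example.webscraping.com/"

# Ask whether each user agent may fetch a given URL under these rules.
bad_allowed = rp.can_fetch("BadCrawler", home)
good_allowed = rp.can_fetch("GoodCrawler", home)
trap_allowed = rp.can_fetch("GoodCrawler", home + "trap")
delay = rp.crawl_delay("GoodCrawler")

print(bad_allowed, good_allowed, trap_allowed, delay)
```

To check a real site instead, `RobotFileParser.set_url()` followed by `read()` downloads the file before `can_fetch()` is called.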
(In reply to jadegrave's comment of Fri, Dec 22, 2017, 1:01 AM.)
The sample sitemap I am trying to scrape has links with .gz file extensions in it. How do I deal with such file types?
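One possible approach, sketched under assumptions: a `.gz` sitemap is normally an ordinary sitemap compressed with gzip, so it can be decompressed with the standard `gzip` module and then parsed as usual. The sample data and the `sitemap_locs` helper below are hypothetical, and the sketch assumes the file has already been downloaded as bytes.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sitemap_locs(raw_bytes):
    """Return <loc> values from sitemap bytes, decompressing gzip if needed."""
    if raw_bytes[:2] == GZIP_MAGIC:
        raw_bytes = gzip.decompress(raw_bytes)
    root = ET.fromstring(raw_bytes)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap bytes, compressed here to stand in for a downloaded .gz file.
PLAIN = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page-1</loc></url>
  <url><loc>http://example.com/page-2</loc></url>
</urlset>
"""
compressed = gzip.compress(PLAIN)

locs = sitemap_locs(compressed)
print(locs)
```

If the top-level file is a sitemap index, its `<loc>` entries point to further sitemaps (often the `.xml.gz` files themselves); each of those can be downloaded, for example with `urllib.request.urlopen(url).read()`, and passed through the same function.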