
Chapter 1 robots.txt not available, or sitemap.xml

Open jadegrave opened this issue 7 years ago • 7 comments

Hi. I'm a beginner with rusty Python skills, but I think there may be an issue here: these links don't display the text indicated in the book. They just display the home page:

  • http://example.webscraping.com/robots.txt
  • http://example.webscraping.com/sitemap.xml

Apologies if this is on the errata page. Haven't checked it yet. ...Just checked ...couldn't find the book to post it there.

jadegrave avatar Dec 21 '17 21:12 jadegrave

A sitemap is a file, provided by the site owner, that lists all of the site's links in one place. A site may or may not provide one. The section uses specific examples to show that some websites are simple enough to keep their data organized in order of ID. You should use only the links given in the book's example.

nile649 avatar Dec 21 '17 22:12 nile649

Wow! Thank you for the quick response.

For clarification, those are the links provided in the book. I'm using the version that is available through Safari Online Bookshelf.

Thanks,

Jodi

jadegrave avatar Dec 21 '17 22:12 jadegrave

Use this website for a sitemap example:

https://webscraping.com/sitemap.xml

nile649 avatar Dec 21 '17 23:12 nile649

Thank you! Very helpful.

Do you have one for the robots.txt?

jadegrave avatar Dec 21 '17 23:12 jadegrave

I see that https://webscraping.com has a robots.txt, but it doesn't have the 'Bad crawler' and other items in it.

jadegrave avatar Dec 22 '17 00:12 jadegrave
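
While the live robots.txt lacks the 'Bad crawler' rules, the robots.txt checks can still be tried offline by feeding rules to Python's standard `urllib.robotparser` as text. This is only a rough sketch: the rules in `ROBOTS_TXT` are an assumed stand-in written for illustration (not a copy of the site's real file), and the agent names `BadCrawler` and `GoodCrawler` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed stand-in rules for illustration only, not the site's actual file.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

def build_parser(text):
    """Parse robots.txt rules from a string instead of downloading them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
url = "http://example.webscraping.com/"

# BadCrawler is blocked from the whole site by the first rule group.
print(parser.can_fetch("BadCrawler", url))            # False
# Any other agent falls through to the '*' group and may fetch the home page.
print(parser.can_fetch("GoodCrawler", url))           # True
# The '*' group still blocks the /trap path for everyone else.
print(parser.can_fetch("GoodCrawler", url + "trap"))  # False
# Requested delay between downloads, in seconds, for the '*' group.
print(parser.crawl_delay("GoodCrawler"))              # 5
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` downloads the file, but the results then depend on whatever that site currently serves.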

Hi all,

Unfortunately, not all links are still available and I am working with the original author to get the site back to perfect shape. In the meantime, this robots.txt is not as described in the book. Sorry about that and I hope you still enjoy working through the other examples!

-katharine

kjam avatar Dec 24 '17 16:12 kjam

The sample sitemap I am trying to scrape has links with .gz file extensions in it. How do I deal with such file types?

MayaMalkoti avatar Jan 15 '20 08:01 MayaMalkoti
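
One possible way to handle .gz sitemap files, sketched with the standard library only: a .gz sitemap is normally an ordinary sitemap XML file compressed with gzip, so it can be decompressed in memory and then parsed as usual. The helper name `extract_locs`, the `sample` data, and the example.com URLs below are made up for illustration, and the sketch assumes the file follows the standard sitemaps.org namespace.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed to apply to your sitemap).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(raw_bytes):
    """Return the <loc> URLs from sitemap bytes, gunzipping first if needed."""
    # gzip data starts with the two magic bytes 0x1f 0x8b.
    if raw_bytes[:2] == b"\x1f\x8b":
        raw_bytes = gzip.decompress(raw_bytes)
    root = ET.fromstring(raw_bytes)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]

# Hypothetical sitemap content, compressed here to imitate a downloaded .gz file.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/view/1</loc></url>
  <url><loc>http://example.com/view/2</loc></url>
</urlset>"""

print(extract_locs(gzip.compress(sample)))
# ['http://example.com/view/1', 'http://example.com/view/2']
```

For a real sitemap, the bytes would come from a download, for example `urllib.request.urlopen(url).read()`. If the top-level sitemap is an index whose `<loc>` entries end in .gz, the same function can be applied first to the index and then to each downloaded child file.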