
Including URLs that do not exist

tomjennings opened this issue 2 years ago · 0 comments

I have a static HTML site hosted in a Google Cloud Storage bucket. The resulting sitemap includes paths that simply do not exist on the site. Using the distributed run.py and specifying the URL on the command line ('http://www.WEBSITE.com'), it generates spurious URLs in the resulting sitemap, roughly one for every 50-100 real-appearing URLs.

Example: There is no /Code* in the bucket (or source file tree).

Snippet of sitemap.xml output:

[Screenshot: sitemap.xml snippet, 2022-03-20]
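One way to confirm which sitemap entries are spurious without eyeballing the bucket listing is to check each `<loc>` URL against the live site. This is a minimal sketch (not part of sitemap-generator) that naively regex-parses the sitemap and flags URLs that fail a HEAD request; `find_dead_urls` is a hypothetical helper name:

```python
import re
import urllib.request

def extract_locs(xml_text):
    """Pull every <loc>...</loc> URL out of a sitemap (naive regex parse)."""
    return re.findall(r"<loc>(.*?)</loc>", xml_text)

def find_dead_urls(xml_text, timeout=10):
    """Return the sitemap URLs that do not answer with a success status."""
    dead = []
    for url in extract_locs(xml_text):
        try:
            req = urllib.request.Request(url, method="HEAD")
            urllib.request.urlopen(req, timeout=timeout)  # raises on 4xx/5xx
        except Exception:
            dead.append(url)
    return dead

if __name__ == "__main__":
    with open("sitemap.xml") as f:
        for url in find_dead_urls(f.read()):
            print(url)
```

Running this against the generated sitemap should print exactly the phantom /Code* style entries, which makes it easy to quantify how many there are.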

Bucket contents in gcloud browser:

[Screenshot: bucket listing, 2022-03-20]

... or via gsutil 'ls':

```
/PATH/sitemap-generator: gsutil ls gs://www.WEBSITE.com
gs://www.WEBSITE.com/AM-logo-350.png
gs://www.WEBSITE.com/Rambler-logo-1920x1080.png
gs://www.WEBSITE.com/Rambler-logo-350.png
gs://www.WEBSITE.com/ads.txt
gs://www.WEBSITE.com/botNav.txt
gs://www.WEBSITE.com/contact.html
gs://www.WEBSITE.com/dead-URL-help.html
gs://www.WEBSITE.com/header-scripts.txt
gs://www.WEBSITE.com/index.html
gs://www.WEBSITE.com/keybase.txt
gs://www.WEBSITE.com/robots.txt
gs://www.WEBSITE.com/sr.css
gs://www.WEBSITE.com/topNav.txt
gs://www.WEBSITE.com/AMC/
/PATH/sitemap-generator:
```

In the source directory there are some 49,000 files and directories:

```
/PATH/www.WEBSITE.com: find . | wc -l
49134
```

That makes for a ~2 MB sitemap. Practically speaking, most of those URLs are excluded using this custom run.py (the only changes are here):

```python
# root_url = sys.argv[1]
root_url = 'http://www.WEBSITE.com'
crawler(root_url, out_file='WEBSITE.sitemap.xml',
        exclude_urls=["/Book/", "/PagesHi/", "/PagesLo/",
                      ".mov", ".pdf", ".JPG", ".gif", ".jpg", ".zip"])
```
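For context, the exclude_urls filtering presumably works by simple substring matching against each discovered URL (an assumption about sitemap-generator's internals; its actual matching may differ). A minimal sketch of that behavior:

```python
def is_excluded(url, exclude_urls):
    # Assumed behavior: a URL is dropped if any exclude pattern
    # appears anywhere in it as a plain substring.
    return any(pattern in url for pattern in exclude_urls)

exclude = ["/Book/", "/PagesHi/", "/PagesLo/",
           ".mov", ".pdf", ".JPG", ".gif", ".jpg", ".zip"]
print(is_excluded("http://www.WEBSITE.com/Book/ch1.html", exclude))  # True
print(is_excluded("http://www.WEBSITE.com/contact.html", exclude))   # False
```

Note that under substring matching, an exclusion list like this only shrinks the output; it cannot explain where the phantom /Code* URLs come from, which is why the bug appears with either list.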

So my "real" sitemap is only 434 lines, about 37K bytes.

The problem occurs whether I use the distributed exclude_urls list or my custom one.

tomjennings · Mar 20 '22 21:03