Sitemap exceeding 50K urls
What version of Hugo are you using (hugo version)?
$ hugo version
hugo v0.97.3+extended linux/amd64 BuildDate=unknown
Does this issue reproduce with the latest release?
Yes.
If you create a sitemap on a site with over 50K URLs, Google complains that the file is too big.
Your Sitemap contains too many URLs. Please create multiple Sitemaps with up to 50000 URLs each and submit all Sitemaps.
I looked at the docs and noticed that there is no way to override this. Technically not a bug, but it makes it difficult to submit a sitemap to Google.
See https://discourse.gohugo.io/t/feature-request-sitemapindex-for-sitemaps-with-50k-links/33214
Hi @carerragt
Thanks for pointing in the right direction.
Please note davidsneighbour's response on that thread.
Often requested, but technically not possible.
This means that there is reasonable demand and need for this feature. I have a site with a single 'type'. I don't have categories, tags, or other taxonomies. There are 79K URLs on my site, all belonging to the same type.
Hence, Ju52's proposal may not work for me, as I don't have different sections on the site. I understand that this may not be the top priority, but it is a problem worth fixing. I run hundreds of sites, all in WordPress. At least 50% of them have more than 50K URLs.
Perhaps what you should be asking is how to split the sitemap list to multiple files. The 50k is a Google limitation, not Hugo's per se.
hi @carerragt
Sure, I understand.
It does raise the point that the very reason for having a sitemap is to submit it to search engines. Without this need, there is no requirement for a sitemap. Both Google and Bing, which provide consoles for managing sitemap submission, specifically require that a sitemap be split into files of at most 50K URLs each.
I could open a forum thread, but if you solicit community feedback from those who have larger sites, they will tell you that this might be a very important feature for them.
I also have a site with 50k+ pages in a single sitemap.
Adopting some kind of automatic splitting in Hugo due to an external limit might break things for others. We should first complain about the limit to the external search engines. Maybe those providers can raise the limit to, let's say, 100k?
@midzer
You can certainly try, but there is a rationale for limiting the file to 50K URLs: the size of the file. Try downloading a sitemap that has 50K URLs and the size will be approx. 4MB.
Furthermore, Google and Bing certainly don't need to change their processing pipelines because a static site generator decided that sitemap.xml shouldn't be split. If Hugo wants to be adopted, then the onus of adding features or making changes in line with industry expectations lies with Hugo and not with other providers.
The Sitemaps protocol supports a main "sitemap index" file that lists many child sitemaps (up to 50,000 URLs each). Example:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/sitemap2.xml.gz</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>
Reference: https://www.sitemaps.org/protocol.html
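An index file like the one above can be generated with a few lines of code. The following is a minimal sketch in Python, not anything Hugo provides: the function name build_sitemap_index and the example URLs are hypothetical, and it assumes the child sitemap files already exist at the given locations.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sitemaps.org protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls, lastmod=None):
    """Return a <sitemapindex> document listing each child sitemap URL.

    build_sitemap_index is a hypothetical helper name; lastmod is an
    optional date string applied to every entry.
    """
    ET.register_namespace("", NS)  # serialize with a default xmlns, no prefix
    index = ET.Element(f"{{{NS}}}sitemapindex")
    for url in sitemap_urls:
        entry = ET.SubElement(index, f"{{{NS}}}sitemap")
        ET.SubElement(entry, f"{{{NS}}}loc").text = url
        if lastmod:
            ET.SubElement(entry, f"{{{NS}}}lastmod").text = lastmod
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Calling build_sitemap_index(["https://www.example.com/sitemap-01.xml", "https://www.example.com/sitemap-02.xml"]) returns a string that can be written to a file and referenced from robots.txt.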
For example, the Amazon.com website used this in the past, and it had millions of pages. It seems they stopped ;) perhaps because there are billions of pages to list?
And BTW, Google understands this "index" file. You can put a link to it in robots.txt; it is not necessary to submit it explicitly.
Sitemaps are "nice to have" for SEO; there is no need to include pagination and taxonomy pages in sitemaps. In this case (if we ignore generated pages), it is quite easy to write a tool that generates the sitemap as part of the Hugo build (some JavaScript, run as part of a Node command, etc.); it could be part of a custom theme.
Workaround:
- Let Hugo create the default sitemap.xml
- Download it and split it into multiple files, 50,000 URLs each
- Follow https://www.sitemaps.org/protocol.html and create the necessary files accordingly, placing them in the "/static" folder
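The splitting step in this workaround could be scripted rather than done by hand. Below is a minimal sketch in Python, under the assumption that the input is a single <urlset> file such as the one Hugo generates by default. The names split_sitemap and write_parts, and the sitemap-NN.xml naming, are hypothetical choices for illustration, not part of Hugo.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the sitemaps.org protocol


def split_sitemap(xml_text, max_urls=MAX_URLS):
    """Split one <urlset> document into several, each with at most max_urls entries."""
    ET.register_namespace("", NS)  # keep the default xmlns instead of an "ns0:" prefix
    root = ET.fromstring(xml_text)
    urls = root.findall(f"{{{NS}}}url")
    parts = []
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element(f"{{{NS}}}urlset")
        urlset.extend(urls[start:start + max_urls])
        parts.append(ET.tostring(urlset, encoding="unicode"))
    return parts


def write_parts(parts, out_dir):
    """Write each part as sitemap-01.xml, sitemap-02.xml, ... (hypothetical naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, part in enumerate(parts, start=1):
        text = '<?xml version="1.0" encoding="UTF-8"?>\n' + part
        (out / f"sitemap-{i:02d}.xml").write_text(text, encoding="utf-8")
```

For example, reading the generated sitemap.xml, passing its contents to split_sitemap, and calling write_parts(parts, "static") would leave the child files where Hugo copies them into the site unchanged. A sitemap index listing those files still has to be created separately.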
Note: sitemaps are needed for documents that are not reachable, or not easily reachable, from the home page. For example, huge websites such as Amazon need sitemaps: the only other way to reach a product is via the search bar.
So, I don't think sitemaps are as important for static websites as for e-commerce... categories and pagination replace them.
As per documentation at https://gohugo.io/templates/sitemap-template/, we can explicitly use page front matter:
sitemap:
  changeFreq: ""
  disable: false
  filename: sitemap-01.xml
  priority: -1
Hugo also supports sitemapindex.xml generation.
A simple script can traverse your tons of documents and insert sitemap-01.xml for the first 50,000, sitemap-02.xml for the second, and so on. This is just a workaround, but Hugo has made huge progress since this ticket was initially created.
insert sitemap-01.xml for first 50,000, sitemap-02.xml
You are confusing site configuration with front matter override. You cannot override the filename in front matter. That's why the front matter override example in the documentation does not include filename.
OK, I didn't know that... but then, to confirm: we have the sitemap index feature, and we still don't have multi-index support? For now, I run a local build which generates a huge sitemap, then I split it manually, disable sitemap generation, and deploy the sitemaps from the "static" folder as a workaround.
With a multilingual project we create one sitemap index, and individual sitemaps per language (site). Regardless of whether a project is monolingual or multilingual, we don't split sitemaps based on the number of entries.
That's why this issue is open.
I think it's relatively clear what this issue is about. If you want to discuss workarounds, use https://discourse.gohugo.io/
One workaround could be to add your own sitemap template to your theme/project:
https://github.com/gohugoio/hugo/blob/master/tpl/tplimpl/embedded/templates/_default/sitemap.xml
And possibly filter out your 50k most interesting URLs from a SEO perspective ...
I have a 270k-term modern terminology dictionary, all English; why should I filter the "most interesting" terms? My workaround is simple: let Hugo generate the huge XML, then take scissors and cut it into 6 pieces; or just write a Java application which will generate what I need and place it in the "static" folder. I'll need an hour for that. Since it is too hard for Hugo ;)
Yes, multilingual support adds more complexity
Anyway, after some more thinking: sitemaps were invented for pages which are not reachable from the homepage. For Hugo-based sites, sitemaps are not needed at all; but that is my personal opinion.
I love the example of Amazon: they used sitemaps approx. ten years ago, but now they don't. Perhaps they prefer to upload product listings in a different, specialized format to Google and other sites.
Sorry for writing too much, but continuing logically: I had a past "price comparison" site where product pages were reachable only from search results pages; it was nonsense to have "pagination" for such a huge site. So, I used sitemaps to explicitly generate URLs where I wanted the Search Engine to land.
It's important to note that sitemaps are not necessary for typical Articles or blog sites with a well-structured menu/submenu/pagination. They are only required in specific cases, such as the one I encountered: a site with a few hundred thousand products, accessible solely through the Search Bar. In such instances, Google may not discover these pages due to the lack of a link route from Home to Child to Sub-Child, and so on. Therefore, sitemaps are particularly useful for managing large sites. For instance, I disabled pagination for my 270k dictionary site; it's not user-friendly to paginate the letter 'K' with 1000 links on a page, spread across 20 pages. In such cases, sitemaps can help to streamline the user ("robot" lol) experience.
Therefore, in Hugo, the use case for sitemaps is only for huge sites where we are forced to disable pagination.
Some other non-Hugo use cases for sitemaps: an SPA (Single-Page Application) that we want to make searchable, etc.