Multiple sitemap.txt for packages
The current sitemap handler has the following comment:
// Google wants the return page to have < 50,000 entries and be less than
// 50MB - https://support.google.com/webmasters/answer/183668?hl=en
// As of 2018-01-01, the return page is ~3,000 entries and ~140KB
// By restricting to packages that have been updated in the last two years,
// the count is closer to ~1,500
As of today, it has 12,581 entries and its size is ~567 KB. We should probably shard it by the first letter of the package name or something similar.
Yes, but let's shard by sha256 of the package name and update the comments to reference any specs we can find on this subject. So far I found:
- robots.txt: https://tools.ietf.org/html/draft-rep-wg-topic-00
- Sitemap protocol: https://www.sitemaps.org/protocol.html
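A rough sketch of what hash-based sharding could look like, written in Python purely for illustration. The shard count of 16 and the names `shard_of` and `group_by_shard` are hypothetical and not taken from the codebase:

```python
import hashlib
from collections import defaultdict

SHARD_COUNT = 16  # assumption: hypothetical shard count, not from the codebase


def shard_of(package_name: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a package name to a stable shard index using SHA-256."""
    digest = hashlib.sha256(package_name.encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce modulo shard_count.
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(package_names, shard_count: int = SHARD_COUNT):
    """Group package names by shard; each shard would back one sitemap file."""
    shards = defaultdict(list)
    for name in package_names:
        shards[shard_of(name, shard_count)].append(name)
    return shards
```

Compared with sharding by first letter, a hash should spread packages more evenly across files, since package names tend to cluster on a few common prefixes.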
> Yes, but let's shard by sha256 of package name
That means we should probably add a new field to Package?
Why? We can compute this fairly cheaply, can't we?
Oh, I see: we can't query by a computed property :)
Adding an index for this seems like overkill. Let's discuss it when we get there. For now, there is no rush to fix this.
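For when we do get there, one option to weigh is computing the shard once at write time and storing it, so the handler filters on a plain stored value instead of a computed one. A minimal sketch in Python, with an in-memory list standing in for the datastore. `PackageRecord`, `sitemap_shard`, and the shard count are all hypothetical, and whether the real datastore would need an index on such a field is exactly the open question:

```python
import hashlib
from dataclasses import dataclass, field

SHARD_COUNT = 16  # assumption: hypothetical shard count


def shard_of(name: str) -> int:
    """Stable shard index derived from the SHA-256 of the package name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


@dataclass
class PackageRecord:
    """Hypothetical stand-in for a stored Package entity."""
    name: str
    sitemap_shard: int = field(init=False)  # stored field, set once on creation

    def __post_init__(self):
        self.sitemap_shard = shard_of(self.name)


def packages_in_shard(records, shard: int):
    """Filter on the stored field, as a datastore equality query would."""
    return [r.name for r in records if r.sitemap_shard == shard]
```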
https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap#general-guidelines