Multiple sitemap.txt for packages

Open isoos opened this issue 6 years ago • 5 comments

The current sitemap handler has the following comment:

  // Google wants the return page to have < 50,000 entries and be less than
  // 50MB -  https://support.google.com/webmasters/answer/183668?hl=en
  // As of 2018-01-01, the return page is ~3,000 entries and ~140KB
  // By restricting to packages that have been updated in the last two years,
  // the count is closer to ~1,500

As of today, it has 12,581 entries and its size is ~567 KB. We should probably shard it by the first letter of the package name or something similar.
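For reference, a minimal sketch of how the limits quoted in the comment could be checked against a generated sitemap.txt. This is not the pub.dev handler; the function names and the one-URL-per-line layout are assumptions for illustration, and the limits are the 50,000 entries and 50 MB from the comment above.

```python
# Hypothetical helper, not part of the pub.dev codebase.
MAX_ENTRIES = 50_000            # entry limit quoted in the handler comment
MAX_BYTES = 50 * 1024 * 1024    # 50 MB size limit quoted in the handler comment


def sitemap_stats(urls):
    """Return (entry count, size in bytes) for a one-URL-per-line sitemap.txt."""
    body = "".join(url + "\n" for url in urls)
    return len(urls), len(body.encode("utf-8"))


def within_limits(urls):
    """True if the sitemap stays within both the entry and the size limit."""
    count, size = sitemap_stats(urls)
    return count <= MAX_ENTRIES and size <= MAX_BYTES
```

With the numbers above (12,581 entries, ~567 KB) a check like this would still pass, so sharding is about headroom rather than an immediate failure.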

isoos avatar Sep 16 '19 08:09 isoos

Yes, but let's shard by the sha256 of the package name, and update the comments to reference any specs we can find on this subject. So far I found:

  • robots.txt: https://tools.ietf.org/html/draft-rep-wg-topic-00
  • Sitemap protocol: https://www.sitemaps.org/protocol.html
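
A minimal sketch of what sharding by the sha256 of the package name could look like. It is an illustration under assumptions, not an agreed design: the shard count of 16, the use of the first 8 digest bytes, and the function names are all hypothetical.

```python
import hashlib


def shard_of(package_name, shard_count=16):
    """Map a package name to a stable shard index in [0, shard_count)."""
    digest = hashlib.sha256(package_name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer; assumed to be
    # enough to spread names roughly evenly across a small shard count.
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_packages(names, shard_count=16):
    """Group package names into shard_count lists, one per sitemap file."""
    shards = [[] for _ in range(shard_count)]
    for name in sorted(names):
        shards[shard_of(name, shard_count)].append(name)
    return shards
```

Unlike sharding by first letter, a hash spreads names roughly evenly regardless of how package names are distributed, and a given package always lands in the same shard as long as the shard count stays fixed.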

jonasfj avatar Sep 16 '19 09:09 jonasfj

> Yes, but let's shard by sha256 of package name

That means we should probably add a new field to Package?

isoos avatar Sep 16 '19 09:09 isoos

Why? We can compute this fairly cheaply, can't we?

jonasfj avatar Sep 16 '19 09:09 jonasfj

Oh, I see we can't query by a computed property :)

Adding an index for this seems like overkill. Let's discuss it when we get there. For now, there is no rush to fix this.

jonasfj avatar Sep 16 '19 09:09 jonasfj

https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap#general-guidelines

sigurdm avatar Aug 06 '24 07:08 sigurdm