
Automatic removal of ephemeral sitemaps

Open · sebastian-nagel opened this issue on Jul 24, 2020 · 0 comments

If a news site creates sitemaps with unique URLs on a daily basis (or at even shorter intervals), over time this leads to too many sitemaps being checked for updates, so that news articles get stuck in queues jammed with sitemaps. The unique sitemap URLs can stem from the robots.txt or from a sitemap index. Typical URL/file patterns of ephemeral sitemaps include (a heuristic detection sketch follows the list):

  • a timestamp in many variations:
    .../sitemap.xml?yyyy=2020&mm=02&dd=07
    .../sitemap-2017.xml?mm=12&dd=31
    .../sitemap-2019-04.xml
    .../sitemap?type=clanky-2019_9
    .../sitemap-201910.xml
    .../sitemap-news.xml?y=2018&m=03&d=19
    .../02-Sep-2019.xml
    .../articles_2019_06.xml
    .../sitemap_30-Nov-2019.xml
    .../sitemap_bydate.xml?startTime=2020-02-16T00:00:00&endTime=2020-02-22T23:59:59
    
  • a consecutive number, random ID, UUID, hash, etc.
    .../sitemap.xml?page=1424
    .../ymox96xuveov.xml
    /sitemaps/1151jawjodn3t.xml
    
  • or a combination of the above, possibly together with a news category:
    .../2019-05-13-0058/0817_8.xml
    .../world.xml?section_id=338&content_type=1&year=2017&month=9
    .../2019-12-22/sitemap.xml?page=1409
    

In the worst case, 100k or even millions of sitemaps are tracked for a single domain, which requires manually blocking or cleaning up the list of sitemaps in order to be able to fetch news articles and follow the recent sitemaps.
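
One possible remediation, sketched below under assumed parameters (the per-host cap of 100 and the least-recently-rediscovered eviction policy are illustrative choices, not the behavior of news-crawl), is to bound the number of sitemap URLs tracked per host:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of a per-host cap on tracked sitemaps: once the cap is
 * exceeded, the sitemap URL that was rediscovered least recently
 * is evicted. Illustrative only, not the project's implemented fix.
 */
public class SitemapCap {

    private static final int MAX_SITEMAPS_PER_HOST = 100; // assumed limit

    // access-ordered LinkedHashMap: iteration order tracks recency,
    // so the eldest entry is the least recently (re-)discovered sitemap
    private final Map<String, Long> sitemaps =
        new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > MAX_SITEMAPS_PER_HOST;
            }
        };

    /** Record a (re-)discovery of a sitemap URL for this host. */
    public void discovered(String sitemapUrl) {
        sitemaps.put(sitemapUrl, System.currentTimeMillis());
    }
}
```

Ephemeral sitemaps that stop reappearing in the robots.txt or sitemap index would then age out automatically, while stable sitemaps that are rediscovered regularly stay at the front of the map.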
