
noindex/nofollow on a per-page basis

Open · pierremanceaux opened this issue 4 years ago · 10 comments

Opening an issue, following this #coderedcms thread.

Prerequisites

To better understand how noindex/nofollow work, this is a good starting point.

Goal

Ability to control, on a per-page basis, the presence of the noindex and nofollow robots directives.

How do I imagine it?

Off the top of my head, the most needed feature would first be the ability to noindex a page, in order to have better control over which parts of a website should be indexed by search engines.

The most basic approach would be to have a checkbox in the "SEO" tab, like the following:

  • [ ] Exclude this page from being indexed by search engines

or another version for more technical users:

  • [ ] "noindex" this page (using "<meta>" tag)

It would then output <meta name="robots" content="noindex"> in the head of the page. That's it.

Edit: Exclude no-indexed pages from the sitemap. See this comment.
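
For illustration, here is a minimal sketch of the checkbox approach on a Wagtail page model. The class name ArticlePage and the field name robots_noindex are hypothetical placeholders, not part of wagtail-seo:

from django.db import models
from wagtail.core.models import Page  # "wagtail.models" on Wagtail 3+
from wagtail.admin.edit_handlers import FieldPanel  # "wagtail.admin.panels" on Wagtail 3+


class ArticlePage(Page):
    # Hypothetical per-page flag backing the "exclude from indexing" checkbox.
    robots_noindex = models.BooleanField(
        default=False,
        verbose_name="Exclude this page from being indexed by search engines",
    )

    promote_panels = Page.promote_panels + [
        FieldPanel("robots_noindex"),
    ]

The page's <head> template would then emit <meta name="robots" content="noindex"> whenever the flag is set.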

Suggestions to make it better

  • Let the user choose whether the noindex directive should be applied via a <meta> tag in the <head>, or via the X-Robots-Tag response header (see the sketch after this list). This could be a global (site-wide) setting, or local to each page; I'm not sure which would be best.
  • Add a similar checkbox to enable nofollow
  • Apply noindex or nofollow to the current page and all of its child pages. Useful when, for example, you have a group of pages used purely for SEA marketing that you want to keep isolated: for instance www.company.com/lp/, where every child page could inherit a noindex preset from /lp/.
  • A way to control which bots to target (tricky, and quite advanced I guess).
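
For the response-header variant mentioned in the first suggestion, a minimal sketch could override the page's serve() method. The LandingPage class and the robots_noindex field are again hypothetical placeholders:

from django.db import models
from wagtail.core.models import Page  # "wagtail.models" on Wagtail 3+


class LandingPage(Page):
    robots_noindex = models.BooleanField(default=False)

    def serve(self, request, *args, **kwargs):
        response = super().serve(request, *args, **kwargs)
        if self.robots_noindex:
            # X-Robots-Tag is honoured by major crawlers and also covers
            # non-HTML responses, unlike the <meta> tag.
            response["X-Robots-Tag"] = "noindex"
        return response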

pierremanceaux avatar Oct 01 '20 15:10 pierremanceaux

Would it make sense to also add the nofollow URLs into the robots.txt?

vsalvino avatar Oct 01 '20 15:10 vsalvino

@vsalvino I honestly don't know; I'm not sure what SEO best practices look like in 2020. I personally like the modularity of the meta tags in the head. It also makes debugging simple, since you can see everything in the page source; I don't imagine myself checking a robots.txt file to know what's excluded. An SEO expert could help you on this, I'm not the right person :)

pierremanceaux avatar Oct 01 '20 15:10 pierremanceaux

Would it make sense to also add the nofollow URLs into the robots.txt?

Yes, it would. robots.txt is still used by Google and others to determine what is allowed to be indexed, so I would definitely include this, either as part of the same feature or as a separate one.

moojen avatar Oct 07 '20 08:10 moojen

One more point I forgot to raise: the sitemap. Today we have a custom solution in place (far from perfect) that generates the sitemap from pages that do not have the noindex flag. It is apparently bad practice to list pages in the sitemap that should not be indexed.

It means that we cannot use Wagtail's built-in implementation to generate our sitemap (see https://docs.wagtail.io/en/v2.1.1/reference/contrib/sitemaps.html#basic-configuration). So I believe it would be important to take that into consideration when working on this issue.

pierremanceaux avatar Nov 23 '20 13:11 pierremanceaux

It would definitely make sense for us to provide a better sitemap, if Wagtail's is limiting. I'd be happy to review a PR @pierremanceaux if you would be willing to share your implementation?

vsalvino avatar Nov 23 '20 16:11 vsalvino

Hey @vsalvino, here is what we have for now. Keep in mind that this code is 4 years old and probably needs some polishing, but hopefully it helps! ;)

View

from django.conf import settings
from django.core.cache import cache
from django.http import HttpResponse
from django.views.decorators.cache import never_cache


@never_cache
def sitemap_view(request):
    # Cache the rendered XML per site. Note: request.site relies on Wagtail's
    # legacy SiteMiddleware; newer Wagtail versions use Site.find_for_request(request).
    cache_key = 'wagtail-sitemap:' + str(request.site.id)
    sitemap_xml = cache.get(cache_key)

    if not sitemap_xml:
        # Sitemap is the custom class shown below.
        sitemap = Sitemap()
        sitemap_xml = sitemap.render()

        cache.set(cache_key, sitemap_xml, getattr(settings, 'WAGTAILSITEMAPS_CACHE_TIMEOUT', 6000))

    response = HttpResponse(sitemap_xml)
    response['Content-Type'] = "text/xml; charset=utf-8"

    return response

Sitemap generation

from django.template.loader import render_to_string
from wagtail.core.models import Site  # "wagtail.models" on Wagtail 3+

from myapp.models import JobSinglePage  # project-specific page type; adjust the import path


class Sitemap(object):
    # Page types that should never appear in the sitemap.
    EXCLUDED_TYPES = [
        JobSinglePage
    ]
    template = 'sitemap.xml'

    @staticmethod
    def _get_urls():
        site = Site.objects.filter(is_default_site=True).select_related("root_page").get()
        # Only live, public pages that are not flagged "noindex" are included.
        # "basepage__seo_robot_meta" is a project-specific field lookup.
        pages_qs = site.root_page.get_descendants(
            inclusive=True
        ).live().public().exclude(basepage__seo_robot_meta__icontains="noindex").order_by('path')\
            .specific()

        for page in pages_qs.iterator():
            # TODO: replace this by filtering in the queryset
            if type(page) in Sitemap.EXCLUDED_TYPES:
                continue
            for url in page.get_sitemap_urls():
                yield url

    def render(self):
        return render_to_string(self.template, {
            'urlset': self._get_urls()
        })

Template

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% spaceless %}
{% for url in urlset %}
  <url>
    <loc>{{ url.location }}</loc>
    {% if url.lastmod %}<lastmod>{{ url.lastmod|date:"Y-m-d" }}</lastmod>{% endif %}
    <changefreq>weekly</changefreq>
  </url>
{% endfor %}
{% endspaceless %}
</urlset>
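
For completeness, the view would presumably be wired up in urls.py along these lines; the myproject.sitemaps module path is a placeholder:

from django.urls import path

from myproject.sitemaps import sitemap_view


urlpatterns = [
    path("sitemap.xml", sitemap_view, name="sitemap"),
]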

pierremanceaux avatar Nov 23 '20 16:11 pierremanceaux

Hi, per-page noindex and nofollow is super critical in SEO nowadays. SEOs mostly use robots.txt to block admin areas and disallow bots, not to exclude individual pages.

I hope this feature will be added to next release!

anefta avatar Nov 30 '20 02:11 anefta

Has this been achieved?

benlamptey-gocity avatar Jun 22 '22 10:06 benlamptey-gocity

Hi, checking in about 4 years later, wondering if this has been implemented? Or perhaps people are using some other solution?

gideonaa avatar Sep 13 '24 02:09 gideonaa

This has not been a priority or a need for us. However, if someone is willing to implement it, including tests and docs, I would be willing to review and merge it.

vsalvino avatar Sep 13 '24 16:09 vsalvino