wagtail-seo
noindex/nofollow on a per-page basis
Opening an issue, following this #coderedcms thread.
Prerequisites
To better understand how noindex/nofollow work, this is a good starting point.
Goal
Ability to control, on a per-page basis, the presence of the `noindex` and `nofollow` robot instructions.
How do I imagine it?
Off the top of my head, the most needed feature would first be the ability to `noindex` a page, in order to have better control over which parts of a website should be indexed by search engines.
The most basic approach would be to have a checkbox in the "SEO" tab, like the following:
- [ ] Exclude this page from being indexed by search engines
or another version for more technical users:
- [ ] "noindex" this page (using a `<meta>` tag)
It would then output `<meta name="robots" content="noindex">` in the `<head>` of the page. That's it.
Edit: Exclude no-indexed pages from the sitemap. See this comment.
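For illustration, here is a minimal sketch of what such a checkbox could look like on a Wagtail page model. The field name, panel placement, and import paths are assumptions (Wagtail 3.0+), not existing wagtail-seo API:

```python
# Hypothetical sketch of a per-page "noindex" checkbox; not wagtail-seo API.
from django.db import models
from wagtail.admin.panels import FieldPanel
from wagtail.models import Page


class ArticlePage(Page):
    # Editors tick this box in the SEO/promote tab to exclude the page.
    seo_noindex = models.BooleanField(
        default=False,
        verbose_name="Exclude this page from being indexed by search engines",
    )

    promote_panels = Page.promote_panels + [
        FieldPanel("seo_noindex"),
    ]
```

The page template's `<head>` would then render `<meta name="robots" content="noindex">` only when the box is checked.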
Suggestions to make it better
- Let the user choose if the `noindex` directive should be applied via a tag in the `<head>`, or via request headers (see the sketch after this list). Could be a global setting (site-wide), or local to each page; not sure what would be best.
- Add a similar checkbox to enable `nofollow`.
- Apply `noindex` or `nofollow` on the current page, and all child pages too. Useful when, for example, you have a pure SEA marketing group of pages that you want to keep isolated. For instance `www.company.com/lp/`: every child page could have `noindex` preset, inheriting this setting from `/lp/`.
- A way to control which bots to target (tricky, and quite advanced I guess).
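Regarding the request-header variant mentioned above, a minimal sketch of how it could work by overriding `Page.serve()` (the `seo_noindex` field is hypothetical, as in the earlier sketch):

```python
# Hypothetical sketch: emit the directive as an X-Robots-Tag response header
# instead of (or in addition to) a <meta> tag in the <head>.
from django.db import models
from wagtail.models import Page


class LandingPage(Page):
    seo_noindex = models.BooleanField(default=False)  # hypothetical field

    def serve(self, request, *args, **kwargs):
        response = super().serve(request, *args, **kwargs)
        if self.seo_noindex:
            # Honored by major crawlers in the same way as the <meta> tag.
            response["X-Robots-Tag"] = "noindex"
        return response
```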
Would it make sense to also add the nofollow URLs into the robots.txt?
@vsalvino I honestly don't know. I don't know what the SEO best practices look like in 2020. I personally like the modularity of the `meta` tags in the `<head>`. It also makes it simple to debug when you see everything in the page source; I don't imagine myself checking a robots.txt file to know what's excluded. An SEO expert could help you on this, I'm not the right person :)
> Would it make sense to also add the nofollow URLs into the robots.txt?
Yes it would. This is still used by Google and others to see what is allowed to be indexed, so I would definitely include it, either as part of the same feature or as a separate one.
One more point I forgot to raise: the sitemap. Today we have a custom solution in place (far from perfect) to generate the sitemap based on pages not having the `noindex` flag. It is apparently bad practice to have pages in the sitemap that should not be indexed.
It means that we cannot use the Wagtail implementation to generate our sitemap (see https://docs.wagtail.io/en/v2.1.1/reference/contrib/sitemaps.html#basic-configuration). So it would be important to take that into consideration when working on this issue, I believe.
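For what it's worth, one possible way to keep using Wagtail's `contrib.sitemaps` while excluding no-indexed pages might be to override `get_sitemap_urls()` on the page model; a minimal sketch, again assuming a hypothetical `seo_noindex` field:

```python
# Hypothetical sketch: opting a page out of Wagtail's contrib.sitemaps output.
from django.db import models
from wagtail.models import Page


class ArticlePage(Page):
    seo_noindex = models.BooleanField(default=False)  # hypothetical field

    def get_sitemap_urls(self, request=None):
        # Returning an empty list removes this page from the generated sitemap.
        if self.seo_noindex:
            return []
        return super().get_sitemap_urls(request=request)
```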
It would definitely make sense for us to provide a better sitemap if Wagtail's is limiting. @pierremanceaux, I'd be happy to review a PR; would you be willing to share your implementation?
Hey @vsalvino, here is what we have for now. Keep in mind that this code is 4 years old and probably needs some polishing, but hopefully it helps! ;)
View
from django.conf import settings
from django.core.cache import cache
from django.http import HttpResponse
from django.views.decorators.cache import never_cache


@never_cache
def sitemap_view(request):
    # Cache the rendered sitemap per site; `request.site` is populated by the
    # legacy Wagtail site middleware.
    cache_key = 'wagtail-sitemap:' + str(request.site.id)
    sitemap_xml = cache.get(cache_key)
    if not sitemap_xml:
        sitemap = Sitemap()
        sitemap_xml = sitemap.render()
        cache.set(cache_key, sitemap_xml,
                  getattr(settings, 'WAGTAILSITEMAPS_CACHE_TIMEOUT', 6000))
    response = HttpResponse(sitemap_xml)
    response['Content-Type'] = "text/xml; charset=utf-8"
    return response
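For context, the view above would presumably be wired up with something like this (module path and URL name are assumptions):

```python
# Hypothetical urls.py entry exposing the custom sitemap view at /sitemap.xml.
from django.urls import path

from .views import sitemap_view  # import path assumed

urlpatterns = [
    path("sitemap.xml", sitemap_view, name="sitemap"),
]
```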
Sitemap generation
from django.template.loader import render_to_string
from wagtail.core.models import Site  # `wagtail.models` on Wagtail >= 3.0

from .models import JobSinglePage  # project-specific page type; import path assumed


class Sitemap(object):
    # Page types to leave out of the sitemap entirely.
    EXCLUDED_TYPES = [
        JobSinglePage,
    ]

    template = 'sitemap.xml'

    @staticmethod
    def _get_urls():
        site = Site.objects.filter(is_default_site=True).select_related("root_page").get()
        # Live, public pages under the root that are not flagged "noindex".
        pages_qs = (
            site.root_page.get_descendants(inclusive=True)
            .live()
            .public()
            .exclude(basepage__seo_robot_meta__icontains="noindex")
            .order_by('path')
            .specific()
        )
        for page in pages_qs.iterator():
            # TODO: replace this by filtering this in the queryset
            if type(page) in Sitemap.EXCLUDED_TYPES:
                continue
            for url in page.get_sitemap_urls():
                yield url

    def render(self):
        return render_to_string(self.template, {
            'urlset': self._get_urls(),
        })
Template
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% spaceless %}
{% for url in urlset %}
<url>
<loc>{{ url.location }}</loc>
{% if url.lastmod %}<lastmod>{{ url.lastmod|date:"Y-m-d" }}</lastmod>{% endif %}
<changefreq>weekly</changefreq>
</url>
{% endfor %}
{% endspaceless %}
</urlset>
Hi, per-page noindex and nofollow is super critical in SEO nowadays. SEOs mostly use robots.txt to block admin areas and disallow bots, not individual pages.
I hope this feature will be added to the next release!
Has this been achieved?
Hi, checking in about 4 years later, wondering if this has been implemented? Or perhaps people are using some other solution?
This has not been a priority or a need for us. However, if someone is willing to implement it, including tests and docs, I would be willing to review and merge it.