Requirements for New XML sitemap Endpoint

Open NightOwl888 opened this issue 9 years ago • 9 comments

I am considering jumping ahead and creating a new XML sitemap implementation that addresses several concerns that have been brought to my attention instead of waiting until the next major version release. This can be done by creating this functionality and leaving it disabled by default, but allowing it to replace the existing implementation using a configuration setting.

  1. The current design offers no way to include alternate URLs that are not registered in the SiteMap. Since the current SiteMap only scales to tens of thousands of physical nodes, many people use preservedRouteParameters to register only a subset of their actual pages with the SiteMap. However, with this technique there is currently no way to add all of the URLs to the sitemap XML without resorting to a custom implementation.
  2. The current design doesn't provide support for any of the custom content types - see issue #138.
  3. The current design only has support for sending the output of the XML to the HTTP response. It should also be possible to output the XML as a file/set of files (think really, really, big sitemaps) by executing a server-side command.
  4. The current design relies on all of the elements of the XML to be loaded in memory at once (in a collection), rather than streaming the result from the source to the destination one node at a time. This limits scalability.
  5. The current design relies on loading up all of the nodes in order to get a total node count before deciding to load only the current page or current page index. In a nutshell, there is no paging support and there is no way for the developer to control which node goes on which page.
  6. The current design provides no way to override the default maximum number of nodes per XML file and does not warn you if the XML is larger than the 50MB maximum.
  7. The current design doesn't lend itself to DI very well.
  8. The naming conventions used in the current design are incorrect. The term is supposed to be "XML sitemap".

So to address these issues, my thought is to add a new provider model to address the lack of extensibility and scalability of this functionality. Feel free to chime in if there is something you would like to see listed here and I will update the list in this post as I get feedback.

  1. The design should support, but not rely on, the existing SiteMap infrastructure to operate. This will likely entail factoring out some of the properties of ISiteMapNode (specifically the ones that deal with the sitemaps XML and resolving the URL) into a new interface that will be used by the new sitemaps XML functionality.
  2. Visibility providers should still be supported, but only as it applies to URLs that are part of a SiteMap.
  3. It should be possible to provide a different sitemaps XML stream for individual SiteMap instances or to combine the results of multiple SiteMap instances into a single sitemaps XML stream.
  4. The design should provide a way to use an automatic paging functionality or manual paging functionality so it can be driven by database or other data.
  5. The part that does automatic paging should provide a way to fetch record counts instead of actual records before deciding how to break into separate pages. The page size should be configurable.
  6. Individual records should be provided to the interface one at a time rather than in sets to allow for high scalability.
  7. The design should support adding additional content types to the XML schema. This should be a pluggable provider model that can support more content types as the Internet evolves.
  8. There should be a way to specify not to use custom content types for crawlers that do not support them (so crawlers that aren't programmed correctly can be supported without making them crash).
  9. All SiteMapNodeProvider implementations should support adding custom content types.
  10. There should be a way to execute a command to send the XML stream to a file.
  11. The webmaster should be notified if the byte size of the output goes over the maximum 50MB, or when using manual paging and the page size goes over the specified limit. I still haven't worked out exactly how this feedback should work.
  12. RSS format should also be supported for video feeds. Note that Bing supports video feeds in RSS format.
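
To make the streaming, paging, and size-warning requirements above more concrete, here is a minimal sketch. It is written in Python purely for brevity and is not MvcSiteMapProvider code: `write_pages`, `open_page`, and the parameter names are hypothetical, and the 50,000 default page size is an assumption taken from the sitemaps.org per-file URL limit. Only one URL is held in memory at a time, and a warning is collected whenever a page exceeds the byte limit.

```python
import io
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_BYTES = 50 * 1024 * 1024  # the 50MB limit mentioned in this issue


def write_pages(urls, open_page, page_size=50000, max_bytes=MAX_BYTES):
    """Stream URLs into numbered pages and return a list of size warnings.

    urls      -- any iterable yielding one absolute URL at a time
    open_page -- hypothetical callback: page number -> writable text stream
                 (the caller owns the stream and is responsible for closing it)
    """
    warnings = []
    page, count, size, out = 0, 0, 0, None

    def emit(text):
        nonlocal size
        out.write(text)
        size += len(text.encode("utf-8"))

    def close_page():
        nonlocal out
        emit("</urlset>\n")
        if size > max_bytes:
            warnings.append(f"page {page} is {size} bytes (limit {max_bytes})")
        out = None

    for loc in urls:
        if out is None:  # start a new page lazily, only when there is data
            page += 1
            count = size = 0
            out = open_page(page)
            emit('<?xml version="1.0" encoding="UTF-8"?>\n')
            emit(f'<urlset xmlns="{SITEMAP_NS}">\n')
        emit(f"  <url><loc>{escape(loc)}</loc></url>\n")
        count += 1
        if count >= page_size:
            close_page()
    if out is not None:
        close_page()
    return warnings
```

Because `open_page` decides where each page goes, the same loop could write to the HTTP response or to a set of files on disk.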

Anyway, I would appreciate your feedback as to what additional features (if any) you require and which features are most important to you.

NightOwl888 avatar Aug 09 '14 16:08 NightOwl888

@maartenba

I have made some progress on this. Namely, I have a prototype that:

  1. Can plug in specialized content types and write them out (including their namespace) into the same XML stream.
  2. Has a provider model with an automatic pager that dispatches paging instructions to specific provider(s) depending on the page requested. In other words, it will efficiently page across multiple sets of URL data regardless of the size of those sets.
  3. Has an automatic index page that is displayed when no page number is specified and the total page count is greater than 1.
  4. Uses streaming (1 <url> node in memory at a time) all the way through from the provider to the output.
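
For readers wondering how a pager can dispatch across several providers without loading every record, here is one possible sketch. It is not the prototype's code: it is Python for brevity, and `ListProvider`, `count`, `fetch`, `total_pages`, and `page_urls` are hypothetical names. The idea is that each provider reports only a record count, and the pager works out which slice of which provider falls on the requested page.

```python
class ListProvider:
    """In-memory stand-in for a provider that would normally query a database."""

    def __init__(self, urls):
        self._urls = list(urls)

    def count(self):
        return len(self._urls)

    def fetch(self, skip, take):
        # A real provider would translate this into an OFFSET/LIMIT style query.
        return iter(self._urls[skip:skip + take])


def total_pages(providers, page_size):
    """Number of pages needed, computed from counts only (ceiling division)."""
    total = sum(p.count() for p in providers)
    return -(-total // page_size)


def page_urls(providers, page, page_size):
    """Yield the URLs for a 1-based page, asking each provider only for its slice."""
    start = (page - 1) * page_size  # global offset of the first record on the page
    remaining = page_size
    for provider in providers:
        n = provider.count()
        if start >= n:  # the page begins after this provider's records
            start -= n
            continue
        take = min(remaining, n - start)
        yield from provider.fetch(start, take)
        remaining -= take
        start = 0  # later providers are read from their beginning
        if remaining == 0:
            return
```

With this shape, an index page would be served when no page number is given and `total_pages(...)` is greater than 1.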

However, I have run into a snag. In order to justify putting the time into this it would be best if I could come up with some results. I have 2 legacy sites that don't currently have sitemaps XML that could benefit from this. However, neither of these projects uses MVC :(. One of them is currently a .NET 2.0 website (that I am planning to upgrade soon) and the other is a .NET 3.5 website.

Hence the dilemma: Should this functionality be separated out into its own project and NuGet package?

Pros:

  1. The sitemap functionality could be used by any ASP.NET website or even potentially background processing Windows Services.
  2. IMHO, this functionality never really belonged inside of MvcSiteMapProvider.

Cons:

  1. It would be difficult to integrate this functionality into MvcSiteMapProvider's ISiteMapNode interface and ISiteMapNodeProvider implementations without creating a dependency on this other project.
  2. Separating the functionality out likely means changing the constructors to have more than one overload as specified in "DI-Friendly Library". External DI containers differ in which overload they choose by default, so they would require additional explicit DI code in our modules to wire this up.

Possible Options:

  1. Create sitemaps functionality as a separate project/NuGet package and add it as a dependency to MvcSiteMapProvider.
  2. Add sitemaps functionality to MvcSiteMapProvider and live with the fact that MVC is a dependency even though I don't need it in my projects (I could turn off the automatic launch of webactivator by specifying an external DI container in web.config). I could live with it, but others might not like this approach.
  3. Create sitemaps functionality as a separate project/NuGet package AND add it to MvcSiteMapProvider. In other words, maintain the same codebase in 2 places.
  4. Same as 3 except add another project to the MvcSiteMapProvider solution that contains linked code files to MvcSiteMapProvider so it actually shares the same code.
  5. Create sitemaps functionality as a separate project and add the dependency internally using ILMerge or adding it as a resource.
  6. Create a separate project/NuGet package and move the shared dependencies there, then make MvcSiteMapProvider and the sitemaps functionality into separate NuGet packages that can be installed independently of each other.

Sadly, 1, 2 and 3 are looking like the most attractive options because of the amount of work it would take to get from where I am now to there. 6 sounds great on the surface, but means there should be another package (a 4th package) to integrate the sitemaps functionality with MVC (url resolver, routes, ActionResult, and Controller) and make it automatically launch and interact with MvcSiteMapProvider.

Thoughts?

NightOwl888 avatar Aug 19 '14 09:08 NightOwl888

The reason it is in there is that, since MvcSiteMapProvider already knows all nodes of your web app, it should be easy to generate the sitemap XML. I've always thought of it as an optional feature: if you need other/better sitemap XML for search engines, either roll your own or find something more suitable on NuGet.

It could essentially be something that is not part of MvcSiteMapProvider, at all. A small bridge package could be created to integrate both.

  • GoogleSitemapXml package (or whatever it would be named)
  • GoogleSitemapXml.MvcSiteMapProviderBridge (would contain mapping from ISiteMapProvider to the package's format of generating them)
  • MvcSiteMapProvider.MVCx could depend on this bridge package so that the functionality is always there and pulls in the entire tree of dependencies.

maartenba avatar Aug 19 '14 10:08 maartenba

I looked at all of the sitemap NuGet packages already - not a single one exists that does not require MVC. Also, the only library that streams the output instead of loading it all up in memory at once requires you to put all of the data in the ASP.NET .sitemap format. Only one of these projects uses ASP.NET sitemaps and in that case it has 4 different .sitemap files, which that library doesn't have any support for.

MvcSiteMapProvider doesn't necessarily have the entire page structure of the site - specifically if the preservedRouteParameters property/attribute is used. That is one reason why I am going down this road - so there is a provider model to allow those URLs to be added by other means to address this shortcoming.

I like your idea, but unfortunately it doesn't address the entire problem.

To be complete, the sitemaps functionality will need to access some types that are currently in MvcSiteMapProvider.dll (namely, IUrlPath, UrlPath, IBinding, Binding, and all of the other types that support port bindings). Without that, the sitemaps functionality will have no way to resolve relative URLs and app relative URLs to absolute URLs which will need to include the protocol (and port number if it is not 80 or 443). IMHO resolving URLs is a core part of this process, so it should be included in the library or one of its dependencies rather than leaving it up to the consumer to decide how to deal with that.
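
As a rough illustration of the URL resolution described here (not the actual UrlPath or Binding code), the sketch below turns relative and app-relative URLs into absolute ones. It is Python for brevity, `absolute_url` and its parameters are hypothetical, and it assumes the port is omitted only when it is the default for the chosen protocol (80 for HTTP, 443 for HTTPS).

```python
from urllib.parse import urljoin

# Assumption: a port is left out only when it is the default for the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def absolute_url(url, scheme, host, port, app_root="/"):
    """Resolve a relative or app-relative ("~/...") URL to an absolute URL."""
    if url.startswith("~/"):
        # App-relative: anchor the path at the application's virtual root.
        url = app_root.rstrip("/") + "/" + url[2:]
    authority = host if DEFAULT_PORTS.get(scheme) == port else f"{host}:{port}"
    # urljoin leaves URLs that are already absolute untouched.
    return urljoin(f"{scheme}://{authority}/", url)
```

For example, `absolute_url("/about", "http", "example.com", 8080)` keeps the non-standard port in the result, while the same call with port 80 drops it.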

Fortunately, the types can be moved to another DLL as long as they remain in the same namespace, but this basically means we are talking about doing item 6 from the list above. If we did this, I think it would be best to make another library to move the dependencies into rather than make one depend on the other. That would allow them to be installed individually. But naming the library is a bit of a dilemma. The namespaces for the current types cannot change without breaking something, but starting the name of a library with MVC that doesn't have anything to do with MVC doesn't feel right.

Copying and pasting these types into the sitemaps library is also an option (in which case we could follow through with your idea), but doesn't feel right. However, this is looking like an option since the UrlPath contains a lot of garbage that isn't needed (including a reference to the IMvcContextFactory, which is a kludge) and the non-standard port resolving is an edge case.

Assuming we did go with item 6 above, the NuGet dependency chain would look like this.

  • MvcSiteMapProvider.Shared (no dependencies) - project for sharing types between libraries.
  • MvcSiteMapProvider.MVCx (depends on MVC 2, 3, 4, or 5 and MvcSiteMapProvider.Shared) - same as MvcSiteMapProvider.MVCx now.
  • SitemapXml (depends on MvcSiteMapProvider.Shared) - new library with sitemap XML functionality.
  • SitemapXml.MVC (depends on SitemapXml, MVC 1 or higher, and WebActivator) - new library with sitemap XML functionality for MVC such as URL resolving, Routing, Controller, and ActionResult integration.
  • MvcSiteMapProvider.MVCx.SitemapXml (depends on MvcSiteMapProvider.MVCx and SitemapXml.MVC) - contains configuration to make MvcSiteMapProvider interact with SitemapXml.MVC

I am not entirely sure the MvcSiteMapProvider.MVCx.SitemapXml packages would be necessary, because MvcSiteMapProvider.MVCx and SitemapXml.MVC could potentially share the types in the MvcSiteMapProvider.Shared library to make them interact. Either way, this is seeming like way too much extra work for the payoff, and that doesn't even take the external DI aspect into consideration.

Do note that I am already planning to make a separate DLL (MvcSiteMapProvider.WebActivator.dll) that contains just the reference to WebActivator and the one PostApplicationStartMethodAttribute in the project, and to put that DLL into the MvcSiteMapProvider.MVCx packages.

Using the MvcSiteMapProvider.MVC2 package in my legacy projects is (sadly) seeming like the best option. We could use your idea, but it will require some copy and paste of code and parallel maintenance.

As for naming, it would be most convenient to bring them all under the MvcSiteMapProvider umbrella. Except for the fact that I want to create a library that doesn't depend on MVC. But then, adding an MVC 2 reference to a .NET 3.5 project that doesn't already have one isn't likely to break anything, so maybe adding this dependency isn't as big of a deal as I think.

(shrug)

NightOwl888 avatar Aug 19 '14 13:08 NightOwl888

Hi,

I'm not sure if this would be a new feature, a bug, or me just misunderstanding SEO requirements. I have an MvcSiteMapNode for http://www.madebybees.co.uk/ but would like a canonicalUrl pointing to https://www.madebybees.co.uk (note the https). As soon as I add anything to do with SEO canonicalization, the node is no longer put in my sitemap.xml.

I can't see how I can use the canonicalization attributes in the MvcSiteMapProvider to allow for a single source of truth (and still have the node in the sitemap.xml).

canonical url... https://www.madebybees.co.uk

A user might navigate to...

http://www.madebybees.co.uk
http://madebybees.co.uk
https://www.madebybees.co.uk
https://madebybees.co.uk

Thanks, James

jxl98c avatar Sep 26 '14 08:09 jxl98c

The use case that the canonical URL functionality in MvcSiteMapProvider was designed for was to allow 2 different nodes (mapping to 2 different controller actions) to share the same content. This allows for users to navigate from 2 different categories to the same page without using a duplicate URL (which would create ambiguity when selecting the "current" node). It expects that one node in the SiteMap will specify the key or URL of another node in the SiteMap (or URL from outside of MVC), but was not meant to cover cases where the exact same controller action is being served on multiple URLs.

Maybe there is a feature request here somewhere, but since specifying the canonical tag on pages that canonize themselves is something that can be fairly easily accomplished using regular MVC tools, and doing so does not conflict with the built-in functionality, I am not seeing the benefit of adding it to MvcSiteMapProvider. I guess there is one if you consider putting all of your SEO stuff in one place a benefit, but that is about it.

As for your use case, I would recommend using a 301 redirect from http://madebybees.co.uk to http://www.madebybees.co.uk rather than serving the same content on both because you should aim to ensure only one of them is linked to by users as well as being the only one in the search engine index. You should, if possible, also serve all of your indexable content on either HTTP or HTTPS and stick with the choice. The canonical tag is a tool to cover the cases where it is not practical, technically challenging, or not possible to use a 301 redirect (HTTP/HTTPS is sometimes one of those cases depending on the technical or business requirements of the site).

NightOwl888 avatar Sep 26 '14 19:09 NightOwl888

Hi,

Thank you for the speedy response.

I had put the 301s in place yesterday after I posted the comment, so my content is really only indexed on https:// and www.

The main confusion, I think, was because I was seeing nodes on the normal sitemap but not on the sitemap.xml endpoint, and I spent ages searching around before I realised what the issue was. I just thought I'd mention it in passing though.

Kind Regards,

James

jxl98c avatar Sep 27 '14 07:09 jxl98c

After contemplating this some more, the logic is definitely wrong for the XML sitemap - it currently excludes the node if any canonical URL or Key is configured. It should only exclude it if the canonical URL is different from the node URL (using a case-sensitive comparison).
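
The corrected rule can be sketched as follows. This is only an illustration in Python, not the library's implementation: `include_in_xml_sitemap` is a hypothetical name, and it assumes both values have already been resolved to absolute URLs before they are compared.

```python
from typing import Optional


def include_in_xml_sitemap(node_url: str, canonical_url: Optional[str]) -> bool:
    """Decide whether a node belongs in the XML sitemap.

    Assumes node_url and canonical_url are already resolved to absolute URLs.
    """
    if not canonical_url:
        # No canonical URL configured: the node is its own canonical page.
        return True
    # Case-sensitive comparison: exclude only when the canonical points elsewhere.
    return canonical_url == node_url
```

Under this rule a node whose canonical URL points at itself stays in the sitemap, while a node that defers to a different URL is left out.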

However, it would be pointless to fix that if the canonical tag is not functioning the way you have configured it. You didn't make it very clear whether the canonical tag was functioning in the HTTP/HTTPS case. Was it? Also, could you post the node configuration you were using when you tried it?

NightOwl888 avatar Sep 27 '14 09:09 NightOwl888

Hi,

The canonical tag was functioning perfectly and was always visible within my respective page as https://www.madebybees.co.uk/…, but I was only using it for the http and www. SEO differences, so now that I've got the 301s in place I have no need for the canonical URL and my configuration has reduced to…

<mvcSiteMap xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://mvcsitemap.codeplex.com/schemas/MvcSiteMap-File-4.0" xsi:schemaLocation="http://mvcsitemap.codeplex.com/schemas/MvcSiteMap-File-4.0 MvcSiteMapSchema.xsd">

Prior to this every node had the following also set (from memory)…

canonicalProtocol="https"

canonicalHostname="www.madebybees.co.uk"

canonicalUrl=[same relative path as node URL]

As soon as any of those tags appeared, the node was removed from the sitemap.xml endpoint, but the meta tag was bang on perfect. Thinking about it though, if I navigated to https://www.madebybees.co.uk/sitemap.xml I should have had those nodes appear because the canonical URL did match the current protocol and host, right? That wouldn't have fixed my imagined problem, but it is probably the issue I think you are thinking of.

If you need me to try anything then let me know and I can tool around with my site and get back to you with the results.

Thanks,

James

jxl98c avatar Sep 27 '14 10:09 jxl98c

So yeah, to summarise…

The canonical tag always worked and was correct in the page header

The node would never appear in the sitemap.xml, irrespective of match or otherwise.

jxl98c avatar Sep 27 '14 10:09 jxl98c