New component router can't parse old URLs

Open Bakual opened this issue 7 years ago • 73 comments

Steps to reproduce the issue

  • Install Joomla with testing data. Leave "Modern/Experimental" Router disabled.
  • Open the "Article Category List" menu item (/article-category-list.html). Leave that page open.
  • In a second tab go to the com_content Options in the backend and set the "URL Routing" parameter (in the "Integration" tab) to "Experimental".
  • Go back to the first tab with the open "Article Category List" and try some links there (open them in new tabs). Some of those "legacy" links will work, some (e.g. "Getting Started") will give you a 404.

Expected result

All links should still be parsed

Actual result

Only the "correct" links according to new router are parsed, the others are discarded.

System information (as much as possible)

Staging from 2017-03-21

Additional comments

The example article link "Getting Started" is currently generated by the legacy router as /getting-started/19-sample-data-articles/joomla/22-getting-started.html. This link is actually wrong and should be /getting-started.html, since it is a direct match with a menu item. Our current code gets that wrong and the new router would get it right. However, our current code is able to parse both the wrong and the correct URL just fine and give the expected page. The new router will break those "wrong" links, which IMHO is quite a major B/C break with a big impact on search engines and incoming links in general. You could argue that it's an option the admin has to enable, but that is only half of the truth. With 4.0, there will be no such option anymore, and we will face a lot of broken links after either enabling the router option or upgrading to 4.0. I think both are unacceptable. There is also no real migration path.

Now what I would expect is the following:

  • New Router is always enabled
  • If a URL can't be parsed by the new router, there is some fallback code which tries with the legacy parsing rules
  • If legacy can parse the URL successfully, the wrong URL will be added to com_redirect with the new correct URL as target and a 301 redirect will be executed. This way, we don't lose any visitors and search engines will update their links.
  • If legacy can't parse it as well, a 404 is issued of course.
  • With 4.0, we can drop the fallback code if needed (personally I would give more time) or make it optional. Admins can then choose which "old" URLs they still want to be working by simply checking in com_redirect the existing redirects.
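
The fallback flow in the list above can be sketched roughly as follows. This is a minimal illustration in Python rather than Joomla's PHP, and every name in it is hypothetical: `resolve`, the parser and builder callbacks, and the `redirects` dict that stands in for the com_redirect table are assumptions, not existing Joomla APIs. The toy parsers only know about article 22 from the example above.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Query = Dict[str, Any]
Parser = Callable[[str], Optional[Query]]


def resolve(
    path: str,
    modern_parse: Parser,
    legacy_parse: Parser,
    modern_build: Callable[[Query], str],
    redirects: Dict[str, str],
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, location) for a request path."""
    # The new router always gets the first try.
    if modern_parse(path) is not None:
        return 200, path
    # Fallback: try the legacy parsing rules.
    query = legacy_parse(path)
    if query is None:
        # Neither rule set understands the URL.
        return 404, None
    # Record old -> new (stand-in for com_redirect) and redirect.
    target = modern_build(query)
    redirects[path] = target
    return 301, target


# Toy data, assumed for illustration: article 22 sits on a menu item.
CANONICAL = {22: "/getting-started.html"}


def toy_modern_parse(path: str) -> Optional[Query]:
    # Strict: only the canonical URL is accepted.
    for article_id, url in CANONICAL.items():
        if url == path:
            return {"view": "article", "id": article_id}
    return None


def toy_legacy_parse(path: str) -> Optional[Query]:
    # Legacy-style rule: trust the numeric prefix of the last segment.
    prefix = path.rsplit("/", 1)[-1].split("-", 1)[0]
    if prefix.isdigit() and int(prefix) in CANONICAL:
        return {"view": "article", "id": int(prefix)}
    return toy_modern_parse(path)


def toy_modern_build(query: Query) -> str:
    return CANONICAL[int(query["id"])]
```

With these toy callbacks, the old long URL would answer with a 301 to /getting-started.html and leave one entry in `redirects`, while a path that neither parser accepts still ends in a 404.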

This way, we would have 100% B/C plus an easy migration path without losing any incoming traffic.

Without that, I think we will face our next big Joomla drama when site owners realise the fancy new router will break some of their external links and Google Webmaster Tools starts listing a lot of 404s.

Bakual, Mar 21 '17 21:03

Isn't that why the new router is not the default and has a warning message? I would say that this is expected behaviour?

brianteeman, Mar 21 '17 23:03

It's a known issue and apparently expected behavior by the dev. But not by the user. And it certainly doesn't mean it is the correct thing to do.

Also, with 4.0 there will be no option and no warning anymore. Thus the admin has no real choice. He will have to break the links sooner or later and we don't help him find out which links that may be.

Seriously? That's our plan and expected behaviour? We can do better than that for sure.

Bakual, Mar 22 '17 06:03

For reference, from https://developer.joomla.org/development-strategy.html#backward_compatibility

6.1.8 URLs

Any change to a URL that will give a 404 (or some other error) where it previously gave a 200 is a break in backwards compatibility. However, if the change results in a redirect to a new URL (which gives a 200) then that is acceptable.

In general, if a URL is changed then provided the new URL delivers the exact same resource rendered in the same way then that is not considered to be a break in backwards compatibility. For example, changing the order of the arguments in the query part of a URL is not considered to be a break.

chrisdavenport, Mar 22 '17 07:03

> Any change to a URL that will give a 404 (or some other error) where it previously gave a 200 is a break in backwards compatibility

Not if the old URL was falsely 200, e.g. @Bakual's example in the description. That's not a valid URL; the current router returns something valid which IS wrong! This wrong behaviour CANNOT be supported, that was one of the goals of the new router: to be a lot stricter than the loose one we currently have! My 2c

dgrammatiko, Mar 22 '17 09:03

> Not if the old URL was falsely 200, e.g. @Bakual's example in the description. That's not a valid URL; the current router returns something valid which IS wrong!

That statement isn't true. It wasn't falsely a 200. It was a valid URL generated by our current router and is correctly parsed and gives the expected result. So it is not the URL it should have generated but it is a valid URL. The current router doesn't return "something". It returns the correct and expected page.

> This wrong behaviour CANNOT be supported, that was one of the goals of the new router: to be a lot stricter than the loose one we currently have!

I can live with that as the end goal (although I think it's stupid since site owners prefer visitors and not 404s), but I don't agree with doing that without any possibility for site owners to mitigate the effects of it.

Bakual, Mar 22 '17 10:03

Educate people and then they will be fine. Tell them to create a sitemap of the old site, then create another one when they upgrade to the new system. Then explain to them how to connect the dots (map the old links to the new). The tools are widely available...

Similar to this ~~problem~~ is the UX improvement task of the back end. If we really want to improve (and not change some colours or some paddings) then we will end up with different workflows (that end users can't even imagine, therefore user surveys are useless). But then again I might be wrong on both, time will tell...

dgrammatiko, Mar 22 '17 10:03

> Educate people and then they will be fine. Tell them to create a sitemap of the old site, then create another one when they upgrade to the new system. Then explain to them how to connect the dots (map the old links to the new). The tools are widely available...

Seriously??!! That's the recommended solution? Wow...

Backend is another topic. Changing workflows is fine if it is an improvement. That's not similar at all.

Bakual, Mar 22 '17 10:03

Have to agree that your suggestion is not a solution at all. It might be just about OK on a site with just a few pages (although that site probably won't be affected anyway), but it's completely impractical to suggest doing that on a site with even a few hundred pages, never mind one with thousands.

brianteeman, Mar 22 '17 10:03

@brianteeman I'm guessing here that anyone that wants to move to the new router (is not forced to do so) understands the impact of that change.

dgrammatiko, Mar 22 '17 10:03

> (is not forced to do so)

We will be forcing it with 4.0. It's not optional at all.

Bakual, Mar 22 '17 11:03

Any plan which mandates that the current broken URLs accepted by the routing system keep working is in my eyes not a valid plan. By that logic I can craft the URL https://www.joomla.org/announcements/6-joomla-leadership-team.html which results in a 200 response, gives me exactly the body content that I'm looking for (even if it is now wrapped in the incorrect category/menu configuration), and therefore by your argument must continue to work or automagically redirect. Even funnier, this isn't a URL that will ever get generated within the Joomla application, but if you know anything about how wonky the current router is, you know exactly how to craft URLs in such a way as to get mixed pages like this which just work.

Sooner or later we have to cut the technical debt and we have to address some of the underlying issues users have with the routing system. One of the most frequent groans is people manage to get "duplicate content" because there are a plethora of URLs you can use to get to a page if you know what you're doing (https://www.joomla.org/component/content/category/6-joomla-leadership-team.html is another perfectly valid mutation of the leadership page but again wrapped with the wrong menu data). We need to stop having a system that allows you to mutate the URL structure and land on a valid page; this system moves in that direction.

Yes, it does mean that users will require additional education and additional work to validate their links. Yes, I get this is not optimal user experience. But short of always supporting routing what are very obviously FUBAR URLs to the right content within our code, there is no fix for that.

mbabker, Mar 22 '17 12:03

For the record, I have absolutely no issue with making URLs that "work" today but cannot be "generated" by Joomla no longer work.

brianteeman, Mar 22 '17 12:03

@mbabker Michael, I'm not saying to keep the old URLs working forever. I just want to have a way site admins realistically can redirect the old URLs to the new ones without having to manually add all of them.

> One of the most frequent groans is people manage to get "duplicate content" because there are a plethora of URLs you can use to get to a page

That's actually a misunderstanding on people's part about what "duplicate content" is. Google has no issue with multiple links pointing to the same content as long as it's on the same domain. But as said, it's fine for me to end up where only one valid URL exists for a given page. I just don't agree with the path which is currently taken to get there (because there is no path).

> But short of always supporting routing what are very obviously FUBAR URLs to the right content within our code, there is no fix for that.

There is a way to temporarily keep supporting the "FUBAR URLs", collecting them and leaving it to the admin to decide which to drop and which to keep after the legacy support has been dropped (e.g. in 4.0).

Bakual, Mar 22 '17 12:03

As I see it, there is no problem with Joomla 3.7 as the new router is optional. Can an official Joomla link migrator be developed for Joomla 4? I'm certainly no expert to say if it is possible with our current router mistakes, but if it is possible, could a link migrator automatically create 301 redirects with our component redirect???

peteruoi, Mar 22 '17 12:03

They could not be collected and an admin be told there are URLs not valid with the new system. That's not how it works and trying to do that WOULD be a B/C break. To use com_redirect in that way would require throwing a 404 on what is currently a URL responding with a 200. Or you are suggesting to just automagically dump all valid legacy URLs into com_redirect with zero notification to anyone (which would be a massive change in behavior and user expectation because right now the component only collects 404 URLs or has items that are manually input).

mbabker, Mar 22 '17 12:03

> but if it is possible, could a link migrator automatically create 301 redirects with our component redirect???

That's what I suggested in the initial issue description. But done in 3.7 and not in 4.0.

> which would be a massive change in behavior and user expectation because right now the component only collects 404 URLs or has items that are manually input

Yes, it would be a change in behaviour since we collect the 404s before they happen, at a time where we actually still could say what the correct target is. If you see an issue with that, make it optional. I don't see that as an issue.

Bakual, Mar 22 '17 13:03

The link migrator can't be done, because there isn't a master list anywhere of all the URLs a site is accepting, thanks to the glorious FUBAR behaviors of the current router, which as demonstrated allows you to mutate URLs (or in some cases will build them itself because of the oh so glorious FUBAR routing system), which results in "expected" content being displayed incorrectly. So a collection of bad URLs can only be compiled at runtime. By system behavior, that means the URLs must 404 before they will automatically be collected into the redirect component, or we will be introducing new black magic behaviors into a component with no notification to site owners about this.

mbabker, Mar 22 '17 13:03

sigh...

Bakual, Mar 22 '17 13:03

What you need to do is to check whether the URL could have been created by the system. If someone fooled the system and created a URL that only works because of a bug/simplification, then it is OK for this URL not to work in the new system. That will only be a small number of URLs, but we need a solution for the majority of old URLs for a period of time.

rdeutz, Mar 22 '17 14:03

Sorry, I am not a Joomla coding specialist, but could something like this help to solve the problem: a crawler which automatically crawls all pages to find even those false correct pages, takes the results and produces the correct rewrites?

franzpeter, Mar 22 '17 14:03

The only problem would be how to detect those false correct pages: it would need to crawl twice. First crawl with the standard router, then give the results to the experimental router to route; if that returns a 404, the URL needs a rewrite.

franzpeter, Mar 22 '17 14:03

https://github.com/wilsonge/joomla-cms/tree/com-router-legacy-rule This rule will parse legacy URLs with the new structure (however, it does not validate intermediate segments. This means that /getting-started/19-sample-data-articles/joomla/22-getting-started.html parses, but so does /getting-started/19-sample-data-articles/lalalalalalalal/joomla/22-getting-started.html, which from my discussions with the SEO team was one of Joomla's biggest routing issues from an SEO perspective).

It's only com_content by example and doesn't do the redirect logic - but I'm sure you guys can figure out how to do the redirect logic and whilst each router treats this kind of link specially you can figure out how to make it work :)
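
To make the caveat about intermediate segments concrete, here is a small Python sketch contrasting a lenient parse with a strict one. It is not the code in the branch linked above; `lenient_parse`, `strict_parse` and the `BUILT_BY_LEGACY` table are hypothetical names, and the table simply assumes that the legacy builder emits exactly one path for article 22.

```python
import re
from typing import Optional

# Hypothetical data: the one path the legacy builder emits for article 22.
BUILT_BY_LEGACY = {
    22: "/getting-started/19-sample-data-articles/joomla/22-getting-started.html",
}


def lenient_parse(path: str) -> Optional[int]:
    """Read the article id from the last segment and ignore the rest."""
    last_segment = path.rsplit("/", 1)[-1]
    match = re.match(r"(\d+)-", last_segment)
    if match is None:
        return None
    article_id = int(match.group(1))
    return article_id if article_id in BUILT_BY_LEGACY else None


def strict_parse(path: str) -> Optional[int]:
    """Accept the path only if it equals what the builder would emit."""
    article_id = lenient_parse(path)
    if article_id is None or BUILT_BY_LEGACY[article_id] != path:
        return None
    return article_id
```

Under these assumptions the mutated path containing `lalalalalalalal` is accepted by `lenient_parse` but rejected by `strict_parse`, because the strict variant compares the whole path against what the system itself would have generated.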

wilsonge, Mar 22 '17 22:03

Last night I was thinking about an approach where we would add a temporary argument $forceLegacy to JRouter::parse() which would then override the legacy/experimental parameter setting. With that, we could put code into the redirect plugin which in case of a 404 would try to parse the route again with that $forceLegacy enabled. If that parse results in a valid URL, it would do the redirect and add the entry to the com_redirect table. Next time that URL is called, the regular redirect function would take care of it. We can of course add a new parameter to the plugin to control that behaviour.

This way, the code would be in a central place and no coupling of the component routers to com_redirect.
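
As a rough sketch of that idea, assuming toy routing tables instead of real component routers: the Python below models the 404 handler retrying the parse with a legacy flag and storing the result. `parse`, `build`, `handle_request`, the `force_legacy` and `retry_legacy` arguments and the `redirects` dict (standing in for the com_redirect table) are all hypothetical stand-ins, not existing Joomla code.

```python
from typing import Dict, Optional, Tuple


class NotFound(Exception):
    """Stands in for the 404 raised when a path cannot be parsed."""


# Toy routing tables (hypothetical data): path -> article id.
MODERN_ROUTES = {"/getting-started.html": 22}
LEGACY_ROUTES = {
    "/getting-started.html": 22,
    "/getting-started/19-sample-data-articles/joomla/22-getting-started.html": 22,
}


def parse(path: str, force_legacy: bool = False) -> int:
    """Toy stand-in for a router parse with the proposed extra argument."""
    routes = LEGACY_ROUTES if force_legacy else MODERN_ROUTES
    if path not in routes:
        raise NotFound(path)
    return routes[path]


def build(article_id: int) -> str:
    """Return the URL the new router generates for an article."""
    for path, route_id in MODERN_ROUTES.items():
        if route_id == article_id:
            return path
    raise NotFound(str(article_id))


def handle_request(
    path: str, redirects: Dict[str, str], retry_legacy: bool = True
) -> Tuple[int, Optional[str]]:
    try:
        parse(path)
        return 200, path
    except NotFound:
        pass
    # From here on we are in the 404 handler (the plugin's job in this sketch).
    if path in redirects:
        # Already collected earlier, so the regular redirect applies.
        return 301, redirects[path]
    if not retry_legacy:
        # Plugin parameter switched off: plain 404.
        return 404, None
    try:
        article_id = parse(path, force_legacy=True)
    except NotFound:
        return 404, None
    target = build(article_id)
    redirects[path] = target
    return 301, target
```

In this sketch the legacy retry runs at most once per URL: the first hit stores the mapping, and later hits are answered from `redirects` without touching the legacy rules.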

Bakual, Mar 23 '17 07:03

It's cleaner, but you can't do it as a temporary measure and keep the interface.

wilsonge, Mar 23 '17 08:03

I probably don't understand the sentence. With temporary I mean we could add that argument with 3.7 and deprecate it right away again for 4.0 when the legacy routing is removed (the argument is at least useless at that point).

Bakual, Mar 23 '17 08:03

https://github.com/joomla/joomla-cms/blob/staging/libraries/cms/component/router/interface.php As in, you'd need to break this interface, which would mean extensions couldn't have an implementation that supports J3 and J4 at the same time.

wilsonge, Mar 23 '17 08:03

You can't add it to the interface, that's true. But as far as I know the component routers could have that additional optional parameter both in J3 and J4. It would still satisfy the interface. In J4 it will just be a useless argument which will never be used.

Bakual, Mar 23 '17 10:03

Ahh I didn't think you could. But we're doing that in JTable so I'm wrong. That could work

wilsonge, Mar 23 '17 10:03

Dear colleagues - thanks for raising all these issues. At the code sprint on Monday, the SEO team also met in Amsterdam and discussed the router issues raised here with the developers. We have heard you all and we are reading what you write.

We are currently in the process of creating a document and a video with a project example (which we hope will be done by mid to end next week). We are also going to address how it needs fixing, why it needs fixing and what additional router features we would like to see from a technical SEO perspective.

We hope that everyone can see the good in the community starting this process, and we understand that there is guidance and information required from us.

Please give us the time to provide you with what we feel is needed. Let's all help in moving this forward.

And again: doing SEO for a living, I cannot even begin to tell you how glad I am that we are starting to work on these issues!

Kind regards, Christopher Wagner, Team Lead, Joomla Optimization Team

chriswagner0815, Mar 23 '17 11:03

Closing at this time: https://developer.joomla.org/news/674-statement-about-the-new-router-feature-for-3-7-0.html

brianteeman, Mar 23 '17 20:03