joomla-cms
New component router can't parse old URLs
Steps to reproduce the issue
- Install Joomla with testing data. Leave "Modern/Experimental" Router disabled.
- Open the "Article Category List" menu item (/article-category-list.html). Leave that page open.
- In a second tab go to the com_content Options in backend and set the "URL Routing" parameter (in "Integration" tab) to "Experimental".
- Go back to the first tab with the open "Article Category List" and try some links there (open them in new tabs). Some of those "legacy" links will work, some (eg "Getting Started") will give you a 404.
Expected result
All links should still be parsed
Actual result
Only the "correct" links according to new router are parsed, the others are discarded.
System information (as much as possible)
Staging from 2017-03-21
Additional comments
The example article link "Getting Started" is currently generated by the legacy router as /getting-started/19-sample-data-articles/joomla/22-getting-started.html. This link is actually wrong and should be /getting-started.html, since it is a direct match with a menu item. Our current code does that wrong and the new router would do it right.
However our current code is able to parse both the wrong and correct URL just fine and give the expected page.
The new router will break those "wrong" links, which IMHO is quite a major B/C break with a big impact on search engines and incoming links in general.
You could argue that it's an option the admin has to enable, but that is only half of the truth. With 4.0, there will be no such option anymore and we will face a lot of broken links after either enabling the router option or upgrading to 4.0. Both of which I think are unacceptable. There is also no real migration path.
Now what I would expect is the following:
- New Router is always enabled
- If a URL can't be parsed by the new router, there is some fallback code which tries the legacy parsing rules
- If legacy can parse the URL successfully, the wrong URL will be added to com_redirect with the new correct URL as target and a 301 redirect will be executed. This way, we don't lose any visitors and search engines will update their links.
- If legacy can't parse it either, a 404 is issued of course.
- With 4.0, we can drop the fallback code if needed (personally I would give more time) or make it optional. Admins can then choose which "old" URLs they still want to be working by simply checking in com_redirect the existing redirects.
This way, we would have 100% B/C plus an easy migration path without losing any incoming traffic.
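The flow proposed in the list above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Joomla code: `FallbackRouter`, the two parser callables, `build_new` and the `redirects` dict (standing in for the com_redirect table) are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Hypothetical type: a parser maps a URL path to a page identifier, or None.
Parser = Callable[[str], Optional[str]]


@dataclass
class FallbackRouter:
    parse_new: Parser                 # strict rules of the new router
    parse_legacy: Parser              # lenient legacy rules, used as fallback only
    build_new: Callable[[str], str]   # builds the correct URL for a page
    redirects: Dict[str, str] = field(default_factory=dict)  # stand-in for com_redirect

    def handle(self, path: str) -> Tuple[int, str]:
        """Return (status, page or redirect target) for an incoming path."""
        page = self.parse_new(path)
        if page is not None:
            return 200, page
        page = self.parse_legacy(path)
        if page is not None:
            # Legacy URL: remember it and send the visitor to the correct URL.
            target = self.build_new(page)
            self.redirects[path] = target
            return 301, target
        return 404, path


# Illustrative data only, based on the URLs quoted in this issue.
OLD_URL = "/getting-started/19-sample-data-articles/joomla/22-getting-started.html"
NEW_ROUTES = {"/getting-started.html": "article:22"}
LEGACY_ROUTES = {OLD_URL: "article:22"}
CANONICAL = {"article:22": "/getting-started.html"}

router = FallbackRouter(
    parse_new=NEW_ROUTES.get,
    parse_legacy=LEGACY_ROUTES.get,
    build_new=CANONICAL.__getitem__,
)
print(router.handle(OLD_URL))
print(router.redirects)
```

The point of the sketch is only the ordering: the legacy rules are consulted after the new router has already failed, so correct URLs never go through the fallback.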
Without that, I think we will face our next big Joomla drama when site owners realise the fancy new router will break some of their external links and Google Webmaster Tools starts listing a lot of 404s.
Isn't that why the new router is not the default and has a warning message? I would say that this is expected behaviour?
It's a known issue and apparently expected behavior by the dev. But not by the user. And it certainly doesn't mean it is the correct thing to do.
Also, with 4.0 there will be no option and no warning anymore. Thus the admin has no real choice. He will have to break the links sooner or later and we don't help him find out which links that may be.
Seriously? That's our plan and expected behaviour? We can do better than that for sure.
For reference, from https://developer.joomla.org/development-strategy.html#backward_compatibility
6.1.8 URLs
Any change to a URL that will give a 404 (or some other error) where it previously gave a 200 is a break in backwards compatibility. However, if the change results in a redirect to a new URL (which gives a 200) then that is acceptable.
In general, if a URL is changed then provided the new URL delivers the exact same resource rendered in the same way then that is not considered to be a break in backwards compatibility. For example, changing the order of the arguments in the query part of a URL is not considered to be a break.
Any change to a URL that will give a 404 (or some other error) where it previously gave a 200 is a break in backwards compatibility
Not if the old URL was falsely 200, e.g. @Bakual 's example in the description. That's not a valid URL, current router returns something valid which IS wrong! This wrong behaviour CANNOT be supported, that was one of the goals of the new router: to be a lot stricter than the loose one we currently have! My 2c
Not if the old URL was falsely 200, e.g. @Bakual 's example in the description. That's not a valid URL, current router returns something valid which IS wrong!
That statement isn't true. It wasn't falsely a 200. It was a valid URL generated by our current router and is correctly parsed and gives the expected result. So it is not the URL it should have generated but it is a valid URL. The current router doesn't return "something". It returns the correct and expected page.
This wrong behaviour CANNOT be supported, that was one of the goals of the new router: to be a lot stricter than the loose one we currently have!
I can live with that as the end goal (although I think it's stupid since site owners prefer visitors and not 404s), but I don't agree with doing that without any possibility for site owners to mitigate the effects of it.
Educate people and then they will be fine. Tell them to create a sitemap of the old site, create another one when they upgrade to the new system. Then explain to them how to connect the dots (map the old links to the new). The tools are widely available...
Similar to this ~~problem~~ is the UX improvement task of the back end. If we really want to improve (and not change some colours or some paddings) then we will end up with different workflows (that end users can't even imagine, therefore user surveys are useless). But then again I might be wrong on both, time will tell...
Educate people and then they will be fine. Tell them to create a sitemap of the old site, create another one when they upgrade to the new system. Then explain to them how to connect the dots (map the old links to the new). The tools are widely available...
Seriously??!! That's the recommended solution? Wow...
Backend is another topic. Changing workflows is fine if it is an improvement. That's not similar at all.
Have to agree that your suggestion is not a solution at all. It might be just about OK on a site with just a few pages (although that site probably won't be affected anyway), but it's completely impractical to suggest doing that on a site with even a few hundred pages, never mind one with thousands.
@brianteeman I'm guessing here that anyone that wants to move to the new router (is not forced to do so) understands the impact of that change.
(is not forced to do so)
We will be forcing it with 4.0. It's not optional at all.
Any plan which mandates that the current broken URLs accepted by the routing system keep working is in my eyes not a valid plan. By that logic I can craft the URL of https://www.joomla.org/announcements/6-joomla-leadership-team.html which results in a 200 response, gives me exactly the body content that I'm looking for (even if it is now wrapped in the incorrect category/menu configuration), and therefore by your argument must continue to work or automagically redirect. Even funnier is this isn't a URL that will ever get generated within the Joomla application but if you know anything about how wonky the current router is you know exactly how to craft URLs in such a way to get mixed pages like this which just work.
Sooner or later we have to cut the technical debt and we have to address some of the underlying issues users have with the routing system. One of the most frequent groans is people manage to get "duplicate content" because there are a plethora of URLs you can use to get to a page if you know what you're doing (https://www.joomla.org/component/content/category/6-joomla-leadership-team.html is another perfectly valid mutation of the leadership page but again wrapped with the wrong menu data). We need to stop having a system that allows you to mutate the URL structure and land on a valid page, this system moves in that direction.
Yes, it does mean that users will require additional education and additional work to validate their links. Yes, I get this is not optimal user experience. But short of always supporting routing what are very obviously FUBAR URLs to the right content within our code, there is no fix for that.
For the record, I have absolutely no issue with making URLs that "work" today but cannot be "generated" by Joomla no longer work.
@mbabker Michael, I'm not saying to keep the old URLs working forever. I just want to have a way site admins realistically can redirect the old URLs to the new ones without having to manually add all of them.
One of the most frequent groans is people manage to get "duplicate content" because there are a plethora of URLs you can use to get to a page
That's actually a misunderstanding from the people about what "duplicate content" is. Google has no issue with multiple links pointing to the same content as long as it's on the same domain. But as said, it's fine for me to get there where only one valid URL exists for a given page. I just don't agree on the path which is currently taken to get there (because there is no path).
But short of always supporting routing what are very obviously FUBAR URLs to the right content within our code, there is no fix for that.
There is a way to temporarily keep supporting the "FUBAR URLs", collect them, and leave it to the admin to decide which to drop and which to keep after the legacy support has been dropped (e.g. in 4.0).
As I see it there is no problem with Joomla 3.7 as the new router is optional. Can an official Joomla link migrator be developed for Joomla 4? I'm certainly no expert to say if it is possible with our current router mistakes, but if it is possible, could a link migrator automatically create 301 redirects with our component redirect???
They could not be collected and an admin be told there are URLs not valid with the new system. That's not how it works and trying to do that WOULD be a B/C break. To use com_redirect in that way would require throwing a 404 on what is currently a URL responding with a 200. Or you are suggesting to just automagically dump all valid legacy URLs into com_redirect with zero notification to anyone (which would be a massive change in behavior and user expectation because right now the component only collects 404 URLs or has items that are manually input).
but if it is possible, could a link migrator automatically create 301 redirects with our component redirect???
That's what I suggested in the initial issue description. But done in 3.7 and not in 4.0.
which would be a massive change in behavior and user expectation because right now the component only collects 404 URLs or has items that are manually input
Yes, it would be a change in behaviour since we collect the 404s before they happen, at a time where we actually still could say what the correct target is. If you see an issue with that, make it optional. I don't see that as an issue.
The link migrator can't be done. Because there isn't a master list of all the URLs a site is accepting anywhere thanks to the glorious FUBAR behaviors of the current router, which as demonstrated allows you to mutate URLs (or in some cases will build them itself because of the oh so glorious FUBAR routing system) which results in "expected" content being displayed incorrectly. So a collection of bad URLs can only be compiled at runtime. Which by system behavior means that the URLs must 404 before they will automatically be collected into the redirect component or we will be introducing new black magic behaviors into a component and no notification to site owners about this.
sigh...
What you need to do is to check if the URL could be created by the system. If someone fooled the system and created a URL that works because of a bug/simplification then it is OK when this URL doesn't work in the new system. This will only be a small number, but we need a solution for the majority of old URLs for a period of time.
Sorry, I am not a coding specialist in case of Joomla, but could something like that help to solve the problem: a crawler, which automatically crawls all pages to get even those false correct pages, take the results and gives the correct rewrites?
The only problem would be how to detect those false correct pages, it would need to crawl twice. First with the standard router, give the experimental router the results to try to route and if 404, detect that it needs a rewrite.
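The two-pass idea above could look roughly like this. It is a minimal sketch in Python with made-up names (`find_rewrites` and its callable arguments are hypothetical, and no real crawler or Joomla API is involved): the list of crawled URLs is assumed to come from a crawl done while the legacy router is active.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical type: a parser maps a URL path to a page identifier, or None.
Parser = Callable[[str], Optional[str]]


def find_rewrites(
    crawled: Iterable[str],
    parse_legacy: Parser,
    parse_new: Parser,
    build_new: Callable[[str], str],
) -> Dict[str, str]:
    """Map every crawled URL the new rules reject to its correct URL."""
    rewrites: Dict[str, str] = {}
    for url in crawled:
        if parse_new(url) is not None:
            continue  # still valid under the new router, nothing to do
        page = parse_legacy(url)
        if page is not None:
            rewrites[url] = build_new(page)
    return rewrites


# Illustrative data only, based on the URLs quoted in this issue.
OLD_URL = "/getting-started/19-sample-data-articles/joomla/22-getting-started.html"
NEW_ROUTES = {"/getting-started.html": "article:22"}
LEGACY_ROUTES = {OLD_URL: "article:22"}
CANONICAL = {"article:22": "/getting-started.html"}

rewrites = find_rewrites(
    [OLD_URL, "/getting-started.html"],
    LEGACY_ROUTES.get,
    NEW_ROUTES.get,
    CANONICAL.__getitem__,
)
print(rewrites)
```

A crawl can only find URLs that the site itself links to, so hand-crafted or externally mutated URLs of the kind discussed earlier in this thread would not show up in such a list.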
https://github.com/wilsonge/joomla-cms/tree/com-router-legacy-rule This rule will parse legacy URLs with the new structure. However, it does not validate intermediate segments: /getting-started/19-sample-data-articles/joomla/22-getting-started.html parses, but so does /getting-started/19-sample-data-articles/lalalalalalalal/joomla/22-getting-started.html, which from my discussions with the SEO team was one of Joomla's biggest routing issues from an SEO perspective.
It only covers com_content as an example and doesn't do the redirect logic, but I'm sure you guys can figure out how to do the redirect logic, and whilst each router treats this kind of link specially you can figure out how to make it work :)
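To make the difference between the two behaviours concrete, here is a small Python sketch. It does not reproduce the linked branch: `parse_lenient`, `parse_strict` and the `EXPECTED` table are hypothetical, and the assumption that the article ID is the numeric prefix of the last segment is taken only from the example URLs above.

```python
import re
from typing import Dict, List, Optional

ID_PREFIX = re.compile(r"^(\d+)-")


def segments(path: str) -> List[str]:
    """Split a path into segments, dropping the .html suffix."""
    trimmed = path.strip("/")
    if trimmed.endswith(".html"):
        trimmed = trimmed[: -len(".html")]
    return trimmed.split("/")


def parse_lenient(path: str) -> Optional[int]:
    """Legacy-style: read the ID from the last segment, ignore the rest."""
    match = ID_PREFIX.match(segments(path)[-1])
    return int(match.group(1)) if match else None


# Hypothetical table of the segments the system would generate per article.
EXPECTED: Dict[int, List[str]] = {
    22: ["getting-started", "19-sample-data-articles", "joomla", "22-getting-started"],
}


def parse_strict(path: str) -> Optional[int]:
    """Accept the path only if every segment matches the generated one."""
    article_id = parse_lenient(path)
    if article_id is None or EXPECTED.get(article_id) != segments(path):
        return None
    return article_id


GENERATED = "/getting-started/19-sample-data-articles/joomla/22-getting-started.html"
MUTATED = "/getting-started/19-sample-data-articles/lalalalalalalal/joomla/22-getting-started.html"

print(parse_lenient(GENERATED), parse_lenient(MUTATED))
print(parse_strict(GENERATED), parse_strict(MUTATED))
```

The lenient variant accepts both URLs because only the last segment matters to it, while the strict variant rejects the mutated one.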
Last night I was thinking about an approach where we would add a temporary argument $forceLegacy to JRouter::parse() which would then override the legacy/experimental parameter setting.
With that, we could put code into the redirect plugin which in case of a 404 would try to parse the route again with that $forceLegacy enabled. If that parse results in a valid URL, it would do the redirect and add the entry to the com_redirect table. Next time that URL is called, the regular redirect function would take care of it.
We can of course add a new parameter to the plugin to control that behaviour.
This way, the code would be in a central place and no coupling of the component routers to com_redirect.
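A rough sketch of that idea, written in Python rather than PHP to keep it language-neutral. Everything here is hypothetical: `parse`, its `force_legacy` argument, `RedirectPlugin`, `on_not_found` and the `auto_legacy` plugin parameter are stand-ins for illustration and do not mirror the actual JRouter or redirect plugin code.

```python
from typing import Dict, Optional, Tuple


class NotFound(Exception):
    """Raised when no routing rule matches the path."""


# Illustrative data only, based on the URLs quoted in this issue.
OLD_URL = "/getting-started/19-sample-data-articles/joomla/22-getting-started.html"
NEW_ROUTES = {"/getting-started.html": "article:22"}
LEGACY_ROUTES = {OLD_URL: "article:22"}
CANONICAL = {"article:22": "/getting-started.html"}


def parse(path: str, force_legacy: bool = False) -> str:
    """Parse a path; force_legacy overrides the configured (new) rules."""
    routes = LEGACY_ROUTES if force_legacy else NEW_ROUTES
    try:
        return routes[path]
    except KeyError:
        raise NotFound(path) from None


class RedirectPlugin:
    def __init__(self, auto_legacy: bool = True) -> None:
        self.auto_legacy = auto_legacy   # hypothetical plugin parameter
        self.table: Dict[str, str] = {}  # stand-in for the com_redirect table

    def on_not_found(self, path: str) -> Tuple[int, Optional[str]]:
        """Called after the regular parse has failed with a 404."""
        if path in self.table:
            return 301, self.table[path]  # regular redirect handling
        if self.auto_legacy:
            try:
                page = parse(path, force_legacy=True)
            except NotFound:
                return 404, None
            self.table[path] = CANONICAL[page]
            return 301, self.table[path]
        return 404, None


plugin = RedirectPlugin()
print(plugin.on_not_found(OLD_URL))
print(plugin.table)
```

In this shape the component routers know nothing about redirects; only the plugin retries the parse and writes the entry.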
It's cleaner but you can't do it as a temporary measure and keep the interface
I probably don't understand the sentence. By temporary I mean we could add that argument with 3.7 and deprecate it right away for 4.0, when the legacy routing is removed (the argument is useless at that point).
https://github.com/joomla/joomla-cms/blob/staging/libraries/cms/component/router/interface.php As in you'd need to break this interface. Which would mean extensions couldn't have an implementation that supports J3 and J4 at the same time
You can't add it to the interface, that's true. But as far as I know the component routers could have that additional optional parameter both in J3 and J4. It would still satisfy the interface. In J4 it will just be a useless argument which will never be passed.
Ahh I didn't think you could. But we're doing that in JTable so I'm wrong. That could work
Dear colleagues - thanks for raising all these issues. At the code sprint on Monday, the SEO team also met in Amsterdam and discussed the router issues raised with developers. We have heard you all and we are reading what you write.
We are currently in the process of creating a document and a video with a project example (which we hope will be done by mid to end next week). We are also going to address how it needs fixing, why it needs fixing and what additional router features we would like to see from a technical SEO perspective.
We hope that everyone can see the good in the community starting this process, and we understand that guidance and information are required from us.
Please give us the time to provide you with what we feel is needed. Let's all help in moving this forward.
And again: doing SEO for a living, I cannot even begin to tell you how glad I am that we are starting to work on these issues!
Kind regards Christopher Wagner Team Lead Joomla Optimization Team
Closing at this time https://developer.joomla.org/news/674-statement-about-the-new-router-feature-for-3-7-0.html