Crawl old site to collect URLs and metadata
Description
Use a crawler to find all of the URLs that are linked within the site. This list will be the basis for planning the URL redirects and for QA of those redirects. The crawler should also collect each page's HTML title, meta description, and canonical link tag.
Once the site has been crawled, the list of URLs should be approved by the website owner. It is possible that some content was not found by the crawl because it is not linked from anywhere in the site, but the owner still wants that content to be moved. Crawl the additional content; add the discovered URLs to the crawl list.
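As a rough illustration of the per-page data described above, here is a minimal sketch using only the Python standard library. It parses HTML that has already been fetched and does not fetch anything itself. The names `PageMetadataParser` and `extract_metadata` are hypothetical and not part of any existing tool, and a real crawler would also need fetching, queueing, and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageMetadataParser(HTMLParser):
    """Collects the title, meta description, canonical URL and links of one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.description = None
        self.canonical = None
        self.links = set()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL and drop fragments.
            self.links.add(urljoin(self.page_url, attrs["href"]).split("#")[0])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(page_url, html):
    """Return one crawl-list row for an already-fetched page."""
    parser = PageMetadataParser(page_url)
    parser.feed(html)
    site = urlparse(page_url).netloc
    return {
        "url": page_url,
        "title": parser.title.strip(),
        "description": parser.description,
        "canonical": parser.canonical,
        # Only same-host links are queued for further crawling.
        "internal_links": sorted(
            link for link in parser.links if urlparse(link).netloc == site
        ),
    }
```

Each returned row could be written to a spreadsheet for the owner to review, and `internal_links` could feed the queue of pages still to be crawled.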
Success Criteria
- [ ] Crawl data for pre-migration website
- [ ] Website owner has approved that the list of website URLs is complete
Hi @a-kyne
> all of the URLs that are linked from a site
as in, a) all of the outbound links (to other sites) on the crawled site's pages, or b) all the URLs that the crawled site can serve? We basically have a) available via www-site-checker and b) available via our sitemap generation (which can actually surface standalone/unlinked-to pages, too)
Hi @stevejalim
Updated to clarify, thank you.
The reason for using the spider to crawl the site is to mimic a search engine crawler. If you generate a list of published pages from the server/CMS/other type of tool, I'm not sure that you'd be able to collect the page metadata or to identify that there are pages which are orphaned.
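One possible way to combine the two sources, sketched under the assumption that both a server-generated URL list (for example, from the sitemap) and the spider's crawl list are available as plain lists of URLs: any URL in the first list that the spider never reached is a candidate orphan. The function name `find_orphans` and the trailing-slash normalization are illustrative assumptions, not an existing tool.

```python
def _normalize(url):
    # Treat "/page" and "/page/" as the same URL (an assumption for this sketch).
    return url.rstrip("/")


def find_orphans(sitemap_urls, crawled_urls):
    """Return URLs the server can serve that the link-following crawl never reached."""
    crawled = {_normalize(url) for url in crawled_urls}
    return sorted(url for url in set(sitemap_urls) if _normalize(url) not in crawled)
```

The resulting list could be part of what the website owner reviews when deciding whether unlinked content still needs to be moved.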
Which site is the "old site"? This is in the bedrock repo. Is the old site www.mozilla.org?
The "old site" is the original location of the URLs that are being migrated to a new site.
For example, Firefox URLs on WMO are being moved to a new Firefox domain. We will need to document what the URLs on WMO were so that we can test that the redirects to Firefox work as expected. (One of the common site migration errors is not having a complete record of all the original URLs, which means the redirects cannot be tested.)
The crawl will also tell us where all of the links on WMO that point to the Firefox pages are, so that we can ensure those links are updated to point directly to the new pages on Firefox.com. (If you don't update your own links within your site, you are sending contradictory signals to search engines about whether the move is permanent/intentional.)
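As a sketch of how the crawl data could be used for this, assuming the crawl yields a mapping of each page to the links found on it and the redirect plan is a mapping of old URL to new URL (both data shapes and the name `links_to_update` are hypothetical):

```python
def links_to_update(page_links, redirect_map):
    """List (source page, old link, new target) for every link that points at a moved URL.

    page_links:   {source_page_url: [linked URLs found on that page]}
    redirect_map: {old_url: new_url} from the redirect plan
    """
    stale = []
    for source, links in sorted(page_links.items()):
        for link in links:
            if link in redirect_map:
                stale.append((source, link, redirect_map[link]))
    return stale
```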
Post-migration, we should recrawl not just the old URLs that have been redirected (to test the redirects) but also the rest of the WMO site, so that we can test whether there are bad WMO links (i.e. links that redirect more than once before they reach the new URL).
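A minimal sketch of the "redirects more than once" check, assuming the post-migration recrawl has recorded each observed redirect as a source-to-target pair. The function names and the `max_hops` limit are illustrative assumptions:

```python
def redirect_chain(url, redirects, max_hops=10):
    """Follow url through a {source: target} redirect table; return the URLs visited."""
    chain = [url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        next_url = redirects[chain[-1]]
        looped = next_url in chain
        chain.append(next_url)
        if looped:  # stop on a redirect loop instead of following it again
            break
    return chain


def multi_hop_links(links, redirects):
    """Return {link: chain} for links that redirect more than once."""
    flagged = {}
    for link in links:
        chain = redirect_chain(link, redirects)
        if len(chain) - 1 > 1:  # hops = URLs visited minus the starting URL
            flagged[link] = chain
    return flagged
```

A link with exactly one hop is a working redirect that should still be updated to point directly at the new URL; a link with two or more hops is the "bad link" case described above.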