news-please

Merge articles spread across multiple pages

Open fhamborg opened this issue 7 years ago • 2 comments

Example: http://www.zeit.de/2016/18/ttip-barack-obama-hannover-usa-widerstand At this URL, only the first part of the article is shown. A (human) reader can either follow a link to the second page or click "Auf einer Seite lesen" ("Read on one page") to read the whole article on a single page.

What will the current workflow output? Ideally, of course, multiple pages should be identified and crawled as a single article. However, since this requires actually processing the article, I expect the system to crawl this article as two articles. If so, is there an easy way to identify (e.g., during the actual article extraction performed by the km4 team) that two (or more) articles actually belong to a single one?

Answer:

It depends on the crawler:

The sitemap and RSS crawlers only find pages that are listed in the corresponding files. Thus, they only find the listed URLs, which might be the first page, all pages, the entire article, or a combination.

The recursive crawlers, on the other hand, will find all pages as well as the entire article and, if the heuristics work for them, will save all of them.

For the latter, a possible way to identify whether articles belong together is to search for common text parts, since every page should be contained in the entire article.
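A minimal sketch of the common-text idea, assuming the article text has already been extracted as plain strings. The function names, the shingle size, and the 0.8 threshold are illustrative assumptions, not part of news-please:

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(part, whole, size=5):
    """Fraction of the shingles of `part` that also occur in `whole`."""
    part_shingles = shingles(part, size)
    if not part_shingles:
        # Text shorter than one shingle: no evidence either way.
        return 0.0
    return len(part_shingles & shingles(whole, size)) / len(part_shingles)


def is_page_of(part, whole, threshold=0.8, size=5):
    """Guess whether `part` is one page of the article `whole`.

    The threshold is an arbitrary starting point and would need tuning,
    because pages also carry teasers, captions and other boilerplate.
    """
    return containment(part, whole, size) >= threshold
```

Containment is asymmetric on purpose: a single page should be almost fully contained in the one-page version, while the reverse is not true.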

For both, it would be possible to extract URLs with keywords like "continue reading" or "page x" etc.
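The keyword approach could look roughly like the following sketch, which uses only the Python standard library. The keyword list, the class name `AnchorCollector`, and the function name `pagination_links` are assumptions for illustration; real sites would need a longer, language-specific list:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keyword list; "auf einer seite lesen" is the label from the zeit.de example.
PAGINATION_HINT = re.compile(
    r"continue reading|next page|page\s+\d+|auf einer seite lesen",
    re.IGNORECASE,
)


class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so multi-line link text still matches.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def pagination_links(html, base_url):
    """Return absolute URLs of links whose text looks like a pagination hint."""
    collector = AnchorCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)
        for href, text in collector.anchors
        if PAGINATION_HINT.search(text)
    ]
```

This matches on the visible link text only. Matching on the URL itself (for example a trailing page number) would be a separate heuristic.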

fhamborg, Dec 18 '16 17:12

SEO-compliant pages implement link rel="next" and rel="prev", see https://support.google.com/webmasters/answer/1663744
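As a rough sketch of how such markup could be read during extraction, the following collects `<link>` elements with `rel` values `next`, `prev`, and `canonical` from a page's HTML. It uses only the standard library; `RelLinkParser` and `pagination_rels` are hypothetical names, not existing news-please functions:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelLinkParser(HTMLParser):
    """Collect href values of <link> tags whose rel is next, prev or canonical."""

    WANTED = {"next", "prev", "canonical"}

    def __init__(self):
        super().__init__()
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens; keep the first href per token.
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in self.WANTED:
                self.rels.setdefault(rel, href)


def pagination_rels(html, base_url):
    """Return a dict such as {"next": url, "prev": url} with absolute URLs."""
    parser = RelLinkParser()
    parser.feed(html)
    return {rel: urljoin(base_url, href) for rel, href in parser.rels.items()}
```

A page that returns a `next` or `prev` entry is very likely one part of a multi-page article. Pages without this markup need one of the other heuristics.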

fhamborg, Jul 16 '17 12:07

Do nothing. Paginated content is very common, and Google does a good job returning the most relevant results to users, regardless of whether content is divided into multiple pages.
Specify a View All page. Searchers commonly prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel="canonical" link to the component pages to tell Google that the View All version is the version you want to appear in search results.
Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page.
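If the rel="next" target of every crawled page is known, the "logical sequence" described above can be reassembled after crawling. The sketch below assumes a plain dict that maps each page URL to its rel="next" URL (or `None`); both the data structure and the function name `group_into_articles` are assumptions for illustration:

```python
def group_into_articles(next_links):
    """Group page URLs into ordered chains, one chain per article.

    next_links maps a page URL to the URL named by its rel="next" link,
    or to None when the page has no such link.
    """
    # A page that is nobody's "next" target starts a chain.
    targets = {nxt for nxt in next_links.values() if nxt}
    first_pages = [url for url in next_links if url not in targets]

    articles = []
    for first in first_pages:
        chain, seen = [], set()
        url = first
        # The `seen` set guards against malformed markup that loops back.
        while url and url not in seen:
            chain.append(url)
            seen.add(url)
            url = next_links.get(url)
        articles.append(chain)
    return articles
```

Each returned chain lists the pages of one article in reading order, so their texts can be concatenated into a single record. Pages that only form a closed loop have no first page and are not returned by this sketch.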

fhamborg, Oct 23 '17 13:10